Graduate Student: Tseng, Wen-Hai (曾文海)
Thesis Title: PTVM: A Robust Multi-Object Tracker for Dynamic Frame Rates via Point Tracking and Visual Matching (PTVM:結合點追蹤與視覺匹配之動態幀率穩健多目標追蹤器)
Advisors: Hsu, Chih-Chung (許志仲); Jeng, Shuen-Lin (鄭順林)
Degree: Master
Department: Institute of Data Science, College of Management
Year of Publication: 2025
Graduation Academic Year: 113 (2024-2025)
Language: Chinese
Number of Pages: 87
Keywords: Multi-Object Tracking, Object Detection, Re-Identification (Re-ID), Tracking Any Point
    Multi-object tracking (MOT) remains challenging in real-world scenarios, especially when unstable frame rates cause large object displacements, erratic motion trajectories, and drastic appearance changes. Traditional detection-based tracking methods typically rely on high frame rates and temporal consistency, so their performance degrades significantly under these conditions. To address this, we propose PTVM, a tracking framework that combines tracking-any-point with visual feature matching.

    Our method uses TAPIR's two-stage point tracking to improve geometric consistency, maintaining stable tracking even under fast motion or occlusion. In parallel, the image encoder of CLIP extracts high-level visual semantic features, strengthening the ability to distinguish targets with similar appearance.

    To adapt to frame-rate fluctuations, we propose a fused matching cost function. Unlike earlier approaches that pick whichever single cue seems better, we combine the predicted geometric distance with visual similarity, exploiting geometric information alongside visual features instead of relying on Re-ID appearance matching alone. We conduct experiments on MOT17, MOT20, KITTI, and the 2024 AICUP Chung Cheng vehicle tracking dataset. The results show that under normal frame rates, PTVM (Point Tracking and Visual Matching) performs comparably to traditional methods in identity stability and trajectory continuity, and that under unstable frame rates it is markedly more stable.

    Furthermore, PTVM adopts a decoupled modular design with three advantages that let it generalize to different datasets without retraining: first, the point tracking module operates independently and is robust to large displacements and irregular motion trajectories; second, the CLIP image features provide cross-scene semantic representations; third, the matching strategy fuses spatial and visual information, adaptively associating targets across diverse scenes for stable and efficient inference.

    Multi-object tracking (MOT) remains a challenging task in real-world scenarios, especially under conditions of unstable frame rates, large object displacements, irregular motion trajectories, and drastic appearance changes. Traditional detection-based tracking approaches typically rely on high frame rates and temporal consistency, leading to significant performance degradation under these conditions. To address this issue, we propose PTVM, a tracking framework that integrates arbitrary point tracking with visual feature matching.

    The proposed method employs TAPIR's two-stage point tracking to enhance geometric consistency, ensuring stable tracking even under rapid motion or occlusion. In parallel, high-level visual features are extracted using the image encoder of CLIP, thereby improving the model's capability to distinguish between visually similar objects.
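
    The thesis full text is not yet publicly downloadable, so the following is only a minimal sketch of the CLIP appearance step described above: each detection crop is encoded with a CLIP image encoder and L2-normalized so that dot products give cosine similarities. The checkpoint name, the preprocessing, and the helper embed_crops are illustrative assumptions, not the authors' implementation.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Checkpoint choice is an assumption; the thesis does not state which CLIP variant it uses.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def embed_crops(crops: list[Image.Image]) -> torch.Tensor:
        """Return one L2-normalized CLIP image embedding per detection crop."""
        inputs = processor(images=crops, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        return feats / feats.norm(dim=-1, keepdim=True)

    # Because rows are unit-norm, the visual affinity between track templates and
    # current detections is a plain matrix product: vis_sim = track_feats @ det_feats.T

    Keeping one such embedding per track as an appearance template is one simple way to supply the visual-similarity term used in the matching step below.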

    To accommodate fluctuating frame rates, we introduce a novel fused matching cost function. Unlike traditional approaches that prioritize either motion prediction or visual similarity, our method combines predicted geometric distance and visual similarity. This allows for a more balanced and robust matching strategy that does not rely solely on ReID-based appearance matching. We evaluate our method on several benchmarks, including MOT17, MOT20, KITTI, and the 2024 AICUP vehicle tracking dataset. Experimental results demonstrate that under normal frame rate conditions, PTVM (Point Tracking and Visual Matching) achieves competitive performance in terms of identity stability and trajectory continuity. More importantly, in low or unstable frame rate scenarios, PTVM significantly outperforms traditional methods in maintaining tracking stability.
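
    To make the fused cost concrete, here is a minimal sketch assuming a simple convex combination of a normalized point-tracking distance and a CLIP-similarity cost, solved with the Hungarian algorithm [18]; the weight alpha, the [0, 1] normalization, and the gating threshold are assumptions rather than values taken from the thesis.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def fused_cost(geo_dist: np.ndarray, vis_sim: np.ndarray, alpha: float = 0.5) -> np.ndarray:
        """Convex combination of geometric and appearance costs.

        geo_dist: (T, D) point-tracking distances, pre-normalized to [0, 1].
        vis_sim:  (T, D) CLIP cosine similarities in [-1, 1].
        """
        vis_cost = (1.0 - vis_sim) / 2.0  # map similarity in [-1, 1] to a cost in [0, 1]
        return alpha * geo_dist + (1.0 - alpha) * vis_cost

    # Toy example: 3 existing tracks, 4 new detections.
    rng = np.random.default_rng(0)
    geo_dist = rng.random((3, 4))
    vis_sim = rng.uniform(-1.0, 1.0, size=(3, 4))

    cost = fused_cost(geo_dist, vis_sim)
    rows, cols = linear_sum_assignment(cost)  # Hungarian assignment over tracks x detections
    GATE = 0.7                                # assumed gating threshold
    matches = [(t, d) for t, d in zip(rows, cols) if cost[t, d] < GATE]

    Because both terms always contribute to the cost, a target that jumps far between frames can still be recovered through appearance, and visually ambiguous targets can still be separated geometrically, which is the robustness argument made above.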

    Furthermore, PTVM adopts a modular and decoupled architecture with three key advantages that enable generalization across different datasets without the need for retraining: (1) the point tracking module operates independently and is resilient to large displacements and irregular motion patterns; (2) CLIP-based visual feature extraction provides rich representations with strong cross-domain transferability; and (3) the matching strategy fuses spatial and visual cues, allowing the system to adaptively associate objects across diverse scenes, resulting in stable and efficient inference.

    Table of Contents:
    Chinese Abstract I
    Abstract III
    Acknowledgements XII
    Table of Contents XIII
    List of Tables XVI
    List of Figures XVIII
    Chapter 1 Introduction 1
    1-1 Motivation 1
    1-2 Thesis Structure 2
    1-3 Contributions 3
    Chapter 2 Related Work 4
    2-1 Tracking-by-Detection Methods and Their Evolution 4
    2-1.1 Reference Implementations 5
    2-1.2 The Role and Operation of the Kalman Filter in MOT 6
    2-2 Transformer-Based End-to-End Methods 8
    2-3 Point Tracking Techniques 8
    2-4 Re-identification 9
    Chapter 3 Methodology 10
    3-1 Methodology 10
    3-1.1 Overall Pipeline 10
    3-2 TAPIR Two-Stage Point Tracking 12
    3-2.1 Comparison of TAPIR and Kalman Filter Tracking Performance 15
    3-2.2 Necessity of the Point Tracking Module in the PTVM Architecture 16
    3-2.3 Comparison of Point Tracking and Object Tracking on the AICUP Dataset 17
    3-2.4 Performance Comparison of Point Tracking and Object Tracking across Object Sizes 18
    3-3 CLIP Feature Extraction and Feature Similarity Matching 19
    3-4 Distance Fusion Method 22
    3-5 Matching and Identity Association Strategy 24
    Chapter 4 Experimental Results 27
    4-1 Experimental Results 27
    4-1.1 Experimental Setup 27
    4-1.2 Tracking Evaluation Metrics 28
    4-1.3 Visualization and Analysis of IoU Distributions 31
    4-1.4 Combined Analysis of Tracking Metrics on MOT17 and MOT20 37
    4-1.5 Combined Analysis of Tracking Metrics on KITTI 43
    4-1.6 Combined Analysis of Tracking Metrics on the 2024 AICUP Chung Cheng Vehicle Tracking Dataset 47
    4-1.7 Visualization of Multi-Object Tracking on Dynamic Frame Rate Videos 49
    Chapter 5 Conclusion 58
    5-1 Conclusion 58
    Chapter 6 Future Work 60
    6-1 Future Work 60
    References 62

    [1] Nir Aharon, Roy Orfaig, and Ben-Zion Bobrovsky. BoT-SORT: Robust associations multi-pedestrian tracking, 2022.

    [2] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP). IEEE, September 2016.

    [3] Robert Grover Brown and Patrick Y. C. Hwang. Introduction to Random Signals and Applied Kalman Filtering: With MATLAB Exercises and Solutions, 3rd ed. Wiley, New York, NY, 1997.

    [4] Jinkun Cao, Jiangmiao Pang, Xinshuo Weng, Rawal Khirodkar, and Kris Kitani. Observation-centric SORT: Rethinking SORT for robust multi-object tracking, 2023.

    [5] Peng Chu, Jiang Wang, Quanzeng You, Haibin Ling, and Zicheng Liu. TransMOT: Spatial-temporal graph transformer for multiple object tracking, 2021.

    [6] MOE AI Competition and Labeled Data Acquisition Project. AICUP, 2024.

    [7] Patrick Dendorfer, Aljoša Ošep, Anton Milan, Konrad Schindler, Daniel Cremers, Ian Reid, Stefan Roth, and Laura Leal-Taixé. MOTChallenge: A benchmark for single-camera multiple target tracking, 2020.

    [8] Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid, Stefan Roth, Konrad Schindler, and Laura Leal-Taixé. MOT20: A benchmark for multi-object tracking in crowded scenes, 2020.

    [9] Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. TAPIR: Tracking any point with per-frame initialization and temporal refinement, 2023.

    [10] Alexey Dosovitskiy et al. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.

    [11] Yunhao Du et al. StrongSORT: Make DeepSORT great again, 2023.

    [12] Andreas Ess, Bastian Leibe, and Luc Van Gool. Depth and appearance for mobile scene analysis. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8, 2007.

    [13] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.

    [14] Zheng Ge et al. YOLOX: Exceeding YOLO series in 2021, 2021.

    [15] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.

    [16] Adam W. Harley, Zhaoyuan Fang, and Katerina Fragkiadaki. Particle video revisited: Tracking through occlusions using point trajectories, 2022.

    [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.

    [18] Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2):83–97, March 1955.

    [19] Laura Leal-Taixé et al. MOTChallenge 2015: Towards a benchmark for multi-target tracking, 2015.

    [20] Ji Lin, Chuang Gan, Kuan Wang, and Song Han. TSM: Temporal shift module for efficient and scalable video understanding on edge devices, 2021.

    [21] Tsung-Yi Lin et al. Microsoft COCO: Common objects in context, 2015.

    [22] Long Chen, Haizhou Ai, Zijie Zhuang, and Chong Shang. Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In ICME, 2018.

    [23] Jonathon Luiten et al. HOTA: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision, 129(2):548–578, October 2020.

    [24] Tim Meinhardt et al. TrackFormer: Multi-object tracking with transformers, 2022.

    [25] Anton Milan et al. MOT16: A benchmark for multi-object tracking, 2016.

    [26] Alec Radford et al. Learning transferable visual models from natural language supervision, 2021.

    [27] Shaoqing Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks, 2016.

    [28] Olga Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge, 2015.

    [29] Ilya Tolstikhin et al. MLP-Mixer: An all-MLP architecture for vision, 2021.

    [30] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric, 2017.

    [31] Yifu Zhang et al. ByteTrack: Multi-object tracking by associating every detection box, 2022.

    [32] Yifu Zhang et al. FairMOT: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 129(11):3069–3087, September 2021.

    [33] Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Tracking objects as points, 2020.

    Full-text availability: on campus from 2026-02-12; off campus from 2026-02-12. The electronic thesis has not yet been authorized for public release; for the print copy, consult the library catalog.