| Graduate Student: | 曾文海 Tseng, Wen-Hai |
|---|---|
| Thesis Title: | PTVM:結合點追蹤與視覺匹配之動態幀率穩健多目標追蹤器 (PTVM: A Robust Multi-Object Tracker for Dynamic Frame Rates via Point Tracking and Visual Matching) |
| Advisors: | 許志仲 Hsu, Chih-Chung; 鄭順林 Jeng, Shuen-Lin |
| Degree: | Master |
| Department: | College of Management - Institute of Data Science |
| Year of Publication: | 2025 |
| Academic Year of Graduation: | 113 |
| Language: | Chinese |
| Number of Pages: | 87 |
| Chinese Keywords: | 多目標追蹤、物體偵測、重新識別(Re-ID)、追蹤任意點 |
| English Keywords: | Multi-Object Tracking, Object Detection, Re-Identification (Re-ID), Tracking Any Point |
Multi-object tracking (MOT) remains challenging in real-world scenarios, especially when unstable frame rates lead to large object displacements, erratic motion trajectories, and drastic appearance changes. Conventional detection-based tracking methods typically rely on high frame rates and temporal consistency, so their performance drops sharply under these conditions. To address this problem, we propose PTVM, a tracking framework that combines tracking-any-point with visual feature matching.
The method uses TAPIR's two-stage point tracking to improve geometric consistency, keeping tracks stable even under fast motion or occlusion. In parallel, the CLIP image encoder extracts high-level visual semantic features, strengthening the ability to distinguish targets with similar appearance.
To adapt to frame-rate fluctuations, we propose a fused matching cost function. Unlike earlier approaches that pick one cue over the other, it combines predicted geometric distance with visual similarity, exploiting geometric information together with visual features instead of relying solely on ReID-style appearance matching. We run experiments on MOT17, MOT20, KITTI, and the 2024 AICUP vehicle tracking dataset. The results show that under normal frame rates, PTVM (Point Tracking and Visual Matching) matches conventional methods in identity stability and trajectory continuity, while in unstable frame-rate scenarios it is noticeably more stable.
In addition, PTVM adopts a decoupled modular design whose three advantages let it generalize to different datasets without retraining: first, the point-tracking module runs independently and is robust to large displacements and irregular motion trajectories; second, the CLIP image features provide cross-scene semantic representations; third, the matching strategy fuses spatial and visual information and adapts target association to different scenes, enabling stable and efficient inference.
Multi-object tracking (MOT) remains a challenging task in real-world scenarios, especially under conditions of unstable frame rates, large object displacements, irregular motion trajectories, and drastic appearance changes. Traditional detection-based tracking approaches typically rely on high frame rates and temporal consistency, leading to significant performance degradation under these conditions. To address this issue, we propose PTVM, a tracking framework that integrates arbitrary point tracking with visual feature matching.
The proposed method employs TAPIR's two-stage point tracking to enhance geometric consistency, ensuring stable tracking even under rapid motion or occlusion. In parallel, high-level visual features are extracted using the image encoder of CLIP, thereby improving the model's capability to distinguish between visually similar objects.
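As a rough sketch of the appearance branch, the snippet below extracts unit-normalized CLIP image-encoder embeddings for a batch of detection crops; the checkpoint name and the `embed_detections` helper are illustrative assumptions rather than the thesis's released code.

```python
# Minimal sketch of the CLIP appearance branch (assumptions: a HuggingFace CLIP
# checkpoint and per-detection crops given as PIL images; not the thesis's exact code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_detections(crops: list[Image.Image]) -> torch.Tensor:
    """Return one L2-normalized CLIP embedding per detection crop."""
    inputs = processor(images=crops, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)   # (N, 512) for ViT-B/32
    return feats / feats.norm(dim=-1, keepdim=True)  # unit norm, so cosine similarity is a dot product
```

Normalizing the embeddings lets the later cost fusion compute cosine similarity as a plain matrix product.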
To accommodate fluctuating frame rates, we introduce a novel fused matching cost function. Unlike traditional approaches that prioritize either motion prediction or visual similarity, our method combines predicted geometric distance and visual similarity. This allows for a more balanced and robust matching strategy that does not rely solely on ReID-based appearance matching. We evaluate our method on several benchmarks, including MOT17, MOT20, KITTI, and the 2024 AICUP vehicle tracking dataset. Experimental results demonstrate that under normal frame rate conditions, PTVM (Point Tracking and Visual Matching) achieves competitive performance in terms of identity stability and trajectory continuity. More importantly, in low or unstable frame rate scenarios, PTVM significantly outperforms traditional methods in maintaining tracking stability.
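The fused cost can be pictured as a weighted sum of a normalized geometric distance and a CLIP cosine distance, followed by Hungarian assignment. The sketch below is a minimal illustration under assumed names and values (`fused_cost`, `alpha`, `dist_scale`); the thesis's exact weighting and gating are not reproduced here.

```python
# Sketch of a fused matching cost and the assignment step (illustrative weights,
# not the thesis's exact formulation).
import numpy as np
from scipy.optimize import linear_sum_assignment

def fused_cost(track_pts, det_pts, track_emb, det_emb, alpha=0.5, dist_scale=100.0):
    """track_pts: (T, 2) positions predicted by point tracking; det_pts: (D, 2) detection centers;
    track_emb / det_emb: (T, C) / (D, C) unit-norm CLIP embeddings."""
    # Geometric term: normalized Euclidean distance between predicted and detected positions.
    geo = np.linalg.norm(track_pts[:, None, :] - det_pts[None, :, :], axis=-1) / dist_scale
    # Appearance term: cosine distance between CLIP embeddings.
    app = 1.0 - track_emb @ det_emb.T
    return alpha * geo + (1.0 - alpha) * app

def associate(cost, max_cost=1.0):
    """Hungarian matching on the fused cost; reject pairs whose cost exceeds `max_cost`."""
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
```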
Furthermore, PTVM adopts a modular and decoupled architecture with three key advantages that enable generalization across different datasets without the need for retraining: (1) the point tracking module operates independently and is resilient to large displacements and irregular motion patterns; (2) CLIP-based visual feature extraction provides rich representations with strong cross-domain transferability; and (3) the matching strategy fuses spatial and visual cues, allowing the system to adaptively associate objects across diverse scenes, resulting in stable and efficient inference.
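To make the decoupled design concrete, a schematic composition of the three modules at inference time might look like the following; `PTVMTracker` and its component interfaces are placeholders for the TAPIR wrapper, the detector, and the CLIP branch, not the actual implementation.

```python
# Schematic composition of the decoupled modules (placeholder interfaces,
# not the released implementation).
class PTVMTracker:
    def __init__(self, point_tracker, detector, embedder, matcher):
        self.point_tracker = point_tracker  # geometric cue: TAPIR-style point propagation
        self.detector = detector            # per-frame object detector
        self.embedder = embedder            # CLIP appearance features (see embed_detections above)
        self.matcher = matcher              # fused-cost association (see fused_cost / associate above)

    def step(self, frame, tracks):
        dets = self.detector(frame)                             # detect objects in the new frame
        det_emb = self.embedder([d.crop for d in dets])         # appearance embedding per detection
        pred_pts = self.point_tracker.propagate(tracks, frame)  # predicted positions of existing tracks
        return self.matcher(pred_pts, dets, det_emb, tracks)    # (track, detection) pairs for the update
```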
On-campus access: available from 2026-02-12.