| Graduate Student: | 吳振瑋 (Wu, Jhen-Wei) |
|---|---|
| Thesis Title: | Alleviating MOT Feature Conflicts through Attention Decoupling Neural Networks and Discrepancy Learning |
| Advisor: | 蔡家齊 (Tsai, Chia-Chi) |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of Publication: | 2024 |
| Graduation Academic Year: | 112 |
| Language: | English |
| Pages: | 84 |
| Chinese Keywords: | 多目標追蹤, 機器學習, 電腦視覺 |
| English Keywords: | Multi-Object Tracking, machine learning, computer vision |
Owing to the success of multi-task learning, joint detection and tracking methods, also known as one-shot MOT methods, have garnered significant attention. These methods offer impressive speed/accuracy trade-offs. Many researchers nevertheless prefer joint detection and embedding-based methods, because re-identification (ReID) typically achieves better tracking performance. Several issues are often overlooked, however, leaving joint detection and embedding methods underperforming the tracking-by-detection paradigm. We posit that this underperformance stems from three factors: 1) MOT is fundamentally constrained by data association methods that inherently favor the detection task. 2) MOT datasets are small compared to other large-scale datasets, often necessitating additional datasets for training; these supplementary datasets frequently have too few positive samples and lack temporally continuous information. 3) The intrinsic ambiguity in learning jointly between the detection and ReID tasks.
Therefore, we utilize multi-input images and appropriate data augmentation to enable our network to effectively learn temporal relationships, enhancing the robustness of our tracking. Additionally, we employ an Attention Decoupled Network to address the fundamental ambiguity in learning between the two tasks.
Ultimately, through ablation experiments, we demonstrate the effectiveness of each of our methods, and our approach shows competitive performance on the MOT17 and MOT20 test sets.
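As a rough illustration of the attention-decoupling idea summarized above (not the thesis's actual architecture, which is not reproduced here), the sketch below gives each task its own channel-attention gate over a shared backbone feature map, so the detection and ReID branches learn to emphasize different channels instead of competing for one shared representation. All names, shapes, and the squeeze-and-excite-style gate are illustrative assumptions.

```python
import numpy as np

def channel_attention(feat, w):
    """Reweight channels of a (C, H, W) feature map with a learned gate.

    feat: shared backbone features, shape (C, H, W).
    w:    per-branch attention weights, shape (C, C) (hypothetical).
    """
    # Squeeze: global average pool over the spatial dimensions -> (C,)
    pooled = feat.mean(axis=(1, 2))
    # Excite: a learned projection followed by a sigmoid gate in [0, 1]
    gate = 1.0 / (1.0 + np.exp(-(w @ pooled)))
    # Broadcast the per-channel gate back over H and W
    return feat * gate[:, None, None]

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
shared = rng.standard_normal((C, H, W))   # one shared feature map
w_det = rng.standard_normal((C, C))       # detection-branch gate weights
w_reid = rng.standard_normal((C, C))      # ReID-branch gate weights

# Each branch receives its own attention-weighted view of the same features,
# which is the sense in which the two tasks are "decoupled" here.
det_feat = channel_attention(shared, w_det)
reid_feat = channel_attention(shared, w_reid)
print(det_feat.shape, reid_feat.shape)
```

With separately learned gate weights, the two branches diverge in which channels they amplify, while still sharing the backbone computation; the thesis's actual decoupling module may differ in both form and placement.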
On-campus access: available from 2029-07-03.