| Student: | 陳亮嘉 Chen, Liang-Chia |
|---|---|
| Thesis Title: | 用於高效視覺追蹤的輕量化 CNN-Vision Transformer 混合架構 (HCV: Lightweight Hybrid CNN-Vision Transformer Architecture for Visual Tracking) |
| Advisor: | 朱威達 Chu, Wei-Ta |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2024 |
| Graduation Academic Year: | 113 |
| Language: | Chinese |
| Pages: | 46 |
| Chinese Keywords: | 單物體追蹤, 視覺追蹤, Vision Transformer |
| English Keywords: | Single Object Tracking, Visual Tracking, Vision Transformer |
| Views / Downloads: | 118 views / 20 downloads |
Visual object tracking is one of the most fundamental research topics in computer vision. In recent years, the Vision Transformer (ViT), with an architecture built on attention mechanisms, has expanded the horizons of the field. However, contemporary mainstream trackers prioritize accuracy, leading to prolonged computation time and a reliance on substantial computational resources and large parameter counts to achieve strong performance. To address this challenge, in this thesis we propose a lightweight single-object tracking model named HCV. Through our proposed feature fusion module, we establish an interface between a CNN and a Vision Transformer (ViT), enabling the use of hierarchical features from different CNN stages while simultaneously acquiring global features through attention. This approach improves accuracy while reducing computational overhead and parameter count. Evaluations on the UAV123, LaSOT, GOT-10k, and TrackingNet datasets validate this idea and demonstrate the efficiency of the proposed lightweight tracker.
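The abstract describes fusing hierarchical CNN features with ViT tokens through an attention-based interface, but gives no implementation details. The sketch below illustrates the general pattern in plain NumPy: a CNN feature map is flattened into tokens, projected to the ViT embedding dimension, and mixed with ViT tokens by one round of scaled dot-product self-attention. All names, shapes, and the single-head, weight-free attention are illustrative assumptions, not the thesis's actual HCV module.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cnn_to_tokens(fmap, w_proj):
    """Flatten a CNN feature map (C, H, W) into (H*W, D) tokens."""
    c, h, w = fmap.shape
    tokens = fmap.reshape(c, h * w).T      # (H*W, C): one token per spatial cell
    return tokens @ w_proj                 # project channels C -> ViT dim D

def fuse_with_attention(cnn_tokens, vit_tokens):
    """Concatenate CNN and ViT tokens, then apply one pass of
    scaled dot-product self-attention so every token attends globally."""
    x = np.concatenate([cnn_tokens, vit_tokens], axis=0)   # (N, D)
    d = x.shape[1]
    attn = softmax(x @ x.T / np.sqrt(d))                   # (N, N) attention map
    return attn @ x                                        # (N, D) fused tokens

rng = np.random.default_rng(0)
fmap = rng.standard_normal((64, 8, 8))        # a mid-stage CNN feature map
w_proj = rng.standard_normal((64, 128))       # projection (random stand-in for a learned layer)
vit_tokens = rng.standard_normal((16, 128))   # ViT tokens (e.g., from the template branch)

fused = fuse_with_attention(cnn_to_tokens(fmap, w_proj), vit_tokens)
print(fused.shape)   # (80, 128): 64 CNN tokens + 16 ViT tokens, all globally mixed
```

In a trained model the projection and the attention's query/key/value maps would be learned layers; the point here is only the data flow that lets convolutional features of one stage enter the transformer's global attention.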