| Author: | 陳居廷 (Chen, Chu-Ting) |
|---|---|
| Thesis Title: | 一個基於離散餘弦轉換的視覺轉換器高效頻率剪枝與融合方法 (A DCT-Based Method for Efficient Frequency Pruning and Fusion in Vision Transformers) |
| Advisor: | 戴顯權 (Tai, Shen-Chuan) |
| Degree: | Master |
| Department: | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Publication Year: | 2024 |
| Graduation Academic Year: | 112 |
| Language: | English |
| Number of Pages: | 80 |
| Chinese Keywords: | 視覺轉換器、模型壓縮、離散餘弦轉換 |
| English Keywords: | Vision Transformers, Model Compression, Discrete Cosine Transform |
In recent years, Vision Transformers (ViTs) have demonstrated exceptional results on a wide range of visual tasks and have become a leading architecture of the new generation. However, this performance depends on high-performance hardware, which makes ViTs difficult to deploy on resource-limited edge devices and constrains their practical applications. Reducing model size and computational cost is therefore an urgent priority.
Previous attempts to compress ViTs have largely adapted techniques developed for compressing convolutional neural networks (CNNs), but because of the significant structural differences between ViTs and CNNs, these approaches have not yielded ideal results. Motivated by recent studies highlighting the greater importance of low-frequency signals in ViTs, this thesis introduces a pruning method for ViTs based on the Discrete Cosine Transform (DCT). By removing high-frequency information while preserving low-frequency information, the method compresses both channels and tokens, effectively reducing the model's size and computational load. However, the removed high-frequency tokens may still contain information that is pivotal for image recognition. To address this, an additional module is introduced to fuse the removed high-frequency information back into the preserved tokens, thereby maintaining the model's accuracy.
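The thesis body specifies how the DCT-based pruning and the fusion module are actually realized; purely as an illustration of the general idea, the sketch below (a simplification of my own, not the author's implementation) applies a 1-D DCT along the token axis, keeps only the low-frequency band, folds a crude summary of the discarded high-frequency band back into it, and inverts the transform. The function name `dct_prune_and_fuse`, the keep ratio, and the averaging-based fusion are all illustrative placeholders.

```python
# Minimal sketch of DCT-based token pruning with high-frequency fusion.
# This is an illustrative simplification, not the thesis's implementation.
import numpy as np
from scipy.fft import dct, idct


def dct_prune_and_fuse(tokens: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """tokens: (num_tokens, dim); returns a shorter (kept_tokens, dim) sequence."""
    n, _ = tokens.shape
    k = max(1, int(n * keep_ratio))

    # 1-D DCT along the token axis: small indices correspond to low frequencies.
    freq = dct(tokens, type=2, norm="ortho", axis=0)

    # Pruning: keep only the k lowest-frequency coefficients.
    low, high = freq[:k], freq[k:]

    # Fusion (placeholder): fold a simple summary of the discarded
    # high-frequency band back into the retained low-frequency band.
    if high.size:
        low = low + high.mean(axis=0, keepdims=True)

    # Inverse DCT of the retained band gives the compressed token sequence.
    return idct(low, type=2, norm="ortho", axis=0)


if __name__ == "__main__":
    x = np.random.randn(197, 384)             # e.g. 197 tokens with 384 channels
    y = dct_prune_and_fuse(x, keep_ratio=0.5)
    print(x.shape, "->", y.shape)              # (197, 384) -> (98, 384)
```

In the actual method, the choice of which tokens and channels to keep and the way high-frequency content is fused back are defined in the later chapters; the sketch only conveys the intuition of cutting the high-frequency end of a DCT spectrum without discarding its information outright.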
Experimental results show that the proposed method performs well when pruning models of different scales and surpasses existing techniques. On the ImageNet classification task, it reduces computational cost by 45% to 60% relative to the baselines while limiting the accuracy drop to less than 1%. This thesis thus provides practical technical support for deploying ViTs in real-world, resource-constrained scenarios.
Campus access: available to the public on 2029-07-31.