| Graduate Student: | 趙佑軒 Chao, Yu-Hsuan |
|---|---|
| Thesis Title: | 透過適應性隨機噪音微調改善模型量化表現 (Improving Model Quantization Accuracy through Adaptive Random Noise Fine-tuning) |
| Advisors: | 謝明得 Shieh, Ming-Der; 林偉棻 Lin, Wei-Fen |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of Publication: | 2024 |
| Graduation Academic Year: | 112 |
| Language: | English |
| Number of Pages: | 77 |
| Chinese Keywords: | Large Language Models, Quantization, Quantization-aware Training, Post-training Quantization |
| English Keywords: | LLM, Quantization, Post-training Quantization, Quantization-aware Training |
Model quantization has become increasingly visible in recent years alongside the development of large language models (LLMs). Because quantization can significantly reduce memory usage during computation while also lowering computation time, many studies have investigated how to quantize models without affecting their accuracy.
In this work, we focus on how to produce, through model fine-tuning, models that retain high accuracy after quantization. Starting from the choice of quantization method, we analyze model quantization error and first come to understand the impact that outliers and data distribution have on quantization. Based on the observed patterns of quantization error, we then propose an adaptive random noise method to improve post-quantization model accuracy during fine-tuning. Finally, we combine it with QuantTune, a method also proposed by our team, and confirm that the combination more effectively mitigates the accuracy drop after quantization by addressing the outlier and data-distribution problems.
Our method improves the post-quantization accuracy of Vision Transformer (ViT) models, with an average improvement of 17.57% across the models tested. Experiments also confirm that the proposed method can be integrated into the fine-tuning stage very easily. Furthermore, because the proposed method performs its optimization in the fine-tuning stage, room for optimization in the post-training quantization (PTQ) stage remains after fine-tuning. We also combine our approach with previous PTQ methods and confirm that, for certain methods, the combination improves model accuracy compared with using the original method alone.
Model quantization has gained increasing visibility in recent years with the development of large language models (LLMs). Because quantization significantly reduces both memory usage and computation time during model execution, numerous studies have focused on how to perform quantization without compromising model accuracy.
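In practice this trade-off is often modeled as a quantize-dequantize ("fake quantization") round trip. The sketch below is illustrative only and is not taken from the thesis: it applies symmetric per-tensor INT8 quantization to a random tensor, so storage drops from four bytes per value to one byte plus a single scale, and the printed error is the rounding error that the rest of the work traces back to outliers and data distribution.

```python
# Minimal sketch of symmetric per-tensor INT8 fake quantization (illustrative only).
import torch

def quantize_dequantize(x: torch.Tensor, num_bits: int = 8):
    """Map x to signed integers, then back to float, returning the result and scale."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for INT8
    scale = x.abs().max() / qmax                # step size is set by the largest value
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q * scale, scale

x = torch.randn(1024, 1024)                     # FP32 storage: 4 MiB
x_hat, scale = quantize_dequantize(x)           # the INT8 codes would occupy 1 MiB plus one scale
print("mean abs quantization error:", (x - x_hat).abs().mean().item())
```

Because the scale is pinned to the largest magnitude, a single outlier stretches the step size for every other value, which is exactly why outliers and data distribution matter below.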
In this study, we concentrate on producing quantized models that retain high accuracy through model fine-tuning. We begin by selecting an appropriate quantization method and analyzing model quantization errors, from which we first identify the impact of outliers and data distribution on quantization. Based on the observed patterns of quantization error, we then propose an adaptive random noise method that improves quantized model accuracy during fine-tuning. Finally, we demonstrate that combining our approach with QuantTune, another method developed by our team, more effectively mitigates the accuracy degradation of post-training quantization (PTQ) by addressing both outliers and data distribution.
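The abstract does not spell out the exact noise scheme, so the following is only a hedged sketch of the general idea, assuming the noise amplitude is adapted to each weight tensor's current quantization step (uniform noise with the same half-step support as the rounding error); the adaptive scheme actually used in the thesis may differ. The `perturb_weights` and `restore_weights` helpers are hypothetical names introduced here for illustration.

```python
# Hedged sketch: noise-injection fine-tuning with per-tensor, step-sized noise.
import torch
import torch.nn as nn

def perturb_weights(model: nn.Module, num_bits: int = 8):
    """Add uniform noise matching the rounding error of num_bits quantization;
    return the noise so the caller can remove it after the training step."""
    qmax = 2 ** (num_bits - 1) - 1
    noises = []
    with torch.no_grad():
        for m in model.modules():
            if isinstance(m, (nn.Linear, nn.Conv2d)):
                step = m.weight.abs().max() / qmax              # per-tensor step size
                noise = (torch.rand_like(m.weight) - 0.5) * step
                m.weight.add_(noise)
                noises.append((m.weight, noise))
    return noises

def restore_weights(noises):
    """Undo the perturbation so noise does not accumulate across steps."""
    with torch.no_grad():
        for weight, noise in noises:
            weight.sub_(noise)

# Usage inside an ordinary fine-tuning step (model, batch, optimizer assumed):
#   noises = perturb_weights(model)
#   loss = criterion(model(x), y)      # forward/backward see the noisy weights
#   loss.backward()
#   restore_weights(noises)            # remove the noise before the parameter update
#   optimizer.step(); optimizer.zero_grad()
```

Restoring the weights after the backward pass keeps the stored weights clean while the gradient is evaluated at a randomly perturbed point, which is one standard way to make a model robust to quantization-like perturbations.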
Our methods improve the post-quantization accuracy of Vision Transformer (ViT) and natural language processing (NLP) models, yielding an average improvement of 17.57% across the evaluated models. Moreover, experiments confirm that the proposed methods can be easily integrated into the model fine-tuning stage. Additionally, because our approach performs its optimization during fine-tuning, room for PTQ optimization remains after fine-tuning is complete. We further integrate our approach with previous PTQ methods and demonstrate its effectiveness and portability, improving accuracy for certain models beyond what the original methods achieve alone.
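Because the fine-tuning stage leaves PTQ untouched, a standard calibration pass can still follow it. As a generic illustration only (not one of the specific PTQ methods combined in the thesis), the sketch below collects per-tensor activation ranges on a small calibration set and converts them into INT8 scales; `calib_loader` is an assumed data loader yielding `(inputs, labels)` batches.

```python
# Generic illustration of a post-fine-tuning PTQ calibration step (min-max, per tensor).
import torch
import torch.nn as nn

@torch.no_grad()
def calibrate_activation_scales(model: nn.Module, calib_loader, num_bits: int = 8):
    qmax = 2 ** (num_bits - 1) - 1
    max_abs, hooks = {}, []

    def make_hook(name):
        def hook(_module, _inputs, output):
            peak = output.detach().abs().max().item()
            max_abs[name] = max(max_abs.get(name, 0.0), peak)
        return hook

    # Track the peak activation magnitude of every Linear/Conv output.
    for name, m in model.named_modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            hooks.append(m.register_forward_hook(make_hook(name)))

    model.eval()
    for inputs, _ in calib_loader:          # a few hundred calibration samples is typical
        model(inputs)

    for h in hooks:
        h.remove()
    return {name: peak / qmax for name, peak in max_abs.items()}
```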
[1] M v Baalen, A Kuzmin, SS Nair, Y Ren, E Mahurin, C Patel, S Subramanian, S Lee, M Nagel, J Soriaga, et al. Fp8 versus int8 for efficient deep learning inference. arXiv preprint arXiv:2303.17951, 2023.
[2] Yash Bhalgat, Jinwon Lee, Markus Nagel, Tijmen Blankevoort, and Nojun Kwak. Lsq+: Improving low-bit quantization through learnable offsets and better initialization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 696–697, 2020.
[3] Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Understanding and overcoming the challenges of efficient transformer quantization. arXiv preprint arXiv:2109.12948, 2021.
[4] Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Quantizable transformers: Removing outliers by helping attention heads do nothing. Advances in Neural Information Processing Systems, 36, 2024.
[5] Jiun-Man Chen, Yu-Hsuan Chao, Yu-Jie Wang, Ming-Der Shieh, Chih-Chung Hsu, and Wei-Fen Lin. Quanttune: Optimizing model quantization with adaptive outlier-driven fine tuning. arXiv preprint arXiv:2403.06497, 2024.
[6] Brian Chmiel, Ron Banner, Gil Shomron, Yury Nahshan, Alex Bronstein, Uri Weiser, et al. Robust quantization: One model to rule them all. Advances in neural information processing systems, 33:5308–5317, 2020.
[7] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.
[8] Yoni Choukroun, Eli Kravchik, Fan Yang, and Pavel Kisilev. Low-bit quantization of neural networks for efficient inference. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3009–3018. IEEE, 2019.
[9] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. What does bert look at? an analysis of bert’s attention. arXiv preprint arXiv:1906.04341, 2019.
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[11] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[13] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
[14] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[15] Lu Hou and James T Kwok. Loss-aware weight quantization of deep networks. arXiv preprint arXiv:1802.08635, 2018.
[16] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2704–2713, 2018.
[17] Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, et al. A study of bfloat16 for deep learning training. arXiv preprint arXiv:1905.12322, 2019.
[18] Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. Attention is not only a weight: Analyzing transformers with vector norms. arXiv preprint arXiv:2004.10102, 2020.
[19] Olga Kovaleva, Saurabh Kulshreshtha, Anna Rogers, and Anna Rumshisky. Bert busters: Outlier dimensions that disrupt transformers. arXiv preprint arXiv:2105.06990, 2021.
[20] Arnav Kundu, Chungkuk Yoo, Srijan Mishra, Minsik Cho, and Saurabh Adya. R^2: Range regularization for model compression and quantization. arXiv preprint arXiv:2303.08253, 2023.
[21] Andrey Kuzmin, Mart Van Baalen, Yuwei Ren, Markus Nagel, Jorn Peters, and Tijmen Blankevoort. Fp8 quantization: The power of the exponent. Advances in Neural Information Processing Systems, 35:14651–14662, 2022.
[22] Rundong Li, Yan Wang, Feng Liang, Hongwei Qin, Junjie Yan, and Rui Fan. Fully quantized network for object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2810–2819, 2019.
[23] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
[24] Yang Lin, Tianyu Zhang, Peiqin Sun, Zheng Li, and Shuchang Zhou. Fq-vit: Post-training quantization for fully quantized vision transformer. arXiv preprint arXiv:2111.13824, 2021.
[25] Yijiang Liu, Huanrui Yang, Zhen Dong, Kurt Keutzer, Li Du, and Shanghang Zhang. Noisyquant: Noisy bias-enhanced post-training activation quantization for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20321–20330, 2023.
[26] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
[27] Christos Louizos, Matthias Reisser, Tijmen Blankevoort, Efstratios Gavves, and Max Welling. Relaxed quantization for discretized neural networks. arXiv preprint arXiv:1810.01875, 2018.
[28] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
[29] Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. Fp8 formats for deep learning. arXiv preprint arXiv:2209.05433, 2022.
[30] Daisuke Miyashita, Edward H Lee, and Boris Murmann. Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025, 2016.
[31] Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? adaptive rounding for post-training quantization. In International Conference on Machine Learning, pages 7197–7206. PMLR, 2020.
[32] Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and Max Welling. Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1325–1334, 2019.
[33] Markus Nagel, Marios Fournarakis, Yelysei Bondarenko, and Tijmen Blankevoort. Overcoming oscillations in quantization-aware training. In International Conference on Machine Learning, pages 16318–16330. PMLR, 2022.
[34] Yury Nahshan, Brian Chmiel, Chaim Baskin, Evgenii Zheltonozhskii, Ron Banner, Alex M Bronstein, and Avi Mendelson. Loss aware post-training quantization. Machine Learning, 110(11):3245–3262, 2021.
[35] Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, and Dongsoo Lee. Lut-gemm: Quantized matrix multiplication based on luts for efficient inference in large-scale generative language models. arXiv preprint arXiv:2206.09557, 2022.
[36] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
[37] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.
[38] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
[39] Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, and Xianglong Liu. Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling. arXiv preprint arXiv:2304.09145, 2023.
[40] Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, and Xianglong Liu. Outlier suppression: Pushing the limit of low-bit transformer language models. Advances in Neural Information Processing Systems, 35:17402–17414, 2022.
[41] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language mod-els. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023.
[42] Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, and Yuxiong He. Exploring post-training quantization in llms from comprehensive study to low rank compensation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19377–19385, 2024.
[43] Penghang Yin, Jiancheng Lyu, Shuai Zhang, Stanley Osher, Yingyong Qi, and Jack Xin. Understanding straight-through estimator in training activation quantized neural nets. arXiv preprint arXiv:1903.05662, 2019.
[44] Zhihang Yuan, Chenhao Xue, Yiqi Chen, Qiang Wu, and Guangyu Sun. Ptq4vit: Post-training quantization for vision transformers with twin uniform quantization. In European conference on computer vision, pages 191–207. Springer, 2022.
[45] Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Chris De Sa, and Zhiru Zhang. Improving neural network quantization without retraining using outlier channel splitting. In International conference on machine learning, pages 7543–7552. PMLR, 2019.
Full text available on campus: 2027-08-08.