| Author: | 彭恩宇 Pong, En-Yu |
|---|---|
| Thesis title: | 非線性函數近似及其於TVM編譯器的NPU運算子合法化 (Non-linear Function Approximations and their NPU Operator Legalization in TVM Compiler) |
| Advisor: | 陳中和 Chen, Chung-Ho |
| Degree: | Master |
| Department: | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of publication: | 2025 |
| Graduation academic year: | 113 (ROC calendar) |
| Language: | Chinese |
| Pages: | 105 |
| Keywords (Chinese): | NPU、非線性函數、Transformer |
| Keywords (English): | NPU, Non-linear Function, Transformer |
With the rapid advancement of artificial intelligence (AI) technology, deep learning models [1] have achieved breakthrough progress in fields such as speech recognition, computer vision, and natural language processing. To improve the efficiency of AI computation, dedicated Neural Processing Units (NPUs) are widely deployed for deep learning inference. Compared with traditional CPUs and GPUs, NPUs deliver higher computational performance at lower power consumption, making them particularly well suited to large-scale neural network inference. With the rise of the Transformer model [2], which has become the mainstream architecture of deep learning, AI applications have advanced further, but so have their computational demands.
NPUs have a pronounced advantage in matrix and convolution computation, improving throughput by executing large numbers of multiply-accumulate (MAC) operations in parallel. However, the nonlinear functions in Transformer models, such as Softmax, LayerNorm, and GELU, are difficult for an NPU to support directly. Some studies add dedicated units to the NPU circuitry (e.g., a Softmax unit) to handle a specific nonlinear operation, but such units typically support only that single operation, leaving hardware resources underutilized during model execution. Moreover, different AI models may contain a variety of nonlinear operations, so designing dedicated hardware for a single nonlinear function limits the NPU's generality and its ability to support diverse models.
To address this problem, this work designs two versions of approximation algorithms for the three nonlinear functions in the Vision Transformer (ViT) [3] model, namely Softmax, LayerNorm, and GELU, decomposing these nonlinear operations into sequences of relatively simple element-wise and reduction operations that can be deployed on the Novella series NPUs developed in our laboratory. The first version targets integer-quantized AI models: it approximates the above nonlinear functions entirely with integer arithmetic while keeping the accuracy loss to about 1%. The second version is built on the lightweight floating-point data type Bfloat16 (BF16) [4] for AI models that support floating-point computation, with an accuracy loss of only about 0.1%.
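To make the decomposition concrete, here is a minimal NumPy sketch, assuming plain float32 arithmetic rather than the thesis's integer or BF16 kernels, showing that Softmax, LayerNorm, and the tanh-based GELU approximation [18] can each be expressed with only element-wise and axis-reduction primitives; the helper names and the epsilon constant are illustrative.

```python
# Illustrative decomposition only -- not the thesis's integer/BF16 approximations.
import numpy as np

def softmax(x, axis=-1):
    # reduce(max) -> element-wise subtract/exp -> reduce(sum) -> element-wise divide
    m = np.max(x, axis=axis, keepdims=True)
    e = np.exp(x - m)
    return e / np.sum(e, axis=axis, keepdims=True)

def layernorm(x, gamma, beta, axis=-1, eps=1e-5):
    # reduce(mean) -> element-wise square -> reduce(mean) -> element-wise normalize/scale/shift
    mu = np.mean(x, axis=axis, keepdims=True)
    var = np.mean((x - mu) ** 2, axis=axis, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def gelu_tanh(x):
    # tanh-based GELU approximation from Hendrycks and Gimpel [18]; purely element-wise
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))
```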
In this work, a model compiler based on the TVM (Tensor Virtual Machine) [5] framework uses TVM Relay as the intermediate representation of the compiled model, expressing the model as a computation graph composed of operators. We design Relay-level transformation passes that replace the nonlinear functions in the computation graph with the integer or BF16 approximation algorithms. Because the approximations are implemented in software and consist only of element-wise and reduction operations, the NPU hardware needs to support only these basic operations to gain, through the compiler, indirect support for a wide range of nonlinear functions.
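As a sketch of what such a Relay-level transformation can look like, the pass below uses TVM's dataflow-pattern API to replace `nn.softmax` calls with max/exp/sum/divide primitives. The class name and overall structure are assumptions for illustration; the thesis's actual passes, including the integer and BF16 variants, are not reproduced here.

```python
# Hedged sketch of a Relay rewrite that lowers nn.softmax into
# element-wise and reduction primitives (max, exp, sum, divide).
from tvm import relay
from tvm.relay.dataflow_pattern import DFPatternCallback, is_op, wildcard, rewrite

class SoftmaxToPrimitives(DFPatternCallback):          # hypothetical pass name
    def __init__(self):
        super().__init__()
        self.inp = wildcard()
        self.pattern = is_op("nn.softmax")(self.inp)   # match any nn.softmax call

    def callback(self, pre, post, node_map):
        x = node_map[self.inp][0]
        axis = int(pre.attrs.axis)                     # softmax axis of the matched op
        m = relay.max(x, axis=axis, keepdims=True)     # reduction
        e = relay.exp(relay.subtract(x, m))            # element-wise
        s = relay.sum(e, axis=axis, keepdims=True)     # reduction
        return relay.divide(e, s)                      # element-wise
```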
With the rapid advancement of artificial intelligence (AI) technologies, deep learning models [1] have achieved groundbreaking progress in areas such as speech recognition, computer vision, and natural language processing. To enhance computational efficiency in AI, specialized Neural Processing Units (NPUs) are widely applied in deep learning inference scenarios. Compared to traditional CPUs and GPUs, NPUs offer higher computational performance and lower power consumption, making them especially suitable for large-scale neural network inference. With the rise of Transformer models [2], which have become the mainstream architecture in deep learning, AI applications have been further propelled forward, albeit with increasing computational demands.
NPUs excel at matrix and convolution operations, executing large volumes of multiply-accumulate (MAC) operations in parallel. However, the nonlinear functions commonly found in Transformer models, such as Softmax, LayerNorm, and GELU, are more difficult for NPUs to support directly. To address this issue, this study proposes two versions of approximation algorithms for the nonlinear functions used in the Vision Transformer (ViT) [3] model, decomposing these nonlinear operations into simpler element-wise and reduction operations. The first version targets integer-quantized AI models, approximating the nonlinear functions using only integer arithmetic while keeping the accuracy loss to about 1%. The second version is built on the lightweight floating-point data type Bfloat16 (BF16) [4], suitable for AI models that support floating-point computation, and incurs an accuracy loss of only about 0.1%.
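As background on the BF16 data type mentioned above, BF16 [4] keeps float32's sign bit and 8-bit exponent but only 7 mantissa bits, so a float32 value maps to BF16 by keeping its upper 16 bits. The NumPy sketch below (the helper names are illustrative, and this is not the thesis's conversion code) demonstrates the format with round-to-nearest-even.

```python
# Minimal BF16 round-trip sketch based on the format definition in [4];
# illustrative only, not the thesis's BF16 kernels.
import numpy as np

def float32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Round float32 values to bfloat16, returned as uint16 bit patterns."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # round-to-nearest-even on the 16 low bits that will be discarded
    rounding_bias = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    return ((bits + rounding_bias) >> 16).astype(np.uint16)

def bf16_bits_to_float32(b: np.ndarray) -> np.ndarray:
    """Expand bfloat16 bit patterns back to float32 by zero-filling the low 16 bits."""
    return (b.astype(np.uint32) << np.uint32(16)).view(np.float32)

x = np.array([3.14159265, 1e-3, 65504.0], dtype=np.float32)
print(bf16_bits_to_float32(float32_to_bf16_bits(x)))   # values after BF16 rounding
```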
In this study, we use the TVM (Tensor Virtual Machine) [5] framework and its intermediate representation, TVM Relay, to construct the model's computation graph from multiple operators. We design Relay-level transformation passes that replace the nonlinear functions in the computation graph with the proposed integer or BF16 approximation algorithms. Since these algorithms are implemented in software and composed only of basic element-wise and reduction operations, the NPU hardware needs to support only these fundamental operations to gain, through the compiler, indirect support for a wide range of nonlinear functions.
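For illustration, the hypothetical `SoftmaxToPrimitives` pass sketched earlier could be exercised end to end as follows; after the rewrite, the printed Relay function contains only element-wise and reduction operators, which is the property the compiler-based approach relies on.

```python
# Usage sketch (assumes the SoftmaxToPrimitives class from the earlier example).
from tvm import relay
from tvm.relay.dataflow_pattern import rewrite

x = relay.var("x", shape=(1, 197, 197), dtype="float32")   # illustrative ViT-sized attention scores
func = relay.Function([x], relay.nn.softmax(x, axis=-1))
print(rewrite(SoftmaxToPrimitives(), func))                 # graph now uses max/exp/sum/divide
```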
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 30, Curran Associates, Inc., 2017.
[3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
[4] Intel, “BFLOAT16 - hardware numerics definition.” https://www.intel.com/content/dam/develop/external/us/en/documents/bf16-hardware-numerics-definition-white-paper.pdf, 2018.
[5] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “TVM: An automated End-to-End optimizing compiler for deep learning,” in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), (Carlsbad, CA), pp. 578–594, USENIX Association, Oct. 2018.
[6] Y. Lin, T. Zhang, P. Sun, Z. Li, and S. Zhou, “FQ-ViT: Post-training quantization for fully quantized vision transformer,” in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pp. 1173–1179, 2022.
[7] Z. Li and Q. Gu, “I-ViT: Integer-only quantization for efficient vision transformer inference,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 17065–17075, 2023.
[8] C. Lomont, “Fast inverse square root,” technical report, 2003.
[9] B. Darvish Rouhani, D. Lo, R. Zhao, M. Liu, J. Fowers, K. Ovtcharov, A. Vinogradsky, S. Massengill, L. Yang, R. Bittner, A. Forin, H. Zhu, T. Na, P. Patel, S. Che, L. Chand Koppaka, X. Song, S. Som, K. Das, S. T, S. Reinhardt, S. Lanka, E. Chung, and D. Burger, “Pushing the limits of narrow precision inferencing at cloud scale with Microsoft Floating Point,” in Advances in Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds.), vol. 33, pp. 10271–10281, Curran Associates, Inc., 2020.
[10] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jegou, “Training data-efficient image transformers & distillation through attention,” in Proceedings of the 38th International Conference on Machine Learning (M. Meila and T. Zhang, eds.), vol. 139 of Proceedings of Machine Learning Research, pp. 10347–10357, PMLR, 18–24 Jul 2021.
[11] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin Transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022, 2021.
[12] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z.-H. Jiang, F. E. Tay, J. Feng, and S. Yan, “Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 558–567, 2021.
[13] G. Jawahar, B. Sagot, and D. Seddah, “What does BERT learn about the structure of language?,” in ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, (Florence, Italy), July 2019.
[14] M. Nagel, M. Fournarakis, R. A. Amjad, Y. Bondarenko, M. Van Baalen, and T. Blankevoort, “A white paper on neural network quantization,” arXiv preprint arXiv:2106.08295, 2021.
[15] NVIDIA, “TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x.” https://blogs.nvidia.com/blog/tensorfloat-32-precision-format/, 2020.
[16] S. Lu, M. Wang, S. Liang, J. Lin, and Z. Wang, “Hardware accelerator for multi-head attention and position-wise feed-forward in the transformer,” in 2020 IEEE 33rd International System-on-Chip Conference (SOCC), pp. 84–89, 2020.
[17] H. Wang, Z. Zhang, and S. Han, “Spatten: Efficient sparse attention architecture with cascade token and head pruning,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 97–110, IEEE, 2021.
[18] D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv preprint arXiv:1606.08415, 2016.
[19] W. Kahan, “Further remarks on reducing truncation errors,” Communications of the ACM, vol. 8, p. 40, 1965.
[20] B. P. Welford, “Note on a method for calculating corrected sums of squares and products,” Technometrics, vol. 4, no. 3, pp. 419–420, 1962.