| Author: | 彭恩宇 Pong, En-Yu |
|---|---|
| Thesis title: | 非線性函數近似及其於TVM編譯器的NPU運算子合法化 (Non-linear Function Approximations and their NPU Operator Legalization in TVM Compiler) |
| Advisor: | 陳中和 Chen, Chung-Ho |
| Degree: | Master |
| Department: | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of publication: | 2025 |
| Graduation academic year: | 113 (ROC calendar) |
| Language: | Chinese |
| Pages: | 105 |
| Keywords (Chinese): | NPU、非線性函數、Transformer |
| Keywords (English): | NPU, Non-linear Function, Transformer |
With the rapid advancement of artificial intelligence (AI) technology, deep learning models [1] have achieved breakthrough progress in fields such as speech recognition, computer vision, and natural language processing. To improve the efficiency of AI computation, dedicated Neural Processing Units (NPUs) are widely deployed for deep learning inference. Compared with traditional CPUs and GPUs, NPUs deliver higher computational performance at lower power consumption, making them particularly well suited to large-scale neural network inference. With the rise of the Transformer model [2], which has become the mainstream architecture of deep learning, AI applications have advanced further, but so have their computational demands.
NPUs have a pronounced advantage in matrix and convolution computation, improving throughput by executing large numbers of multiply-accumulate (MAC) operations in parallel. However, the nonlinear functions in Transformer models, such as Softmax, LayerNorm, and GELU, are difficult for an NPU to support directly. Some studies add dedicated units to the NPU circuitry (e.g., a Softmax unit) to handle a specific nonlinear operation, but such units typically support only that single operation, leaving hardware resources underutilized during model execution. Moreover, different AI models may contain a variety of nonlinear operations, so designing dedicated hardware for a single nonlinear function limits the NPU's generality and its ability to support diverse models.
To address this problem, this work designs two versions of approximation algorithms for the three nonlinear functions in the Vision Transformer (ViT) [3] model, namely Softmax, LayerNorm, and GELU, decomposing these nonlinear operations into sequences of relatively simple element-wise and reduction operations that can be deployed on the Novella series NPUs developed in our laboratory. The first version targets integer-quantized AI models: it approximates the above nonlinear functions entirely with integer arithmetic while keeping the accuracy loss to about 1%. The second version is built on the lightweight floating-point data type Bfloat16 (BF16) [4] for AI models that support floating-point computation, with an accuracy loss of only about 0.1%.
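To make the decomposition concrete, here is a minimal NumPy sketch, assuming plain float32 arithmetic rather than the thesis's integer or BF16 kernels, showing that Softmax, LayerNorm, and the tanh-based GELU approximation [18] can each be expressed with only element-wise and axis-reduction primitives; the helper names and the epsilon constant are illustrative.

```python
# Illustrative decomposition only -- not the thesis's integer/BF16 approximations.
import numpy as np

def softmax(x, axis=-1):
    # reduce(max) -> element-wise subtract/exp -> reduce(sum) -> element-wise divide
    m = np.max(x, axis=axis, keepdims=True)
    e = np.exp(x - m)
    return e / np.sum(e, axis=axis, keepdims=True)

def layernorm(x, gamma, beta, axis=-1, eps=1e-5):
    # reduce(mean) -> element-wise square -> reduce(mean) -> element-wise normalize/scale/shift
    mu = np.mean(x, axis=axis, keepdims=True)
    var = np.mean((x - mu) ** 2, axis=axis, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def gelu_tanh(x):
    # tanh-based GELU approximation from Hendrycks and Gimpel [18]; purely element-wise
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))
```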
In this work, a model compiler based on the TVM (Tensor Virtual Machine) [5] framework uses TVM Relay as the intermediate representation of the compiled model, expressing the model as a computation graph composed of operators. We design Relay-level transformation passes that replace the nonlinear functions in the computation graph with the integer or BF16 approximation algorithms. Because the approximations are implemented in software and consist only of element-wise and reduction operations, the NPU hardware needs to support only these basic operations to gain, through the compiler, indirect support for a wide range of nonlinear functions.
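As a sketch of what such a Relay-level transformation can look like, the pass below uses TVM's dataflow-pattern API to replace `nn.softmax` calls with max/exp/sum/divide primitives. The class name and overall structure are assumptions for illustration; the thesis's actual passes, including the integer and BF16 variants, are not reproduced here.

```python
# Hedged sketch of a Relay rewrite that lowers nn.softmax into
# element-wise and reduction primitives (max, exp, sum, divide).
from tvm import relay
from tvm.relay.dataflow_pattern import DFPatternCallback, is_op, wildcard, rewrite

class SoftmaxToPrimitives(DFPatternCallback):          # hypothetical pass name
    def __init__(self):
        super().__init__()
        self.inp = wildcard()
        self.pattern = is_op("nn.softmax")(self.inp)   # match any nn.softmax call

    def callback(self, pre, post, node_map):
        x = node_map[self.inp][0]
        axis = int(pre.attrs.axis)                     # softmax axis of the matched op
        m = relay.max(x, axis=axis, keepdims=True)     # reduction
        e = relay.exp(relay.subtract(x, m))            # element-wise
        s = relay.sum(e, axis=axis, keepdims=True)     # reduction
        return relay.divide(e, s)                      # element-wise
```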
With the rapid advancement of artificial intelligence (AI) technologies, deep learning models [1] have achieved groundbreaking progress in areas such as speech recognition, computer vision, and natural language processing. To enhance computational efficiency in AI, specialized Neural Processing Units (NPUs) are widely applied in deep learning inference scenarios. Compared to traditional CPUs and GPUs, NPUs offer higher computational performance and lower power consumption, making them especially suitable for large-scale neural network inference. With the rise of Transformer models [2], which have become the mainstream architecture in deep learning, AI applications have been further propelled forward, albeit with increasing computational demands.
NPUs excel at matrix and convolution operations, executing large volumes of multiply-accumulate (MAC) operations in parallel. However, the nonlinear functions commonly found in Transformer models, such as Softmax, LayerNorm, and GELU, are more difficult for NPUs to support directly. To address this issue, this study proposes two versions of approximation algorithms for the nonlinear functions used in the Vision Transformer (ViT) [3] model, decomposing these nonlinear operations into simpler element-wise and reduction operations. The first version targets integer-quantized AI models, approximating the nonlinear functions using only integer arithmetic while keeping the accuracy loss to about 1%. The second version is built on the lightweight floating-point data type Bfloat16 (BF16) [4], suitable for AI models that support floating-point computation, and incurs an accuracy loss of only about 0.1%.
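As background on the BF16 data type mentioned above, BF16 [4] keeps float32's sign bit and 8-bit exponent but only 7 mantissa bits, so a float32 value maps to BF16 by keeping its upper 16 bits. The NumPy sketch below (the helper names are illustrative, and this is not the thesis's conversion code) demonstrates the format with round-to-nearest-even.

```python
# Minimal BF16 round-trip sketch based on the format definition in [4];
# illustrative only, not the thesis's BF16 kernels.
import numpy as np

def float32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Round float32 values to bfloat16, returned as uint16 bit patterns."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # round-to-nearest-even on the 16 low bits that will be discarded
    rounding_bias = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    return ((bits + rounding_bias) >> 16).astype(np.uint16)

def bf16_bits_to_float32(b: np.ndarray) -> np.ndarray:
    """Expand bfloat16 bit patterns back to float32 by zero-filling the low 16 bits."""
    return (b.astype(np.uint32) << np.uint32(16)).view(np.float32)

x = np.array([3.14159265, 1e-3, 65504.0], dtype=np.float32)
print(bf16_bits_to_float32(float32_to_bf16_bits(x)))   # values after BF16 rounding
```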
In this study, we use the TVM (Tensor Virtual Machine) [5] framework and its intermediate representation, TVM Relay, to construct the model's computation graph from multiple operators. We design Relay-level transformation passes that replace the nonlinear functions in the computation graph with the proposed integer or BF16 approximation algorithms. Since these algorithms are implemented in software and composed only of basic element-wise and reduction operations, the NPU hardware needs to support only these fundamental operations to gain, through the compiler, indirect support for a wide range of nonlinear functions.
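For illustration, the hypothetical `SoftmaxToPrimitives` pass sketched earlier could be exercised end to end as follows; after the rewrite, the printed Relay function contains only element-wise and reduction operators, which is the property the compiler-based approach relies on.

```python
# Usage sketch (assumes the SoftmaxToPrimitives class from the earlier example).
from tvm import relay
from tvm.relay.dataflow_pattern import rewrite

x = relay.var("x", shape=(1, 197, 197), dtype="float32")   # illustrative ViT-sized attention scores
func = relay.Function([x], relay.nn.softmax(x, axis=-1))
print(rewrite(SoftmaxToPrimitives(), func))                 # graph now uses max/exp/sum/divide
```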
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 30, Curran Associates, Inc., 2017.
[3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
[4] Intel, “BFLOAT16 - hardware numerics definition.” https://www.intel.com/content/dam/develop/external/us/en/documents/bf16-hardware-numerics-definition-white-paper.pdf, 2018.
[5] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “TVM: An automated End-to-End optimizing compiler for deep learning,” in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), (Carlsbad, CA), pp. 578–594, USENIX Association, Oct. 2018.
[6] Y. Lin, T. Zhang, P. Sun, Z. Li, and S. Zhou, “FQ-ViT: Post-training quantization for fully quantized vision transformer,” in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pp. 1173–1179, 2022.
[7] Z. Li and Q. Gu, “I-ViT: Integer-only quantization for efficient vision transformer inference,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 17065–17075, 2023.
[8] C. Lomont, “Fast inverse square root,” technical report, 2003.
[9] B. Darvish Rouhani, D. Lo, R. Zhao, M. Liu, J. Fowers, K. Ovtcharov, A. Vinogradsky, S. Massengill, L. Yang, R. Bittner, A. Forin, H. Zhu, T. Na, P. Patel, S. Che, L. Chand Koppaka, X. Song, S. Som, K. Das, S. T, S. Reinhardt, S. Lanka, E. Chung, and D. Burger, “Pushing the limits of narrow precision inferencing at cloud scale with Microsoft Floating Point,” in Advances in Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds.), vol. 33, pp. 10271–10281, Curran Associates, Inc., 2020.
[10] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jegou, “Training data-efficient image transformers & distillation through attention,” in Proceedings of the 38th International Conference on Machine Learning (M. Meila and T. Zhang, eds.), vol. 139 of Proceedings of Machine Learning Research, pp. 10347–10357, PMLR, 18–24 Jul 2021.
[11] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin Transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022, 2021.
[12] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z.-H. Jiang, F. E. Tay, J. Feng, and S. Yan, “Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 558–567, 2021.
[13] G. Jawahar, B. Sagot, and D. Seddah, “What does BERT learn about the structure of language?,” in ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, (Florence, Italy), July 2019.
[14] M. Nagel, M. Fournarakis, R. A. Amjad, Y. Bondarenko, M. Van Baalen, and T. Blankevoort, “A white paper on neural network quantization,” arXiv preprint arXiv:2106.08295, 2021.
[15] NVIDIA, “TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x.” https://blogs.nvidia.com/blog/tensorfloat-32-precision-format/, 2020.
[16] S. Lu, M. Wang, S. Liang, J. Lin, and Z. Wang, “Hardware accelerator for multi-head attention and position-wise feed-forward in the transformer,” in 2020 IEEE 33rd International System-on-Chip Conference (SOCC), pp. 84–89, 2020.
[17] H. Wang, Z. Zhang, and S. Han, “Spatten: Efficient sparse attention architecture with cascade token and head pruning,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 97–110, IEEE, 2021.
[18] D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv preprint arXiv:1606.08415, 2016.
[19] W. Kahan, “Further remarks on reducing truncation errors,” Communications of the ACM, vol. 8, p. 40, 1965.
[20] B. P. Welford, “Note on a method for calculating corrected sums of squares and products,” Technometrics, vol. 4, no. 3, pp. 419–420, 1962.