
Author: 陳奕瑋 (Chen, I-Wei)
Title: 近似激活函數與模型量化的TVM實作與在CNN/Transformer加速器的應用
Exploration of Approximate Activation Functions and Model Quantization on TVM for CNN/Transformer Accelerator
Advisor: 陳中和 (Chen, Chung-Ho)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication year: 2024
Graduation academic year: 112
Language: Chinese
Pages: 93
Chinese keywords: 近似激活函數, 深度學習加速器, 模型量化, 模型部署
Foreign keywords: Activation Function Approximation, Deep Learning Accelerator, Model Quantization, Model Deployment
Abstract:
With the rapid development and widespread application of deep learning, improving the computational efficiency and resource utilization of models has become a critical issue. Traditional deep learning models often require substantial computation and storage during inference, which limits their use in resource-constrained environments. This study addresses this challenge by proposing and exploring quantization techniques and activation function approximation algorithms for deep learning models, improving inference efficiency and reducing data transfer while maintaining model accuracy and performance.
To this end, this thesis designs an approximation algorithm applicable to a variety of activation functions. The algorithm provides accuracy control and performs segment optimization, effectively simplifying the computation of complex activation functions. In addition, several deep learning models are quantized and then successfully deployed and validated on TVM, demonstrating the feasibility and effectiveness of the quantization techniques. The activation functions in the quantized models are further rewritten in a lookup table (LUT) based form and verified to be suitable for subsequent hardware implementation. Finally, LUT-based quantized activation function units are designed into the CNN/Transformer Unified Accelerator developed by our laboratory, further improving the efficiency of activation function computation.
By combining quantization techniques with activation function approximation algorithms, the results of this study improve the performance of deep learning models in resource-constrained environments and provide an effective solution for future hardware implementation.
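The approximation algorithm described above provides accuracy control and segment optimization for arbitrary activation functions. The sketch below is only a minimal illustration of the general idea, assuming a greedy interval-splitting scheme driven by a user-specified maximum absolute error; the thesis's actual algorithm and its segment optimization are not reproduced here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def build_pwl_segments(fn, lo, hi, max_err, n_probe=256):
    """Greedily split [lo, hi] into linear segments whose maximum
    absolute error against fn stays below max_err (illustrative only)."""
    xs = np.linspace(lo, hi, n_probe)
    y0, y1 = fn(lo), fn(hi)
    slope = (y1 - y0) / (hi - lo)
    err = np.max(np.abs(fn(xs) - (y0 + slope * (xs - lo))))
    if err <= max_err or (hi - lo) < 1e-6:
        return [(lo, hi, slope, y0 - slope * lo)]  # (start, end, a, b) for a*x + b
    mid = 0.5 * (lo + hi)
    return (build_pwl_segments(fn, lo, mid, max_err, n_probe) +
            build_pwl_segments(fn, mid, hi, max_err, n_probe))

def pwl_eval(segments, x):
    """Evaluate the piecewise-linear approximation at scalar x."""
    for start, end, a, b in segments:
        if start <= x <= end:
            return a * x + b
    # Outside the fitted range, extend the nearest segment linearly.
    if x < segments[0][0]:
        return segments[0][2] * x + segments[0][3]
    return segments[-1][2] * x + segments[-1][3]

segs = build_pwl_segments(sigmoid, -8.0, 8.0, max_err=1e-3)
print(len(segs), pwl_eval(segs, 0.5), sigmoid(0.5))
```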
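For the quantized-model deployment on TVM, a generic Relay flow imports the model, applies Relay's built-in quantization pass, and compiles it for a target. The sketch below assumes an ONNX model, a placeholder file name and input signature, and a plain LLVM CPU target; it does not show the BYOC path to the CASLab accelerator described in the thesis.

```python
import onnx
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Placeholder model file and input signature.
onnx_model = onnx.load("model.onnx")
shape_dict = {"input": (1, 3, 224, 224)}

# Import the model into Relay.
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Apply Relay's built-in quantization pass (global-scale calibration here;
# the thesis may use a different quantization scheme).
with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
    qmod = relay.quantize.quantize(mod, params)

# Compile for a generic CPU target and run once as a sanity check.
lib = relay.build(qmod, target="llvm")
dev = tvm.cpu(0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("input", np.random.rand(1, 3, 224, 224).astype("float32"))
module.run()
out = module.get_output(0).numpy()
```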
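For the LUT-based activation functions, one common pattern (shown here as an assumption, not the thesis's exact mapping) is to precompute a 256-entry table indexed by the int8 input, so the activation reduces to a single table read at inference time. The quantization parameters below are illustrative placeholders.

```python
import numpy as np

def build_int8_lut(fn, in_scale, in_zp, out_scale, out_zp):
    """Precompute an int8 -> int8 lookup table for activation fn."""
    qx = np.arange(-128, 128, dtype=np.int32)
    x = (qx - in_zp) * in_scale                 # dequantize all 256 possible inputs
    y = fn(x)                                   # apply the real activation once, offline
    qy = np.clip(np.round(y / out_scale) + out_zp, -128, 127)
    return qy.astype(np.int8)                   # table indexed by (qx + 128)

def lut_activation(q_in, lut):
    """Apply the activation to an int8 tensor with one table lookup per element."""
    return lut[q_in.astype(np.int32) + 128]

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
lut = build_int8_lut(sigmoid, in_scale=0.05, in_zp=0,
                     out_scale=1.0 / 255, out_zp=-128)
q_in = np.random.randint(-128, 128, size=(4, 8), dtype=np.int8)
q_out = lut_activation(q_in, lut)
```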

Table of contents:
Abstract (Chinese), Summary (English), Acknowledgements, Table of Contents, List of Tables, List of Figures
Chapter 1  Introduction
  1.1  Motivation
  1.2  Contributions
  1.3  Thesis Organization
Chapter 2  Background and Related Work
  2.1  Activation Function
    2.1.1  Activation Function
    2.1.2  Widely Used Activation Function
  2.2  Approximation Method
    2.2.1  LUT
    2.2.2  Taylor-series approximation
    2.2.3  Piecewise linear/nonlinear approximation
  2.3  Deep Learning Compiler - Tensor Virtual Machine (TVM)
    2.3.1  Compilation Flow & Software Stack
    2.3.2  Logical Architecture Components
    2.3.3  Device/Target Interactions
    2.3.4  Bring Your Own Codegen (BYOC)
  2.4  Quantization Method
  2.5  Electronic System-Level Simulation & SystemC
Chapter 3  Design Methodology
  3.1  Approximate activation function
    3.1.1  Approximation algorithm for activation function
    3.1.2  Optimization Method based on Approximation Algorithm
  3.2  Model Quantization
  3.3  TVM
  3.4  Quantized Activation Function Hardware Implementation
Chapter 4  Experimental Results and Performance Evaluation
  4.1  Experimental Platform and Environment
  4.2  Experimental Results and Analysis
    4.2.1  Results of the Approximation Method on Activation Functions
    4.2.2  Results of the Quantized Model Deployment using TVM
    4.2.3  Results of the BiasReqAct Units
Chapter 5  Conclusion and Future Work
  5.1  Conclusion
  5.2  Future Work
References

Full text availability: on campus from 2026-07-16; off campus from 2026-07-16.
The electronic thesis has not yet been authorized for public release; for the print copy, please consult the library catalog.