| Author: | 黃瀚群 Huang, Han-Qun |
|---|---|
| Thesis Title: | CASLab-DLA之指令集架構設計與整合機器學習編譯器框架 (Integration of Machine Learning Compiler Framework with Custom Instruction Set Architecture Design for CASLab-DLA) |
| Advisor: | 陳中和 Chen, Chung-Ho |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Institute of Computer & Communication Engineering |
| Year of Publication: | 2022 |
| Graduation Academic Year: | 110 |
| Language: | Chinese |
| Pages: | 77 |
| Chinese Keywords: | Deep Learning Accelerator, Convolutional Neural Network, Electronic System-Level Design, TVM |
| Foreign Keywords: | Instruction Set Architecture, Deep Learning Accelerator, Convolutional Neural Network, ESL, TVM |
The rapid growth of artificial intelligence and machine learning in recent years has brought increasingly complex neural networks and enormous computational workloads. These applications have traditionally been handled by the large number of compute units in GPUs, which raises throughput substantially but at the cost of very high power consumption. For edge deployments, the usual solution is a dedicated ASIC, and the CASLab-DLA developed in our laboratory is designed for exactly this scenario.
This thesis designs an Instruction Set Architecture (ISA) for CASLab-DLA, implements an Instruction Generator (IG) that produces instructions conforming to this ISA, and modifies the original hardware architecture to support them. The earlier CASLab-DLA accelerated only convolutional layers; besides convolution, the new ISA also covers the fully connected layers common in convolutional neural networks and the depthwise separable convolutions used in emerging lightweight models.
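As an illustration of what an instruction generator for such an ISA does, the sketch below packs per-layer parameters into fixed-width instruction words. The opcode map, field widths, and bit layout here are hypothetical assumptions for illustration only, not the actual CASLab-DLA encoding described in the thesis:

```python
# Hypothetical sketch of a DLA instruction generator. Opcodes, field
# widths, and layout are illustrative assumptions, not the real
# CASLab-DLA ISA encoding.

OPCODES = {"conv": 0x1, "fc": 0x2, "dwconv": 0x3}  # assumed opcode map

def encode_layer(op, in_ch, out_ch, kernel, stride):
    """Pack one layer's parameters into a 32-bit instruction word:
    [31:28] opcode | [27:18] in_ch | [17:8] out_ch | [7:4] kernel | [3:0] stride
    """
    assert op in OPCODES and in_ch < 1024 and out_ch < 1024
    assert kernel < 16 and stride < 16
    return (OPCODES[op] << 28) | (in_ch << 18) | (out_ch << 8) \
           | (kernel << 4) | stride

# A tiny example "network": 3x3 conv, strided depthwise conv, then a
# fully connected layer, lowered to an instruction stream.
layers = [("conv", 3, 16, 3, 1), ("dwconv", 16, 16, 3, 2), ("fc", 256, 10, 1, 1)]
stream = [encode_layer(*layer) for layer in layers]
```

In a real generator the stream would also carry tensor addresses and tiling parameters; this sketch only shows the shape of the encoding step.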
In addition, this thesis integrates CASLab-DLA with the TVM machine learning compiler framework. Through TVM's Bring Your Own Codegen (BYOC) interface, we add a CASLab-DLA codegen and runtime library, so that at runtime the IG can be invoked to generate instructions for the accelerator to execute, completing model inference. Finally, the compiled programs are run on a RISC-V CPU emulated by QEMU, coupled with the CASLab-DLA virtual platform, for verification and performance evaluation. The results show speedups of roughly 24x, 49x, and 7x on the Yolov3-tiny, VGG16, and MobileNet models, respectively, compared with running them on the CPU alone.
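Conceptually, the BYOC flow annotates each operator in the frontend graph, merges contiguous accelerator-supported regions, and hands those regions to the custom codegen while the rest falls back to the CPU. The stand-in below mimics that partitioning logic with plain Python lists rather than TVM's actual Relay passes (`AnnotateTarget`, `MergeCompilerRegions`, `PartitionGraph`); the supported-operator set and target names are assumptions for illustration:

```python
# Conceptual sketch of BYOC-style graph partitioning, using plain
# Python instead of TVM's Relay IR. Operator and target names are
# illustrative assumptions.

DLA_SUPPORTED = {"conv2d", "depthwise_conv2d", "dense"}  # ops the accelerator ISA covers

def partition(ops):
    """Split a linear operator sequence into (target, [ops]) regions,
    merging consecutive operators that map to the same target."""
    regions = []
    for op in ops:
        target = "caslab_dla" if op in DLA_SUPPORTED else "cpu"
        if regions and regions[-1][0] == target:
            regions[-1][1].append(op)   # extend the current region
        else:
            regions.append((target, [op]))
    return regions

model = ["conv2d", "relu", "depthwise_conv2d", "dense", "softmax"]
regions = partition(model)
```

Each `caslab_dla` region would then be lowered by the custom codegen (here, via the IG) and the `cpu` regions compiled normally, which is the division of labor the BYOC interface formalizes.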
With the rapid development of deep learning models, AI accelerators need to support more operators to handle different use-case scenarios. In this thesis, we design an Instruction Set Architecture for CASLab-DLA and modify the original hardware to support its instructions. On the software side, we also design an instruction generator to produce the instruction stream. In addition, we integrate CASLab-DLA with TVM, an open-source machine learning compiler framework. With TVM, we can import models from different neural network frameworks and, through internal optimizations at different levels, deploy them to the target hardware platform to complete end-to-end model inference.
In summary, we validate and analyze the results on a QEMU-SystemC virtual platform. Performance improvements of about 24x, 49x, and 7x were observed for the Yolov3-tiny, VGG16, and MobileNet models, respectively, when CASLab-DLA assists the CPU, compared with running on the CPU alone.