| Graduate Student: | 王志瑋 Wang, Jhih-Wei |
|---|---|
| Thesis Title: | 類神經網路運算優化設計於具有TVM的CASLab-GPGPU (Computation Optimization for Neural Network on CASLab-GPGPU with TVM) |
| Advisor: | 陳中和 Chen, Chung-Ho |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of Publication: | 2021 |
| Graduation Academic Year: | 109 (2020–2021) |
| Language: | Chinese |
| Number of Pages: | 110 |
| Chinese Keywords: | general-purpose graphics processing unit (GPGPU), computation optimization, TVM, convolution, fully connected layer, machine learning, framework support |
| English Keywords: | GPGPU, Hardware-aware Optimization, TVM, AI framework, convolution, dense layer |
In recent years, advances in process technology have made it possible to bring the high computing power once confined to the cloud onto end devices, and as hardware computing power has grown, the field of artificial intelligence has developed increasingly complex deep learning models. Because of deep learning's distinctive computation patterns, computing units that handle large volumes of data efficiently, such as graphics processors, DSPs, and even ASICs (Application Specific Integrated Circuits) built for specific operations, are widely deployed in AI applications. CASLab-GPGPU, developed in our laboratory, is a General Purpose Graphics Processing Unit designed around the SIMT (Single Instruction Multiple Thread) architecture, exploiting massive parallelism to accelerate computation over large data sets. After years of effort, CASLab-GPGPU now has not only a cycle-accurate ESL model but also its own software stack for OpenCL applications, including an OpenCL runtime library and an LLVM-based OpenCL compiler, and it further supports TensorFlow applications by modifying the TensorFlow library and integrating computing libraries such as clBLAS and clDNN. However, CASLab-GPGPU currently supports only TensorFlow 0.11, and its computing libraries are not optimized for the hardware, leaving considerable room for improvement in both machine learning framework support and library optimization. This thesis integrates TVM (Tensor Virtual Machine), a deep learning compiler developed in recent years by a research team at the University of Washington, into the CASLab-GPGPU system, extending machine learning framework support and applying software-level optimization to deep learning inference.
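To illustrate the framework-support path described above, the sketch below shows how a model from an external framework could enter such a system through TVM's Relay frontend and be compiled for an OpenCL backend. This is an illustrative sketch rather than the thesis's actual flow; the model file name, input name, and input shape are hypothetical.

```python
# Illustrative sketch: importing an ONNX model through TVM's Relay
# frontend and compiling it for an OpenCL backend. The file name,
# input name, and shape are hypothetical; "opencl" stands in for a
# target whose kernels a CASLab-GPGPU OpenCL runtime could execute.
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("lenet5.onnx")        # hypothetical model file
shape_dict = {"input": (1, 1, 28, 28)}       # LeNet-5-style input shape
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Compile the whole graph into OpenCL kernels plus host glue code.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="opencl", params=params)
```

Because Relay provides frontends for TensorFlow, PyTorch, ONNX, MXNet, and others, a single OpenCL code path can serve all of them, which is what lifts the system past its earlier TensorFlow-only support.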
Among recent deep learning applications, CNNs (Convolutional Neural Networks), which grew out of image processing, have advanced rapidly, and within a CNN the convolution operation is one of the most important computations. The fully connected layer, in turn, has a place in virtually every deep learning application. Beyond integrating TVM into the CASLab-GPGPU system, this thesis therefore uses TVM's optimization tools to design hardware-aware optimization schemes for convolution and fully connected layers on CASLab-GPGPU, extending machine learning framework support while achieving up to a 13x performance improvement for a single operation.
With the growth of manufacturing technology over the past decade, computing units can now drive complex computations with better performance than ever before, and AI applications such as deep learning have grown with this trend. As a result, devices such as GPUs and DSPs, along with ASICs for specific NN operations, are commonly used in this field. CASLab-GPGPU is a general-purpose graphics processing unit developed by CASLab since 2015; it is designed in the SIMT fashion and optimized for heavy computation. To achieve better performance and more efficient computation, hardware-aware optimization in computing libraries plays an important role. In the CASLab-GPGPU system, TensorFlow [25] is supported by modifying its library and integrating clBLAS [4]. However, clBLAS is not well optimized for CASLab-GPGPU or NN computations, which leaves considerable room for improvement in the software stack. ML framework supportability is also critical for supporting different kinds of models. In this thesis, TVM [29], a deep learning compiler, is introduced and ported to the CASLab-GPGPU system; with TVM included, various ML frameworks are now supported. Moreover, TVM provides schedule primitives that help engineers optimize computations for their target backend, so hardware-aware optimization for convolutional and dense layers is also proposed for CASLab-GPGPU, resulting in up to a 13x speedup for a single layer. LeNet-5 is deployed on CASLab-GPGPU with a 2.6x speedup.
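As a concrete illustration of the schedule primitives mentioned above, the following is a minimal sketch, not the thesis's actual schedule, of tiling a dense layer's matrix multiply with TVM's te API and binding the tiles to SIMT block and thread axes, as one would when targeting an OpenCL device such as CASLab-GPGPU. The shapes and tile factor are illustrative.

```python
# A minimal sketch (not the thesis's actual schedule) of TVM's
# schedule primitives: tiling a dense layer's matrix multiply and
# binding the tiles to SIMT block/thread axes for an OpenCL target.
# Shapes and the tile factor are illustrative.
import tvm
from tvm import te

M, K, N = 1024, 1024, 1024
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k),
               name="C")

s = te.create_schedule(C.op)
# Split each output axis into a work-group tile and a work-item tile.
xo, xi = s[C].split(C.op.axis[0], factor=32)
yo, yi = s[C].split(C.op.axis[1], factor=32)
s[C].reorder(xo, yo, xi, yi)
# Bind the outer tiles to OpenCL work-groups, the inner to work-items.
s[C].bind(xo, te.thread_axis("blockIdx.x"))
s[C].bind(yo, te.thread_axis("blockIdx.y"))
s[C].bind(xi, te.thread_axis("threadIdx.x"))
s[C].bind(yi, te.thread_axis("threadIdx.y"))

# Inspect the lowered IR; tvm.build(s, [A, B, C], target="opencl")
# would emit the OpenCL kernel, given a TVM build with OpenCL enabled.
print(tvm.lower(s, [A, B, C], simple_mode=True))
```

Primitives such as split, reorder, and bind change only how the loop nest is executed, never what it computes, which is what makes hardware-aware tuning of this kind safe to explore layer by layer.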
[1] “Auto Scheduler” [online], available:
https://tvm.apache.org/2021/03/03/intro-auto-scheduler
[2] “Caffe Official Website” [online], available:
https://caffe.berkeleyvision.org
[3] “Caffe2 Official Website” [online], available:
https://caffe2.ai
[4] “clBLAS github” [online], available:
https://github.com/clMathLibraries/clBLAS
[5] “Core ML Official Website” [online], available:
https://developer.apple.com/documentation/coreml
[6] “d2ltvm” [online], available:
https://tvm.d2l.ai
[7] “Darknet Official Website” [online], available:
https://pjreddie.com
[8] “Dive Into Deep Learning Compiler, Operator Optimizations on GPUs, 5 Convolution” [online], available:
http://tvm.d2l.ai/chapter_gpu_schedules/conv.html
[9] Dun-Jie Chen, “LLVM-based OpenCL Compiler for CASLab-GPU”, Master’s thesis, National Cheng Kung University, Tainan, Taiwan, 2019.
[10] Feng-Ming Hsu, “Tensor Process Unit (TPU) design and TPU APIs implementation for CASLab-GPU”, Master’s thesis, National Cheng Kung University, Tainan, Taiwan, 2021.
[11] “HSA Technologies” [online], available:
http://www.hsafoundation.com/
[12] “Keras Official Website” [online], available:
https://keras.io
[13] “LLVM” [online], available:
https://llvm.org/
[14] Mark Harris, “Optimizing Parallel Reduction in CUDA”, NVIDIA Developer Technology.
[15] “MXNet Official Website” [online], available:
https://mxnet.apache.org/versions/1.8.0/
[16] NVIDIA Corporation Technical Staff, NVIDIA CUDA Programming Guide 2.2, NVIDIA Corporation, 2009.
[17] “ONNX Official Website” [online], available:
https://onnx.ai
[18] “OpenCL Official Website” [online], available:
https://www.khronos.org/opencl/
[19] “OpenMP API User’s Guide, Chapter 4: Nested Parallelism” [online], available:
https://docs.oracle.com/cd/E19205-01/819-5270/aewbc/index.html
[20] “OpenMP Official Website” [online], available:
https://www.openmp.org
[21] “PyTorch Official Website” [online], available:
https://pytorch.org
[22] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, “Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines”, in Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’13), New York, NY, USA, 2013, pp. 519–530.
[23] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer, “cuDNN: Efficient Primitives for Deep Learning”, arXiv preprint, 2014.
[24] “TensorFlow-Lite Official Website” [online], available:
https://www.tensorflow.org/lite
[25] “TensorFlow Official Website” [online], available:
https://www.tensorflow.org
[26] “TF-Coriander” [online], available:
https://github.com/hughperkins/tf-coriander
[27] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy, “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning”, in Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’18), 2018.
[28] “TVM Docs: Design and Architecture” [online], available:
https://tvm.apache.org/docs/dev/index.html
[29] “TVM Official Website”, [online], available:
https://tvm.apache.org
[30] “Welcome to AMD ROCm™ Platform”, [online], available:
https://rocmdocs.amd.com/en/latest/
[31] Yu-Xiang Su, “Porting TensorFlow to CASLAB-GPUSIM and Optimization of Matrix Multiplication Library”, Master’s thesis, National Cheng Kung University, Tainan, Taiwan, 2018.
[32] Yu-Hsiang Wang, “CASLAB-GPU Verification on FPGA and Optimization of Warp Scheduling and Memory Subsystem”, Master’s thesis, National Cheng Kung University, Tainan, Taiwan, 2021.