| Graduate Student: | 王志瑋 Wang, Jhih-Wei |
|---|---|
| Thesis Title: | 類神經網路運算優化設計於具有TVM的CASLab-GPGPU (Computation Optimization for Neural Network on CASLab-GPGPU with TVM) |
| Advisor: | 陳中和 Chen, Chung-Ho |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of Publication: | 2021 |
| Graduation Academic Year: | 109 (2020–2021) |
| Language: | Chinese |
| Number of Pages: | 110 |
| Chinese Keywords: | general-purpose graphics processing unit (GPGPU), computation optimization, TVM, convolution, fully connected layer, machine learning, framework support |
| English Keywords: | GPGPU, Hardware-aware Optimization, TVM, AI framework, convolution, dense layer |
In recent years, advances in process technology have made it possible to bring the high computing power once confined to the cloud onto end devices, and as hardware computing power has grown, the field of artificial intelligence has developed increasingly complex deep learning models. Because of deep learning's distinctive computation patterns, computing units that handle large volumes of data efficiently, such as graphics processors, DSPs, and even ASICs (Application Specific Integrated Circuits) built for specific operations, are widely deployed in AI applications. CASLab-GPGPU, developed in our laboratory, is a General Purpose Graphics Processing Unit designed around the SIMT (Single Instruction Multiple Thread) architecture, exploiting massive parallelism to accelerate computation over large data sets. After years of effort, CASLab-GPGPU now has not only a cycle-accurate ESL model but also its own software stack for OpenCL applications, including an OpenCL runtime library and an LLVM-based OpenCL compiler, and it further supports TensorFlow applications by modifying the TensorFlow library and integrating computing libraries such as clBLAS and clDNN. However, CASLab-GPGPU currently supports only TensorFlow 0.11, and its computing libraries are not optimized for the hardware, leaving considerable room for improvement in both machine learning framework support and library optimization. This thesis integrates TVM (Tensor Virtual Machine), a deep learning compiler developed in recent years by a research team at the University of Washington, into the CASLab-GPGPU system, extending machine learning framework support and applying software-level optimization to deep learning inference.
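To illustrate the framework-support path described above, the sketch below shows how a model from an external framework could enter such a system through TVM's Relay frontend and be compiled for an OpenCL backend. This is an illustrative sketch rather than the thesis's actual flow; the model file name, input name, and input shape are hypothetical.

```python
# Illustrative sketch: importing an ONNX model through TVM's Relay
# frontend and compiling it for an OpenCL backend. The file name,
# input name, and shape are hypothetical; "opencl" stands in for a
# target whose kernels a CASLab-GPGPU OpenCL runtime could execute.
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("lenet5.onnx")        # hypothetical model file
shape_dict = {"input": (1, 1, 28, 28)}       # LeNet-5-style input shape
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Compile the whole graph into OpenCL kernels plus host glue code.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="opencl", params=params)
```

Because Relay provides frontends for TensorFlow, PyTorch, ONNX, MXNet, and others, a single OpenCL code path can serve all of them, which is what lifts the system past its earlier TensorFlow-only support.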
Among recent deep learning applications, CNNs (Convolutional Neural Networks), which grew out of image processing, have advanced rapidly, and within a CNN the convolution operation is one of the most important computations. The fully connected layer, in turn, has a place in virtually every deep learning application. Beyond integrating TVM into the CASLab-GPGPU system, this thesis therefore uses TVM's optimization tools to design hardware-aware optimization schemes for convolution and fully connected layers on CASLab-GPGPU, extending machine learning framework support while achieving up to a 13x performance improvement for a single operation.
With the growth of manufacturing technology over the past decade, computing units can now drive complex computations with better performance than ever before, and AI applications such as deep learning have grown with this trend. As a result, devices such as GPUs and DSPs, along with ASICs for specific NN operations, are commonly used in this field. CASLab-GPGPU is a general-purpose graphics processing unit developed by CASLab since 2015; it is designed in the SIMT fashion and optimized for heavy computation. To achieve better performance and more efficient computation, hardware-aware optimization in computing libraries plays an important role. In the CASLab-GPGPU system, TensorFlow [25] is supported by modifying its library and integrating clBLAS [4]. However, clBLAS is not well optimized for CASLab-GPGPU or NN computations, which leaves considerable room for improvement in the software stack. ML framework supportability is also critical for supporting different kinds of models. In this thesis, TVM [29], a deep learning compiler, is introduced and ported to the CASLab-GPGPU system; with TVM included, various ML frameworks are now supported. Moreover, TVM provides schedule primitives that help engineers optimize computations for their target backend, so hardware-aware optimization for convolutional and dense layers is also proposed for CASLab-GPGPU, resulting in up to a 13x speedup for a single layer. LeNet-5 is deployed on CASLab-GPGPU with a 2.6x speedup.
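As a concrete illustration of the schedule primitives mentioned above, the following is a minimal sketch, not the thesis's actual schedule, of tiling a dense layer's matrix multiply with TVM's te API and binding the tiles to SIMT block and thread axes, as one would when targeting an OpenCL device such as CASLab-GPGPU. The shapes and tile factor are illustrative.

```python
# A minimal sketch (not the thesis's actual schedule) of TVM's
# schedule primitives: tiling a dense layer's matrix multiply and
# binding the tiles to SIMT block/thread axes for an OpenCL target.
# Shapes and the tile factor are illustrative.
import tvm
from tvm import te

M, K, N = 1024, 1024, 1024
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k),
               name="C")

s = te.create_schedule(C.op)
# Split each output axis into a work-group tile and a work-item tile.
xo, xi = s[C].split(C.op.axis[0], factor=32)
yo, yi = s[C].split(C.op.axis[1], factor=32)
s[C].reorder(xo, yo, xi, yi)
# Bind the outer tiles to OpenCL work-groups, the inner to work-items.
s[C].bind(xo, te.thread_axis("blockIdx.x"))
s[C].bind(yo, te.thread_axis("blockIdx.y"))
s[C].bind(xi, te.thread_axis("threadIdx.x"))
s[C].bind(yi, te.thread_axis("threadIdx.y"))

# Inspect the lowered IR; tvm.build(s, [A, B, C], target="opencl")
# would emit the OpenCL kernel, given a TVM build with OpenCL enabled.
print(tvm.lower(s, [A, B, C], simple_mode=True))
```

Primitives such as split, reorder, and bind change only how the loop nest is executed, never what it computes, which is what makes hardware-aware tuning of this kind safe to explore layer by layer.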
[1] “Auto Scheduler” [online], available:
https://tvm.apache.org/2021/03/03/intro-auto-scheduler
[2] “Caffe Official Website” [online], available:
https://caffe.berkeleyvision.org
[3] “Caffe2 Official Website” [online], available:
https://caffe2.ai
[4] “clBLAS github” [online], available:
https://github.com/clMathLibraries/clBLAS
[5] “Core ML Official Website” [online], available:
https://developer.apple.com/documentation/coreml
[6] “d2ltvm” [online], available:
https://tvm.d2l.ai
[7] “Darknet Official Website” [online], available:
https://pjreddie.com
[8] “Dive Into Deep Learning Compiler, Operator Optimizations on GPUs, 5 Convolution” [online], available:
http://tvm.d2l.ai/chapter_gpu_schedules/conv.html
[9] Dun-Jie Chen, “LLVM-based OpenCL Compiler for CASLab-GPU”, Master’s thesis, National Cheng Kung University, Tainan, Taiwan, 2019.
[10] Feng-Ming Hsu, “Tensor Process Unit (TPU) design and TPU APIs implementation for CASLab-GPU”, Master’s thesis, National Cheng Kung University, Tainan, Taiwan, 2021.
[11] “HSA Technologies” [online], available:
http://www.hsafoundation.com/
[12] “Keras Official Website” [online], available:
https://keras.io
[13] “LLVM” [online], available:
https://llvm.org/
[14] Mark Harris, “Optimizing Parallel Reduction in CUDA”, NVIDIA Developer Technology.
[15] “MXNet Official Website” [online], available:
https://mxnet.apache.org/versions/1.8.0/
[16] NVIDIA Corporation Technical Staff, NVIDIA CUDA Programming Guide 2.2, NVIDIA Corporation, 2009.
[17] “ONNX Official Website” [online], available:
https://onnx.ai
[18] “OpenCL Official Website” [online], available:
https://www.khronos.org/opencl/
[19] “OpenMP API User’s Guide, Chapter 4: Nested Parallelism” [online], available:
https://docs.oracle.com/cd/E19205-01/819-5270/aewbc/index.html
[20] “OpenMP Official Website” [online], available:
https://www.openmp.org
[21] “PyTorch Official Website” [online], available:
https://pytorch.org
[22] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, “Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines”, in Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’13), New York, NY, USA, 2013, pp. 519–530.
[23] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer, “cuDNN: Efficient Primitives for Deep Learning”, arXiv preprint, 2014.
[24] “TensorFlow-Lite Official Website” [online], available:
https://www.tensorflow.org/lite
[25] “TensorFlow Official Website” [online], available:
https://www.tensorflow.org
[26] “TF-Coriander” [online], available:
https://github.com/hughperkins/tf-coriander
[27] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy, “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning”, in Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’18), 2018.
[28] “TVM Docs: Design and Architecture” [online], available:
https://tvm.apache.org/docs/dev/index.html
[29] “TVM Official Website”, [online], available:
https://tvm.apache.org
[30] “Welcome to AMD ROCm™ Platform”, [online], available:
https://rocmdocs.amd.com/en/latest/
[31] Yu-Xiang Su, “Porting TensorFlow to CASLAB-GPUSIM and Optimization of Matrix Multiplication Library”, Master’s thesis, National Cheng Kung University, Tainan, Taiwan, 2018.
[32] Yu-Hsiang Wang, “CASLAB-GPU Verification on FPGA and Optimization of Warp Scheduling and Memory Subsystem”, Master’s thesis, National Cheng Kung University, Tainan, Taiwan, 2021.