| Graduate student: | 林聖堯 Lin, Sheng-Yao |
|---|---|
| Thesis title: | 在CASLab-GPU with Tensor Core上優化卷積運算 Optimizing Convolution Computing on CASLab-GPU with Tensor Core |
| Advisor: | 陳中和 Chen, Chung-Ho |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of publication: | 2022 |
| Academic year of graduation: | 111 |
| Language: | Chinese |
| Number of pages: | 102 |
| Chinese keywords: | 通用型繪圖處理器、張量處理器、卷積運算 |
| English keywords: | GPGPU, Tensor Processing Unit, Convolution |
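With the rapid development of machine learning, artificial intelligence, and the Internet of Things (IoT), and with data security and real-time applications drawing growing public attention, demand for applications that need high computing power is gradually spreading from the cloud to edge devices closer to the user. Based on the SIMT (Single Instruction, Multiple Threads) architecture, we developed CASLab-GPU, supported by a software stack that includes OpenCL, TensorFlow, and TVM, to accelerate this kind of highly parallel general-purpose computation.
In most neural networks, convolution accounts for the largest share of execution time. To speed up convolution on CASLab-GPU, we previously added a Tensor Processing Unit (TPU) (Tensor Core) to the original architecture to accelerate convolutions computed with the Im2col + Gemm method. However, when Im2col unfolds the feature map into a matrix, the same data is copied many times (duplicated data), which increases external memory accesses, so performance has consistently fallen short of expectations.
This thesis analyzes the hardware architecture and data flow of the existing TPU and proposes a TPU optimization scheme for convolution. Based on an analysis of the TPU's basic compute capability, it then develops a convolution software library with the TVM framework to further improve the execution efficiency of convolution on CASLab-GPU.
Keywords: GPGPU, Tensor Processing Unit, Convolution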
As machine learning, artificial intelligence, and the Internet of Things (IoT) advance rapidly, and as data security and real-time applications become increasingly important, demand for applications that require high computing power is extending from the cloud to edge devices that are closer to the user. We have developed CASLab-GPU based on the SIMT (Single Instruction, Multiple Threads) architecture, with support from software stacks such as OpenCL [38], TensorFlow [5], and TVM [35], to accelerate this kind of highly parallel general-purpose computation.
In most neural networks, convolution operations account for the majority of execution time. To improve the speed of convolution operations on the CASLab-GPU, we previously added a Tensor Processing Unit (TPU) (Tensor Core) to the original architecture to accelerate convolution calculations using the Im2col + Gemm [8] method. However, the duplication of data that occurs when the input feature map is expanded into a matrix using Im2col results in increased external memory access and less than ideal performance.
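To make the data duplication concrete, the following is a minimal sketch of Im2col + Gemm in plain Python. It is an illustration only, not the CASLab-GPU or TPU implementation: the helper names (`im2col`, `gemm`, `conv2d_direct`), the 4x4 input, the 3x3 kernel, and the stride-1, no-padding setting are all assumptions chosen for brevity.

```python
def im2col(fmap, k):
    """Unfold a 2-D feature map into a matrix with one k*k patch per row
    (stride 1, no padding; assumed for this sketch)."""
    h, w = len(fmap), len(fmap[0])
    rows = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows.append([fmap[i + di][j + dj]
                         for di in range(k) for dj in range(k)])
    return rows


def gemm(a, b):
    """Plain matrix multiply of an (m x n) matrix by an (n x p) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def conv2d_direct(fmap, kernel):
    """Reference sliding-window convolution (cross-correlation, as is
    conventional in deep learning frameworks)."""
    k = len(kernel)
    h, w = len(fmap), len(fmap[0])
    return [[sum(fmap[i + di][j + dj] * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(w - k + 1)]
            for i in range(h - k + 1)]


# Hypothetical 4x4 feature map and 3x3 kernel.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

cols = im2col(fmap, 3)                              # 4 patches x 9 values
flat_kernel = [[v] for row in kernel for v in row]  # 9 x 1 column vector
out = gemm(cols, flat_kernel)                       # 4 x 1 result
out_2d = [[out[0][0], out[1][0]], [out[2][0], out[3][0]]]

# Ratio of values stored after Im2col to values in the original input.
duplication = (len(cols) * len(cols[0])) / (len(fmap) * len(fmap[0]))
print(out_2d == conv2d_direct(fmap, kernel))  # True
print(duplication)                            # 2.25
```

In this small case the unfolded matrix holds 36 values for a 16-value input, so every input element is stored 2.25 times on average. For a stride-1 k x k kernel on a large feature map the ratio approaches k*k, which is the source of the extra external memory traffic described above.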
In this thesis, we analyze the hardware architecture and data flow of the existing TPU and propose a TPU optimization scheme for convolution operations. Based on an analysis of the TPU's basic compute capability, we then use the TVM framework to develop a convolution library that further improves the execution efficiency of convolution on CASLab-GPU.
Keywords: GPGPU, Tensor Processing Unit, Convolution
[1] 	"NVIDIA Turing GPU Architecture," [Online]. Available: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf.
[2] 	"Nvidia Tensor Cores website," [Online]. Available: https://developer.nvidia.com/tensor-cores.
[3] 	"CUDA Introduction," [Online]. Available: https://developer.nvidia.com/cuda-zone.
[4] 	S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient Primitives for Deep Learning," arXiv preprint arXiv:1410.0759, 2014.
[5] 	"TensorFlow Official Website," [Online]. Available: https://www.tensorflow.org.
[6] 	"TVM Official Website," [Online]. Available: https://tvm.apache.org.
[7] 	Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy, "TVM: An Automated End-to-End Optimizing Compiler for Deep Learning," in OSDI '18, 2018.
[8] 	"im2col+gemm," [Online]. Available: https://blog.csdn.net/u013701860/article/details/124688668.
[9] 	B. Sander, "HSAIL: Portable compiler IR for HSA," IEEE Hot Chips 25 Symposium (HCS), pp. 1-32, 2013. 
[10] 	"Static random-access memory," [Online]. Available: https://en.wikipedia.org/wiki/Static_random-access_memory.
[11] 	"Divergent Branch," [Online]. Available: https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/sourcelevel/divergentbranch.htm.
[12] 	"Pipeline Hazard," [Online]. Available: https://king0980692.medium.com/computer-architecture-cheat-sheet-pipeline-hazard-ee27d0d66e89.
[13] 	S. Sandokji, F. Eassa, and M. Fadel, "A survey of techniques for warp scheduling in GPUs," IEEE Seventh International Conference on Intelligent Computing and Information Systems (ICICIS), pp. 600-606, 2015.
[14] 	S.-G. Chen, J.-C. Lee, and C.-C. Li, "New Systolic Arrays for Matrix Multiplication," in 1994 International Conference on Parallel Processing, Vol. 2, North Carolina, USA, 1994.
[15] 	"Using CUDA Warp-Level Primitives," [Online]. Available: https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/.
[16] 	F.-M. Hsu, "Tensor Process Unit (TPU) design and TPU APIs implementation for CASLab-GPU," M.S. thesis, National Cheng Kung University, Tainan, Taiwan, 2021.
[17] 	"Intel® Distribution of OpenVINO™ Toolkit," [Online]. Available: https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html.
[18] 	"ARM NN SDK," [Online]. Available: https://www.arm.com/zh-TW/products/silicon-ip-cpu/ethos/arm-nn.
[19] 	D.-J. Chen, "LLVM-based OpenCL Compiler for CASLab-GPU," M.S. thesis, National Cheng Kung University, Tainan, Taiwan, 2019.
[20] 	H. Perkins, "CUDA-on-CL: a compiler and runtime for running NVIDIA® CUDA™ C++ 11 applications on OpenCL™ 1.2 Devices," in Proceedings of the 5th International Workshop on OpenCL.
[21] 	"TF-Coriander," [Online]. Available: https://github.com/hughperkins/tf-coriander.
[22] 	C. Nugteren, "CLBlast: A Tuned OpenCL BLAS Library," arXiv preprint arXiv:1705.05249, 2017.
[23] 	"cuBLAS," [Online]. Available: https://docs.nvidia.com/cuda/cublas/index.html.
[24] 	"clBLAS," [Online]. Available: https://github.com/clMathLibraries/clBLAS.
[25] 	J.-W. Wang, "Computation Optimization for Neural Network on CASLab-GPGPU with TVM," M.S. thesis, National Cheng Kung University, Tainan, Taiwan, 2021.
[26] 	"LLVM," [Online]. Available: https://llvm.org/.
[27] 	"Clang: a C language family frontend for LLVM," [Online]. Available: https://clang.llvm.org.
[28] 	"Keras," [Online]. Available: https://keras.io.
[29] 	"MXnet," [Online]. Available: https://mxnet.apache.org/versions/1.9.1/.
[30] 	"Pytorch," [Online]. Available: https://pytorch.org.
[31] 	"Intel® Advanced Vector Extensions 512," [Online]. Available: https://www.intel.com.tw/content/www/tw/zh/architecture-and-technology/avx-512-overview.html.
[32] 	"KL530," [Online]. Available: https://www.kneron.com/tw/news/blog/141/.
[33] 	"Introduction to Relay IR," [Online]. Available: https://tvm.apache.org/docs/arch/relay_intro.html.
[34] 	"NumPy," [Online]. Available: https://numpy.org.
[35] 	"TVM User Tutorial," [Online]. Available: https://tvm.apache.org/docs/tutorial/index.html.
[36] 	"transform.py," [Online]. Available: https://tvm.apache.org/docs/reference/api/python/relay/transform.html.
[37] 	"apache / tvm," [Online]. Available: https://github.com/apache/tvm/tree/main/python/tvm/ir.
[38] 	"OpenCL Official Website," [Online]. Available: https://www.khronos.org/opencl/.
[39] 	"HSA Technologies," [Online]. Available: http://www.hsafoundation.com/.