
Graduate Student: Lin, Sheng-Yao (林聖堯)
Thesis Title: Optimizing Convolution Computing on CASLab-GPU with Tensor Core
Advisor: Chen, Chung-Ho (陳中和)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2022
Graduation Academic Year: 111
Language: Chinese
Number of Pages: 102
Keywords: GPGPU, Tensor Processing Unit, Convolution

    As machine learning, artificial intelligence, and the Internet of Things (IoT) advance rapidly, and as data security and real-time applications draw growing attention, demand for such compute-intensive applications is extending from the cloud to edge devices closer to the user. We have developed the CASLab-GPU, based on the SIMT (Single Instruction, Multiple Threads) architecture and supported by software stacks such as OpenCL [38], TensorFlow [5], and TVM [35], to accelerate this kind of highly parallel general-purpose computation.
    In most neural networks, convolution accounts for the majority of execution time. To speed up convolution on the CASLab-GPU, we previously added a Tensor Processing Unit (TPU, i.e., a Tensor Core) to the original architecture to accelerate convolutions computed with the Im2col + GEMM [8] method. However, Im2col duplicates many of the same elements when it expands the input feature map into a matrix, which increases external memory traffic and has kept performance below expectations.
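    To make the duplication concrete, here is a minimal NumPy sketch (not code from the thesis; the helper name im2col and the toy 4x4 input / 3x3 kernel shapes are ours for illustration) showing how overlapping receptive fields copy the same input elements into multiple matrix rows:

        import numpy as np

        def im2col(x, kh, kw, stride=1):
            # Expand a single-channel feature map (H, W) into a matrix with one
            # row per output pixel; each row is a flattened kh x kw receptive field.
            H, W = x.shape
            out_h = (H - kh) // stride + 1
            out_w = (W - kw) // stride + 1
            cols = np.empty((out_h * out_w, kh * kw), dtype=x.dtype)
            for i in range(out_h):
                for j in range(out_w):
                    patch = x[i*stride : i*stride+kh, j*stride : j*stride+kw]
                    cols[i * out_w + j] = patch.ravel()
            return cols

        x = np.arange(16, dtype=np.float32).reshape(4, 4)  # 4x4 feature map
        w = np.ones((3, 3), dtype=np.float32)              # 3x3 kernel
        cols = im2col(x, 3, 3)
        y = cols @ w.ravel()      # convolution reduced to one matrix product
        print(x.size, cols.size)  # 16 vs. 36: 2.25x more data after expansion

    The convolution collapses to a single matrix product at the cost of a 2.25x larger input for this toy shape; for 3x3 kernels with stride 1 on large feature maps the expansion approaches 9x, which is the duplicated-data overhead the abstract attributes to the Im2col step.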
    In this thesis, we analyze the hardware architecture and data flow of the existing TPU and propose a TPU optimization scheme targeted at convolution. Guided by an analysis of the TPU's baseline compute capability, we then use the TVM framework to develop a convolution software library, further improving the execution efficiency of convolution on the CASLab-GPU.
    Keywords: GPGPU, Tensor Processing Unit, Convolution
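    As a rough illustration of the TVM-based library direction described in the abstract, the following sketch (our own, not code from the thesis; the shapes, tile factors, and the fp16-multiply/fp32-accumulate choice are assumptions modeled on typical Tensor Core usage) expresses the GEMM at the core of the Im2col + GEMM method in TVM's tensor expression language and tiles it, tile size being exactly the kind of task-partitioning parameter a search algorithm can sweep:

        import tvm
        from tvm import te

        # GEMM C[M, N] = A[M, K] x B[K, N], the computational core of Im2col + GEMM.
        M, N, K = 256, 256, 256
        A = te.placeholder((M, K), name="A", dtype="float16")
        B = te.placeholder((K, N), name="B", dtype="float16")
        k = te.reduce_axis((0, K), name="k")
        C = te.compute(
            (M, N),
            lambda i, j: te.sum(A[i, k].astype("float32") * B[k, j].astype("float32"), axis=k),
            name="C",
        )

        # Tile the output into 16x16 blocks, e.g., one block per workgroup.
        s = te.create_schedule(C.op)
        io, jo, ii, ji = s[C].tile(C.op.axis[0], C.op.axis[1], x_factor=16, y_factor=16)
        print(tvm.lower(s, [A, B, C], simple_mode=True))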

    Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1: Introduction
      1.1 Motivation
      1.2 Contributions
      1.3 Organization of the Thesis
    Chapter 2: Background and Related Work
      2.1 CASLab-GPU
        2.1.1 CASLab-GPU Microarchitecture and Memory Hierarchy
        2.1.2 Streaming Multiprocessor (SM)
        2.1.3 Workgroup Initializer
        2.1.4 Instruction Buffer
        2.1.5 Dependency Check Unit
        2.1.6 Warp Scheduler
        2.1.7 Execution Unit & Load Store Unit
        2.1.8 Tensor Processing Unit (TPU)
      2.2 CASLab-GPU Software Stack
        2.2.1 CASLab Tensorflow
        2.2.2 CLBlast Library
        2.2.3 OpenCL Programming Model
        2.2.4 CASLab LLVM Compiler
      2.3 TVM
        2.3.1 TVM Compilation Flow
        2.3.2 TVM Framework
        2.3.3 Graph-Level Optimization
        2.3.4 Generating Tensor Operator & Tensor-Level Optimization
        2.3.5 Porting TVM to CASLab-GPU Software Stack
      2.4 Convolution Computing Optimization with CASLab-GPU
        2.4.1 Version 1: Using Tensorflow Framework with tpu Instruction
        2.4.2 Version 2: Using TVM Framework without tpu Instruction
    Chapter 3: Problem Analysis and Design Methodology
      3.1 Limitations of Executing Convolution on the Tensor Core
      3.2 TPU Architecture Optimizations for Convolution
      3.3 Deploying Convolution to CASLab-GPU with Tensor Core via the TVM Framework
        3.3.1 Impact of Task Partitioning and Distribution on Performance
        3.3.2 Partitioning and Distributing the Convolution
        3.3.3 Task Partitioning and Distribution from a Code Perspective
      3.4 Parameter Search Algorithm for Convolution on CASLab-GPU with Tensor Core
        3.4.1 Parameter Search Algorithm
        3.4.2 Case Study
    Chapter 4: Experimental Results and Performance Evaluation
      4.1 Simulation Platform Configuration and Time Measurement
      4.2 Baseline Compute Capability Analysis of CASLab-GPU with Tensor Core
        4.2.1 Analyzing tpu Instruction Execution Efficiency
        4.2.2 Analyzing tpu Instruction Execution Efficiency under Resource Constraints
        4.2.3 Determining the Parameter Search Algorithm's kmin
        4.2.4 Determining the Parameter Search Algorithm's Tmin
      4.3 Convolution Execution Time Comparison
    Chapter 5: Conclusion
    References

    [1] "NVIDIA TURING GPU ARCHITECTURE," [Online]. Available: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf.
    [2] "Nvidia Tensor Cores website," [Online]. Available: https://developer.nvidia.com/tensor-cores.
    [3] "CUDA Zone," [Online]. Available: https://developer.nvidia.com/cuda-zone.
    [4] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient Primitives for Deep Learning," arXiv preprint arXiv:1410.0759, 2014.
    [5] "TensorFlow Official Website," [Online]. Available: https://www.tensorflow.org.
    [6] "TVM Official Website," [Online]. Available: https://tvm.apache.org.
    [7] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, M. Cowan, H. Shen, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, "TVM: An Automated End-to-End Optimizing Compiler for Deep Learning," in Proceedings of OSDI '18, 2018.
    [8] "im2col+gemm," [Online]. Available: https://blog.csdn.net/u013701860/article/details/124688668.
    [9] B. Sander, "HSAIL: Portable compiler IR for HSA," IEEE Hot Chips 25 Symposium (HCS), pp. 1-32, 2013.
    [10] "Static random-access memory," [Online]. Available: https://en.wikipedia.org/wiki/Static_random-access_memory.
    [11] "Divergent Branch," [Online]. Available: https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/sourcelevel/divergentbranch.htm.
    [12] "Pipeline Hazard," [Online]. Available: https://king0980692.medium.com/computer-architecture-cheat-sheet-pipeline-hazard-ee27d0d66e89.
    [13] S. Sandokji, F. Eassa, and M. Fadel, "A survey of techniques for warp scheduling in GPUs," in IEEE Seventh International Conference on Intelligent Computing and Information Systems (ICICIS), pp. 600-606, 2015.
    [14] S.-G. Chen et al., "New Systolic Arrays for Matrix Multiplication," in 1994 International Conference on Parallel Processing, Vol. 2, North Carolina, USA, 1994.
    [15] "Using CUDA Warp-Level Primitives," [Online]. Available: https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/.
    [16] F.-M. Hsu, "Tensor Process Unit (TPU) design and TPU APIs implementation for CASLab-GPU, the thesis for Master of Science," National Cheng Kung University, Tainan, Taiwan, 2021.
    [17] "Intel® Distribution of OpenVINO™ Toolkit," [Online]. Available: https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html.
    [18] "ARM NN SDK," [Online]. Available: https://www.arm.com/zh-TW/products/silicon-ip-cpu/ethos/arm-nn.
    [19] D.-J. Chen, "LLVM-based OpenCL Compiler for CASLab-GPU, the thesis for Master of Science," National Cheng Kung University, Tainan, Taiwan, 2019.
    [20] H. Perkins, "CUDA-on-CL: a compiler and runtime for running NVIDIA® CUDA™ C++ 11 applications on OpenCL™ 1.2 Devices," in Proceedings of the 5th International Workshop on OpenCL (IWOCL), 2017.
    [21] "TF-Coriander," [Online]. Available: https://github.com/hughperkins/tf-coriander.
    [22] C. Nugteren, "CLBlast: A Tuned OpenCL BLAS Library," arXiv preprint arXiv:1705.05249, 2017.
    [23] "cuBLAS," [Online]. Available: https://docs.nvidia.com/cuda/cublas/index.html.
    [24] "clBLAS," [Online]. Available: https://github.com/clMathLibraries/clBLAS.
    [25] J.-W. Wang, "Computation Optimization for Neural Network on CASLab-GPGPU with TVM, the thesis for Master of Science," National Cheng Kung University, Tainan, Taiwan, 2021.
    [26] "LLVM," [Online]. Available: https://llvm.org/.
    [27] "Clang: a C language family frontend for LLVM," [Online]. Available: https://clang.llvm.org.
    [28] "Keras," [Online]. Available: https://keras.io.
    [29] "MXnet," [Online]. Available: https://mxnet.apache.org/versions/1.9.1/.
    [30] "Pytorch," [Online]. Available: https://pytorch.org.
    [31] "Intel® Advanced Vector Extensions 512)," [Online]. Available: https://www.intel.com.tw/content/www/tw/zh/architecture-and-technology/avx-512-overview.html.
    [32] "KL530," [Online]. Available: https://www.kneron.com/tw/news/blog/141/.
    [33] "Introduction to Relay IR," [Online]. Available: https://tvm.apache.org/docs/arch/relay_intro.html.
    [34] "NumPy," [Online]. Available: https://numpy.org.
    [35] "TVM User Tutorial," [Online]. Available: https://tvm.apache.org/docs/tutorial/index.html.
    [36] "transform.py," [Online]. Available: https://tvm.apache.org/docs/reference/api/python/relay/transform.html.
    [37] "apache / tvm," [Online]. Available: https://github.com/apache/tvm/tree/main/python/tvm/ir.
    [38] "OpenCL Official Website," [Online]. Available: https://www.khronos.org/opencl/.
    [39] "HSA Technologies," [Online]. Available: http://www.hsafoundation.com/.

    Full text available on campus: 2024-12-28
    Full text available off campus: 2024-12-28