| Graduate Student: | 許峰銘 Hsu, Feng-Ming |
|---|---|
| Thesis Title: | 設計張量處理單元與實作其應用程式介面於CASLab-GPU — Tensor Process Unit (TPU) design and TPU APIs implementation for CASLab-GPU |
| Advisor: | 陳中和 Chen, Chung-Ho |
| Degree: | 碩士 Master |
| Department: | 電機資訊學院 - 電腦與通信工程研究所 Institute of Computer & Communication Engineering |
| Year of Publication: | 2021 |
| Academic Year of Graduation: | 109 |
| Language: | Chinese |
| Number of Pages: | 76 |
| Chinese Keywords: | 通用型繪圖處理器, 硬體加速器, 編譯器, 矩陣乘法 |
| English Keywords: | GPGPU, Accelerator, Compiler, Matrix Multiplication |
Machine learning and artificial intelligence are advancing rapidly today, and their applications are expanding just as quickly. Beyond the high computing power once available only in the cloud, demand for edge computing keeps growing to satisfy data-security and real-time requirements, and many hardware accelerators designed for edge devices have appeared as a result. The CASLab-GPU developed in our laboratory achieves hardware acceleration through a SIMD (single instruction multiple data) architecture and, together with its software stack, supports OpenCL and TensorFlow computation. In many large neural-network workloads, however, there is still room to optimize execution speed. This thesis studies adjustments to and redesign of the hardware microarchitecture, in coordination with the compiler and the computation libraries, to investigate how to break through the computational bottlenecks of the traditional architecture.
Among machine-learning algorithms, matrix multiplication has by far the greatest performance impact, and many hardware accelerators are designed specifically for this kind of computation. Building on the existing CASLab-GPU, this thesis reuses its floating-point multiply–accumulate (MAC) units and rearranges them into a matrix multiplier that serves as a special functional unit for matrix multiplication. In this way a hardware accelerator is added to the CASLab-GPU without much extra hardware cost.
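The rearrangement of MAC units into a matrix multiplier described above can be sketched as a small software model. The tile size and all names here are illustrative assumptions, not the actual CASLab-GPU design:

```python
# Software model of a TILE x TILE array of multiply-accumulate (MAC)
# cells: cell (i, j) owns output element c[i][j] and accumulates
# a[i][k] * b[k][j] over successive broadcast steps, the way an array
# of hardware MAC units would in lockstep.
TILE = 4  # assumed array dimension, not the real hardware value

def mac_array_matmul(a, b):
    """Compute c = a @ b for TILE x TILE tiles using only scalar MACs."""
    c = [[0.0] * TILE for _ in range(TILE)]
    for k in range(TILE):              # one broadcast step per "cycle"
        for i in range(TILE):
            for j in range(TILE):
                c[i][j] += a[i][k] * b[k][j]   # the MAC operation
    return c

identity = [[1.0 if i == j else 0.0 for j in range(TILE)]
            for i in range(TILE)]
a = [[float(i + j) for j in range(TILE)] for i in range(TILE)]
print(mac_array_matmul(a, identity) == a)  # multiplying by I returns a
```

The point of the model is that no new arithmetic capability is needed: the same scalar MAC operations the GPU already has, wired into a 2-D arrangement, yield one output tile per pass.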
Beyond the hardware design, so that applications already written in OpenCL and TensorFlow can use the accelerator, this thesis adds special instructions to the CASLab-GPU compiler. When the hardware receives a special instruction, it switches to a computation mode different from the original one and uses the matrix multiply–accumulate unit to accelerate the computation. Because the instructions used for this kind of computation change, the original linear-algebra library functions also need to be redesigned and optimized; this thesis therefore proposes an algorithm that decides how to execute the computation according to different requirements. Through these changes to the hardware microarchitecture, together with the modification and optimization of the compiler and libraries, the CASLab-GPU's execution time for such machine-learning computations is reduced.
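The library-level dispatch idea described above can be sketched as follows. The function names, the tile size, and the dispatch policy are all illustrative assumptions, not the actual CASLab-GPU library API:

```python
# Hedged sketch of a GEMM routine that chooses an execution mode:
# issue the special matrix-multiply instruction (modeled here by
# gemm_tpu) when the shape suits the accelerator, otherwise fall
# back to the ordinary scalar path.
TILE = 4  # assumed accelerator tile size

def gemm_scalar(a, b):
    """Reference path: plain triple-loop matrix multiply."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def gemm_tpu(a, b):
    # Stand-in for the accelerated mode: on hardware this would be one
    # special instruction per TILE x TILE tile of the output.
    return gemm_scalar(a, b)  # same result, different execution mode

def gemm(a, b):
    n, k, m = len(a), len(b), len(b[0])
    # One possible policy: use the accelerator only when every
    # dimension fills whole tiles, so no padding overhead is incurred.
    if n % TILE == 0 and k % TILE == 0 and m % TILE == 0:
        return gemm_tpu(a, b)
    return gemm_scalar(a, b)

a = [[1.0] * 4 for _ in range(4)]
b = [[2.0] * 4 for _ in range(4)]
print(gemm(a, b)[0][0])  # each element sums 1*2 over k=4 -> 8.0
```

Keeping both paths behind one entry point lets existing OpenCL/TensorFlow code benefit from the accelerator without source changes, which matches the thesis goal of hiding the special instructions behind the library.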
Because Artificial Intelligence (AI) is widely applied in various fields, it is important to use GPGPUs (General-Purpose Graphics Processing Units) or ASICs (Application-Specific Integrated Circuits) to accelerate computation. We implement a virtual platform, CASLab-GPU, a GPGPU with a SIMT (Single Instruction Multiple Thread) architecture. Although a GPGPU can support many different applications through its software stack, the implementation of the software library has a great impact on GPGPU performance. An ASIC, on the other hand, performs very well on its specific application but lacks versatility. In this thesis, we design a new processing unit, the TPU (Tensor Process Unit), for the CASLab-GPU. The TPU accelerates computation related to matrix multiplication. Because it is new hardware added to the GPU, we also design new instructions and the corresponding compiler support so that programmers can use the accelerator conveniently. This software design flow can also serve other accelerators in the future. Experimental results for LeNet-5 and several matrix-multiplication applications on the CASLab-GPU show that computing with the TPU reduces execution time by 20%.
On-campus access: available from 2026-01-29.