| Graduate Student: | 許峰銘 Hsu, Feng-Ming |
|---|---|
| Thesis Title: | 設計張量處理單元與實作其應用程式介面於CASLab-GPU — Tensor Process Unit (TPU) design and TPU APIs implementation for CASLab-GPU |
| Advisor: | 陳中和 Chen, Chung-Ho |
| Degree: | 碩士 Master |
| Department: | 電機資訊學院 - 電腦與通信工程研究所 Institute of Computer & Communication Engineering |
| Year of Publication: | 2021 |
| Academic Year of Graduation: | 109 |
| Language: | Chinese |
| Number of Pages: | 76 |
| Chinese Keywords: | 通用型繪圖處理器, 硬體加速器, 編譯器, 矩陣乘法 |
| English Keywords: | GPGPU, Accelerator, Compiler, Matrix Multiplication |
Machine learning and artificial intelligence are advancing rapidly today, and their applications are expanding just as quickly. Beyond the high computing power once available only in the cloud, demand for edge computing keeps growing to satisfy data-security and real-time requirements, and many hardware accelerators designed for edge devices have appeared as a result. The CASLab-GPU developed in our laboratory achieves hardware acceleration through a SIMD (single instruction multiple data) architecture and, together with its software stack, supports OpenCL and TensorFlow computation. In many large neural-network workloads, however, there is still room to optimize execution speed. This thesis studies adjustments to and redesign of the hardware microarchitecture, in coordination with the compiler and the computation libraries, to investigate how to break through the computational bottlenecks of the traditional architecture.
Among machine-learning algorithms, matrix multiplication has by far the greatest performance impact, and many hardware accelerators are designed specifically for this kind of computation. Building on the existing CASLab-GPU, this thesis reuses its floating-point multiply–accumulate (MAC) units and rearranges them into a matrix multiplier that serves as a special functional unit for matrix multiplication. In this way a hardware accelerator is added to the CASLab-GPU without much extra hardware cost.
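The rearrangement of MAC units into a matrix multiplier described above can be sketched as a small software model. The tile size and all names here are illustrative assumptions, not the actual CASLab-GPU design:

```python
# Software model of a TILE x TILE array of multiply-accumulate (MAC)
# cells: cell (i, j) owns output element c[i][j] and accumulates
# a[i][k] * b[k][j] over successive broadcast steps, the way an array
# of hardware MAC units would in lockstep.
TILE = 4  # assumed array dimension, not the real hardware value

def mac_array_matmul(a, b):
    """Compute c = a @ b for TILE x TILE tiles using only scalar MACs."""
    c = [[0.0] * TILE for _ in range(TILE)]
    for k in range(TILE):              # one broadcast step per "cycle"
        for i in range(TILE):
            for j in range(TILE):
                c[i][j] += a[i][k] * b[k][j]   # the MAC operation
    return c

identity = [[1.0 if i == j else 0.0 for j in range(TILE)]
            for i in range(TILE)]
a = [[float(i + j) for j in range(TILE)] for i in range(TILE)]
print(mac_array_matmul(a, identity) == a)  # multiplying by I returns a
```

The point of the model is that no new arithmetic capability is needed: the same scalar MAC operations the GPU already has, wired into a 2-D arrangement, yield one output tile per pass.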
Beyond the hardware design, so that applications already written in OpenCL and TensorFlow can use the accelerator, this thesis adds special instructions to the CASLab-GPU compiler. When the hardware receives a special instruction, it switches to a computation mode different from the original one and uses the matrix multiply–accumulate unit to accelerate the computation. Because the instructions used for this kind of computation change, the original linear-algebra library functions also need to be redesigned and optimized; this thesis therefore proposes an algorithm that decides how to execute the computation according to different requirements. Through these changes to the hardware microarchitecture, together with the modification and optimization of the compiler and libraries, the CASLab-GPU's execution time for such machine-learning computations is reduced.
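The library-level dispatch idea described above can be sketched as follows. The function names, the tile size, and the dispatch policy are all illustrative assumptions, not the actual CASLab-GPU library API:

```python
# Hedged sketch of a GEMM routine that chooses an execution mode:
# issue the special matrix-multiply instruction (modeled here by
# gemm_tpu) when the shape suits the accelerator, otherwise fall
# back to the ordinary scalar path.
TILE = 4  # assumed accelerator tile size

def gemm_scalar(a, b):
    """Reference path: plain triple-loop matrix multiply."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def gemm_tpu(a, b):
    # Stand-in for the accelerated mode: on hardware this would be one
    # special instruction per TILE x TILE tile of the output.
    return gemm_scalar(a, b)  # same result, different execution mode

def gemm(a, b):
    n, k, m = len(a), len(b), len(b[0])
    # One possible policy: use the accelerator only when every
    # dimension fills whole tiles, so no padding overhead is incurred.
    if n % TILE == 0 and k % TILE == 0 and m % TILE == 0:
        return gemm_tpu(a, b)
    return gemm_scalar(a, b)

a = [[1.0] * 4 for _ in range(4)]
b = [[2.0] * 4 for _ in range(4)]
print(gemm(a, b)[0][0])  # each element sums 1*2 over k=4 -> 8.0
```

Keeping both paths behind one entry point lets existing OpenCL/TensorFlow code benefit from the accelerator without source changes, which matches the thesis goal of hiding the special instructions behind the library.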
Because Artificial Intelligence (AI) is widely applied in various fields, it is important to use GPGPUs (General-Purpose Graphics Processing Units) or ASICs (Application-Specific Integrated Circuits) to accelerate computation. We implement a virtual platform, CASLab-GPU, a GPGPU with a SIMT (Single Instruction Multiple Thread) architecture. Although a GPGPU can support many different applications through its software stack, the implementation of the software library has a great impact on GPGPU performance. An ASIC, on the other hand, performs very well on its specific application but lacks versatility. In this thesis, we design a new processing unit, the TPU (Tensor Process Unit), for the CASLab-GPU. The TPU accelerates computation related to matrix multiplication. Because it is new hardware added to the GPU, we also design new instructions and the corresponding compiler support so that programmers can use the accelerator conveniently. This software design flow can also serve other accelerators in the future. Experimental results for LeNet-5 and several matrix-multiplication applications on the CASLab-GPU show that computing with the TPU reduces execution time by 20%.
On-campus access: available from 2026-01-29.