National Cheng Kung University (NCKU) Theses and Dissertations System

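Student:	陳惇介 Chen, Dun-Jie
Thesis title:	LLVM-based OpenCL Compiler for CASLab-GPU (Chinese title: CASLab-GPU OpenCL LLVM編譯器實作與優化, "Implementation and Optimization of a CASLab-GPU OpenCL LLVM Compiler")
Advisor:	陳中和 Chen, Chung-Ho
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of publication:	2019
Graduation academic year:	108 (ROC calendar, i.e. the 2019–2020 academic year)
Language:	Chinese
Number of pages:	100
Keywords (Chinese):	edge device, general-purpose graphics processor (GPGPU), compiler, compiler optimization
Keywords (English):	Compiler, Compiler optimization, Edge device, GPGPU, LLVM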
相關次數：	點閱：138 下載：0
分享至:	分享至facebook 分享至twitter

查詢本校圖書館目錄查詢臺灣博碩士論文知識加值系統勘誤回報

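With the rapid growth of machine-learning applications, increasingly complex model architectures and rising demands for data privacy and real-time response have shifted computation from the cloud toward edge devices. So that the CASLab-GPU developed by our laboratory can be combined with CPUs of different architectures to form edge computing devices, and so that developers have a convenient, open execution environment, CASLab-GPU is programmed in OpenCL, an open standard. However, the AMD OpenCL offline compiler (AMD CLOC) used previously is a closed-source project that only runs on x86 platforms, which is a major obstacle to porting the CASLab-GPU platform. Moreover, the original compilation flow (AMD CLOC + Finalizer) is inefficient; because OpenCL kernels are typically compiled while the application runs, this makes compilation time a performance bottleneck.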
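This thesis therefore designs a CASLab-GPU OpenCL offline compiler based on the open-source LLVM Compiler Infrastructure Project and integrates it into the OpenCL runtime and TensorFlow runtime previously developed by our laboratory. OpenCL applications and neural-network models running on TensorFlow are executed on CASLab-GPUSim to simulate real application scenarios and to verify the correctness of the entire compilation flow.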
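To improve the execution efficiency of CASLab-GPU, the proposed compiler applies platform-specific optimizations targeting the ISA and execution architecture of CASLab-GPU: branch optimization, load/store optimization, and instruction optimization. These allow the CASLab-GPUSim platform to run OpenCL applications and TensorFlow neural-network models that are closer to real-world workloads. With the improved OpenCL compilation flow and the proposed OpenCL compiler, neural-network inference running on TensorFlow achieves a 15% overall performance improvement. Beyond the gain in hardware execution performance, the OpenCL runtime running on the CPU also achieves a performance improvement of up to 85%.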

With the growing popularity of machine-learning applications, their computing model has gradually extended from cloud computing to edge computing. To provide an edge-computing platform built around CASLab-GPU, the GPGPU hardware designed by CASLab, we have implemented a software development environment that includes an OpenCL runtime, an HSA runtime, and compilation tools.
As part of this software development environment, this thesis implements an OpenCL compiler with optimization methods that greatly increase execution efficiency on CASLab-GPU. The new compiler replaces the closed-source AMD CLOC compiler originally used for CASLab-GPU. In our experiments, it achieves an average speedup of 7.6× in OpenCL runtime execution and a 1.4× execution speedup across various OpenCL benchmarks, including the TensorFlow CNN model LeNet-5.
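The abstract does not state how the speedups were measured. Assuming the common convention that speedup is the ratio of baseline execution time (the original AMD CLOC flow) to execution time with the new compiler, the reported figures translate into remaining execution-time fractions as follows; this is derived arithmetic under that assumption, not an additional measurement:

$$S = \frac{T_{\text{baseline}}}{T_{\text{new}}} \quad\Longrightarrow\quad \frac{T_{\text{new}}}{T_{\text{baseline}}} = \frac{1}{S}$$

$$S = 7.6:\ \frac{1}{7.6} \approx 0.13, \qquad S = 1.4:\ \frac{1}{1.4} \approx 0.71$$

Under this convention, a 7.6× speedup means the work takes roughly 13% of the baseline time, and a 1.4× speedup roughly 71%.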

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
1 Motivation
2 Contributions
3 Thesis Organization
Chapter 2 Background
1 TensorFlow
1.1 TensorFlow Runtime
1.2 TensorFlow Stream Executor
1.3 TF-Coriander
2 OpenCL Runtime
2.1 OpenCL Programming Model
3 CASLab-GPU Architecture
3.1 HSA Runtime
3.2 Device Memory Model
4 LLVM Compiler Infrastructure Project
4.1 LLVM Frontend - Clang
4.2 LLVM IR
4.3 Intrinsic Functions
4.4 LLVM Container Structure
4.5 Directed Acyclic Graph (DAG)
4.6 LLVM Pass
4.7 TableGen Language
Chapter 3 Compiler Infrastructure for CASLab-GPU
1 Overall Structure of CASLab-GPU within LLVM
2 ABI Definition
3 Instruction Definition
4 Instruction Selection
5 Intrinsic Functions
6 Optimizations
6.1 Branch Optimization
6.2 Memory Operation Optimizations
7 Code Emitter
7.1 Assembly Printer
7.2 Binary Code Emitter
8 ELF Linker
Chapter 4 OpenCL Runtime and Device Libraries
1 OpenCL Resource Management
2 OpenCL Compilation Flow
3 Device Libraries
Chapter 5 TensorFlow Kernel Operators
1 TensorFlow Kernel Operator Registration
2 TensorFlow Kernel Functor
Chapter 6 Experimental Results and Performance Evaluation
1 Experiment Environment and Benchmarks
2 Verification of Compiler Infrastructure
3 Performance of CASLab-GPU OpenCL Compiler
3.1 Static Analysis
3.2 Execution Performance Analysis
Chapter 7 Conclusion
References

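Off-campus access: not public. Neither the electronic nor the printed version of this thesis has been authorized for public release.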
