
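Author: 陳惇介
Chen, Dun-Jie
Thesis title: CASLab-GPU OpenCL LLVM編譯器實作與優化
LLVM-based OpenCL Compiler for CASLab-GPU
Advisor: 陳中和
Chen, Chung-Ho
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of publication: 2019
Academic year of graduation: 108 (ROC calendar)
Language: Chinese
Number of pages: 100
Keywords (Chinese): 終端裝置 (edge device), 通用繪圖處理器 (GPGPU), 編譯器 (compiler), 編譯最佳化 (compiler optimization)
Keywords (English): Compiler, Compiler optimization, Edge device, GPGPU, LLVM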
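  • With the rapid growth of machine learning applications, increasingly complex model architectures and rising demands for data privacy and real-time response are moving computation from the cloud toward edge devices. So that the CASLab-GPU developed in our laboratory can be combined with CPUs of different architectures to form edge computing devices, and to offer developers a convenient open execution environment, CASLab-GPU adopts the open standard OpenCL as its programming language. However, the previously used AMD OpenCL offline compiler (AMD CLOC) is a closed-source project whose compiler is available only on x86 platforms, which is a major obstacle to porting CASLab-GPU to other platforms. In addition, the original compilation flow (AMD CLOC + Finalizer) is inefficient, which becomes a performance bottleneck for a language such as OpenCL that is sensitive to compilation time.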
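    Therefore, this thesis designs a CASLab-GPU OpenCL offline compiler based on the open-source LLVM Compiler Infrastructure Project and integrates it with the OpenCL runtime and TensorFlow runtime previously developed in our laboratory. Applications written in OpenCL and neural network models running on TensorFlow are executed on CASLab-GPUSim to simulate real application scenarios and to verify the correctness of the overall compilation flow.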
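    To improve the execution efficiency of CASLab-GPU, the compiler designed in this thesis introduces platform-specific optimizations that target the ISA and execution architecture of CASLab-GPU: branch optimization, load/store optimization, and instruction optimization. These allow the CASLab-GPUSim platform to run OpenCL applications and TensorFlow neural network models that are closer to real-world use. With the improved OpenCL compilation flow and the OpenCL compiler designed in this thesis, neural network inference running on TensorFlow achieves a 15% overall performance improvement. Beyond the gain in hardware execution performance, the OpenCL runtime running on the CPU also achieves a performance improvement of up to 85%.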

    With the increasing popularity of machine learning applications, their computing model has gradually extended from cloud computing to edge computing. To provide an edge-computing platform built on CASLab-GPU, the GPGPU hardware designed by CASLab, we have implemented a software development environment that includes an OpenCL runtime, an HSA runtime, and compilation tools.
    To support the software development environment for the CASLab-GPU platform, this thesis implements an OpenCL compiler with optimization methods that greatly increase execution efficiency on CASLab-GPU. This new compiler replaces the original closed-source AMD CLOC compiler used by CASLab-GPU. According to our experimental results, we achieve an average speedup of 7.6x in OpenCL runtime execution and an execution speedup of 1.4x across various OpenCL benchmarks, including the TensorFlow CNN model LeNet-5.

    Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
      1.1 Motivation
      1.2 Contributions
      1.3 Thesis Organization
    Chapter 2 Background
      2.1 Tensorflow
        2.1.1 Tensorflow Runtime
        2.1.2 Tensorflow Stream Executor
        2.1.3 TF-Coriander
      2.2 OpenCL Runtime
        2.2.1 OpenCL Programming model
      2.3 CASLab-GPU Architecture
        2.3.1 HSA Runtime
        2.3.2 Device memory model
      2.4 LLVM Compiler Infrastructure Project
        2.4.1 LLVM Frontend - Clang
        2.4.2 LLVM-IR
        2.4.3 Intrinsic Functions
        2.4.4 LLVM Container Structure
        2.4.5 Directed Acyclic Graph (DAG)
        2.4.6 LLVM Pass
        2.4.7 Tablegen Language
    Chapter 3 Compiler Infrastructure for CASLab-GPU
      3.1 Overall structure of CASLab-GPU within LLVM
      3.2 ABI definition
      3.3 Instruction Definition
      3.4 Instruction Selection
      3.5 Intrinsic Function
      3.6 Optimizations
        3.6.1 Branch Optimization
        3.6.2 Memory Operation Optimizations
      3.7 Code emitter
        3.7.1 Assembly printer
        3.7.2 Binary code emitter
      3.8 ELF Linker
    Chapter 4 OpenCL Runtime and Device libraries
      4.1 OpenCL resources management
      4.2 OpenCL Compilation flow
      4.3 Device Libraries
    Chapter 5 Tensorflow Kernel Operators
      5.1 Tensorflow Kernel Operator registration
      5.2 Tensorflow Kernel Functor
    Chapter 6 Experimental Results and Performance Evaluation
      6.1 Experiment Environment and Benchmarks
      6.2 Verification of Compiler Infrastructure
      6.3 Performance of CASLab-GPU OpenCL compiler
        6.3.1 Static analysis
        6.3.2 Execution performance analysis
    Chapter 7 Conclusion
    References


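    On campus: available from 2024-11-01
    Off campus: not available
    The electronic thesis has not been authorized for public release; for the print copy, please consult the library catalog.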