
Graduate Student: Chi, Yuan (齊元)
Thesis Title: OpenCL Kernel Attribute Prediction for Operation Mode Selection in SIMT/MIMD Dual-mode Architecture
Advisor: Chen, Chung-Ho (陳中和)
Degree: Master
Department: College of Electrical Engineering & Computer Science - Institute of Computer & Communication Engineering
Year of Publication: 2016
Graduation Academic Year: 104 (ROC calendar)
Language: English
Number of Pages: 72
Chinese Keywords (translated): Compiler-Assisted Analysis, MIMD, OpenCL Kernel Attributes, Operation Mode Prediction, SIMT
English Keywords: Compiler Analysis, MIMD, OpenCL Kernel Attribute, Operation Mode Prediction, SIMT
  • Chinese Abstract (translated): In modern high-performance computing, parallel computing has become increasingly important. The OpenCL framework allows programmers to implement parallel programs on heterogeneous platforms quickly and conveniently. Both MIMD and SIMT platforms can run OpenCL programs as long as they support the OpenCL framework, but the two execute these programs differently. By observing our dual-mode platform, we found that certain kernel attributes severely affect execution performance, so running each program in the operation mode that suits it is critical. This thesis therefore proposes a compiler-assisted analysis framework consisting of two parts: a kernel attribute analyzer and a performance prediction model. The framework analyzes kernel attributes and predicts each kernel's best execution mode with 95% accuracy. Owing to this high accuracy, running kernels in the correct mode yields an average 1.5x speedup over the unsuitable mode across 70 benchmarks.

    In the field of high-performance computing, parallel computing has become increasingly important and broadly used. Since the OpenCL framework provides a friendly and convenient environment, programmers can implement parallel programs on any heterogeneous platform that supports the OpenCL model. Both SIMT and MIMD architectures can execute OpenCL applications on their own, provided they support the OpenCL framework, but each executes those applications with different characteristics. Through observations on our dual-mode SIMT/MIMD platform, we find that some kernel features have a dramatic impact on performance; thus, running each OpenCL application in the proper operation mode is an important issue. In this thesis, we design a compiler analysis framework comprising code-feature-analysis tools and prediction training models that predict the most suitable execution mode for each kernel, so that every kernel can be executed in its best mode. The framework achieves 95% prediction accuracy and, across 70 benchmarks, yields an average speedup of 1.5x compared with running in the improper mode.
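    The two-part framework in the abstract (an attribute analyzer feeding a mode predictor) can be illustrated with a minimal Python sketch. The four feature names follow the factors listed in Chapter 5 (synchronization frequency, data locality, initialization overhead, branch divergence), but the decision rules, thresholds, and example values below are purely hypothetical illustrations, not the thesis's trained decision-tree or SVM models.

    ```python
    # Illustrative sketch only: hand-rolled decision rules standing in for
    # the thesis's trained models. All thresholds are invented for the example.

    def predict_mode(features):
        """Return 'SIMT' or 'MIMD' from static kernel attributes (hypothetical rules)."""
        # Frequent barriers are cheap under lock-step SIMT execution.
        if features["sync_freq"] > 0.10:
            return "SIMT"
        # Heavy branch divergence serializes SIMT lanes, so prefer MIMD.
        if features["branch_divergence"] > 0.30:
            return "MIMD"
        # Strong data locality suits SIMT's grouped memory accesses.
        if features["data_locality"] > 0.50:
            return "SIMT"
        # Otherwise, high per-kernel initialization overhead favors MIMD.
        return "MIMD" if features["init_overhead"] > 0.20 else "SIMT"

    # Two hypothetical kernels with analyzer-produced feature values.
    kernels = {
        "matmul": {"sync_freq": 0.15, "branch_divergence": 0.05,
                   "data_locality": 0.80, "init_overhead": 0.02},
        "bfs":    {"sync_freq": 0.02, "branch_divergence": 0.45,
                   "data_locality": 0.20, "init_overhead": 0.10},
    }
    for name, feats in kernels.items():
        print(name, predict_mode(feats))  # matmul SIMT, bfs MIMD
    ```

    In the actual framework these rules would be replaced by the trained models of Section 5.3 (decision trees and grouped SVMs with credit voting), with features extracted by the compiler and assembler analysis of Section 5.2.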

    Abstract (Chinese) I
    Abstract II
    List of Tables VII
    List of Figures VIII
    Chapter 1 Introduction 1
      1.1 Motivation 1
      1.2 Contribution 1
      1.3 Organization 2
    Chapter 2 Background 3
      2.1 OpenCL Framework 3
        2.1.1 OpenCL Platform Model 3
        2.1.2 OpenCL Execution Model 5
        2.1.3 OpenCL Memory Hierarchy 6
        2.1.4 OpenCL Runtime System Architecture 7
      2.2 LLVM Framework 8
        2.2.1 LLVM Front-End 9
        2.2.2 LLVM Intermediate Representation 11
        2.2.3 LLVM Back-End 11
        2.2.4 Clang Plugins 12
      2.3 Machine Learning Training Algorithms 12
        2.3.1 Decision Tree Learning 13
        2.3.2 Support Vector Machine 13
    Chapter 3 Related Work 15
      3.1 OpenCL Applications in MIMD Mode 15
      3.2 OpenCL Applications in SIMT-Mode GPGPU 16
      3.3 Scheduling OpenCL Applications on Heterogeneous Platforms 16
    Chapter 4 SIMT/MIMD Dual-Mode Processor Architecture 17
      4.1 Architecture Overview 17
      4.2 SIMT Mode Operation 18
      4.3 MIMD Mode Operation 19
      4.4 Dual-Mode Performance Discussion 20
    Chapter 5 OpenCL Kernel Code Attribute Prediction 22
      5.1 Performance Prediction Factors 22
        5.1.1 Static Synchronization Frequency 22
        5.1.2 Data Locality 25
        5.1.3 Initialization Overhead 29
        5.1.4 Branch Divergence 31
      5.2 Compiler and Assembler Analysis 33
        5.2.1 Overview of the Compiler Analysis Framework 33
        5.2.2 Static Kernel Code Analysis 34
        5.2.3 Assembly Code Analysis 38
      5.3 Prediction Training Models 41
        5.3.1 Decision Tree Learning and Weighted Decision Tree Learning 43
        5.3.2 Support Vector Machine and Grouped Support Vector Machine 44
        5.3.3 Grouped Support Vector Machine with Credit Voting Mechanism 46
    Chapter 6 Experiment Results 48
      6.1 Training Results 52
        6.1.1 Decision Tree Learning 52
        6.1.2 Weighted Decision Tree 54
        6.1.3 Support Vector Machine 56
        6.1.4 Grouped Support Vector Machine 58
        6.1.5 Weighted Decision Tree with Voting Grouped Support Vector Machine 60
        6.1.6 Discussion of Prediction Accuracy of the Training Models 64
      6.2 Hybrid Mode Operation Result 65
    Chapter 7 Conclusion 67
    References 68

    [1] Khronos Group Inc., “The OpenCL Specification, Version 2.0,” 2014. [Online]
    Available: https://www.khronos.org/registry/cl/specs/opencl-2.0.pdf
    [2] The LLVM Compiler Infrastructure. [Online] Available: http://llvm.org/
    [3] clang: a C language family frontend for LLVM. [Online]
    Available: http://clang.llvm.org/
    [4] clang: Driver Design & Internals. [Online]
    Available: http://clang.llvm.org/docs/DriverInternals.html
    [5] clang: C Interface to Clang. [Online]
    Available: http://clang.llvm.org/doxygen/group__CINDEX.html
    [6] Decision tree learning. [Online]
    Available: https://en.wikipedia.org/wiki/Decision_tree_learning
    [7] Support vector machine. [Online]
    Available: https://en.wikipedia.org/wiki/Support_vector_machine
    [8] H-S. Kim, I. Hajj, J. Stratton, S. Lumetta, and W-M. Hwu, “Locality-Centric Thread Scheduling for Bulk-Synchronous Programming Models on CPU Architectures,” in Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’15), pp. 257-268, Feb. 2015.
    [9] J. Stratton, S. Stone, and W-M. Hwu, “Efficient Compilation of Fine-grained SPMD-threaded Programs for Multicore CPUs,” in Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’10), pp. 111-119, Apr. 2010.
    [10] J. Kim, S. Seo, J. Lee, J. Nah, G. Jo, and J. Lee, “SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters,” in Proceedings of the 26th ACM International Conference on Supercomputing (ICS), pp. 341-352, Jun. 2012.
    [11] G. Jo, W. Jeon, W. Jung, G. Taft, and J. Lee, “OpenCL framework for ARM processors with NEON support,” in the Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing, pp. 33-40, 2014.
    [12] P. Jääskeläinen, C. de La Lama, E. Schnetter, K. Raiskila, J. Takala, and H. Berg, “pocl: A Performance-Portable OpenCL Implementation,” International Journal of Parallel Programming, 43(5), pp. 752-785, 2014.
    [13] J. Gummaraju, L. Morichetti, M. Houston, B. Sander, B. Gaster, and B. Zheng, “Twin Peaks: a Software Platform for Heterogeneous Computing on General-Purpose and Graphics Processors,” in International Conference on Parallel Architecture and Compilation Techniques (PACT’10), pp. 205-216, Sep. 2010.
    [14] J. Lee, J. Kim, S. Seo, S. Kim, J. Park, and H. Kim, “An OpenCL Framework for Heterogeneous Multicores with Local Memory,” in International Conference on Parallel Architecture and Compilation Techniques (PACT’10), pp. 193-204, Sep. 2010.
    [15] G. Diamos, B. Ashbaugh, S. Maiyuran, A. Kerr, H. Wu, and S. Yalamanchili, “SIMD Re-Convergence At Thread Frontiers,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 477-488, Dec. 2011.
    [16] N. Brunie, S. Collange, and G. Diamos, “Simultaneous Branch and Warp Interweaving for Sustained GPU Performance,” in Proceedings of 39th Annual International Symposium on Computer Architecture (ISCA), pp. 49-60, Jun. 2012.
    [17] W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” in ACM Transactions on Architecture and Code Optimization (TACO), Vol. 6, No. 2, Article 7, 2009.
    [18] W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow,” in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’07), pp. 407-420, Dec. 2007.
    [19] W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” in Proceedings of the 17th International Symposium on High Performance Computer Architecture (HPCA’11), pp. 25-36, Feb. 2011.
    [20] M. Rhu and M. Erez, “The Dual-Path Execution Model for Efficient GPU Control Flow,” in Proceedings of the 19th International Symposium on High Performance Computer Architecture (HPCA’13), pp. 591-602, Feb. 2013.
    [21] Y. Wen, Z. Wang, and M. O’Boyle, “Smart Multi-Task Scheduling for OpenCL Programs on CPU/GPU Heterogeneous Platforms,” in Proceedings of the 21st International Conference on High Performance Computing (HiPC), pp. 1-10, Dec. 2014.
    [22] V. Ravi, M. Becchi, W. Jiang, G. Agrawal, and S. Chakradhar, “Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes,” in International Symposium on Cluster, Cloud and Grid Computing (CCGRID’12), pp. 140-147, May 2012.
    [23] P. Phothilimthana, J. Ansel, J. Ragan-Kelley, and S. Amarasinghe, “Portable Performance on Heterogeneous Architectures,” in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’13), pp. 431-444, Mar. 2013.
    [24] D. Grewe, Z. Wang, and M. O’Boyle, “OpenCL Task Partitioning in the Presence of GPU Contention,” in Languages and Compilers for Parallel Computing (LCPC), Lecture Notes in Computer Science, vol. 8664, pp. 87-101, Oct. 2014.
    [25] M. Steuwer, “Performance Prediction of OpenCL Applications in Parallel Heterogeneous Systems,” HPC-EUROPA2 project (project number: 228398)
    [26] K-C. Chen, Y. Chi, C-W. Lin, and C-H. Chen, “Spatiotemporal SIMT Design on Multiprocessor for Efficient Data-Parallel Processing,” submitted to IEEE Transactions on Computers, July 2016.
    [27] NVidia OpenCL Benchmarks [Online]
    Available: https://github.com/ashwinraghav/Grafight/tree/master/gpgpu-sim/benchmarks/OpenCL
    [28] Rodinia OpenCL Benchmarks [Online]
    Available: http://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php/Rodinia:Accelerating_Compute-Intensive_Applications_with_Accelerators
    [29] AMD OpenCL Benchmarks [Online]
    Available: https://developer.nvidia.com/cuda-toolkit
    [30] Parboil OpenCL Benchmarks [Online]
    Available: https://github.com/abduld/Parboil

    Full-text availability: on campus 2021-08-31; off campus 2021-08-31.