| Graduate Student: | Chi, Yuan (齊元) |
|---|---|
| Thesis Title: | OpenCL Kernel Attribute Prediction for Operation Mode Selection in SIMT/MIMD Dual-mode Architecture |
| Advisor: | Chen, Chung-Ho (陳中和) |
| Degree: | Master |
| Department: | Institute of Computer & Communication Engineering, College of Electrical Engineering & Computer Science |
| Publication Year: | 2016 |
| Academic Year: | 104 |
| Language: | English |
| Pages: | 72 |
| Keywords: | Compiler Analysis, MIMD, OpenCL Kernel Attribute, Operation Mode Prediction, SIMT |
| Hits / Downloads: | 79 / 3 |
In modern high-performance computing, parallel computing has become increasingly important and is broadly used. The OpenCL framework offers programmers a convenient environment for implementing parallel programs on heterogeneous platforms: any platform that supports the OpenCL model can execute OpenCL applications. Both SIMT and MIMD architectures can run OpenCL applications, but the two architectures execute them with different characteristics. Through observations on our dual-mode SIMT/MIMD platform, we find that certain kernel attributes have a dramatic impact on performance, so running each kernel in the operation mode that suits its attributes is an important issue. In this thesis, we propose a compiler-assisted analysis framework composed of two parts: a kernel attribute analyzer and a performance prediction model. The framework analyzes OpenCL kernel attributes and predicts the most suitable execution mode for each kernel with 95% accuracy. At this accuracy, running kernels in the predicted mode yields an average speedup of 1.5X across 70 benchmarks compared with running them in the unsuitable mode.
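The mode-selection step described above can be pictured as a small classifier mapping extracted kernel attributes to an execution mode. The sketch below is a hand-written stand-in for the thesis's trained prediction model; the feature names (`branch_divergence_ratio`, `coalesced_access_ratio`) and thresholds are illustrative assumptions only, not the actual attribute set or model.

```python
# Hypothetical sketch of the prediction model's role: map OpenCL
# kernel attributes (as extracted by a compiler-side analyzer) to an
# execution mode, SIMT or MIMD. Feature names and thresholds are
# illustrative assumptions; the thesis trains a real model instead.

def predict_mode(attrs: dict) -> str:
    """Return 'SIMT' or 'MIMD' for a kernel attribute vector."""
    # Heavy branch divergence serializes SIMT lanes, favoring MIMD.
    if attrs["branch_divergence_ratio"] > 0.5:
        return "MIMD"
    # Regular data-parallel kernels with coalesced memory access
    # keep SIMT lanes busy, favoring SIMT.
    if attrs["coalesced_access_ratio"] > 0.7:
        return "SIMT"
    # Otherwise fall back to MIMD as the more flexible mode.
    return "MIMD"

# Two toy kernels: a regular dense kernel and a divergent one.
regular = {"branch_divergence_ratio": 0.1, "coalesced_access_ratio": 0.9}
divergent = {"branch_divergence_ratio": 0.8, "coalesced_access_ratio": 0.2}
print(predict_mode(regular))    # SIMT
print(predict_mode(divergent))  # MIMD
```

In the actual framework, the rules above would be replaced by a model (e.g., a decision tree or SVM) trained on measured per-kernel runtimes in each mode, which is what allows the reported 95% prediction accuracy.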