
Graduate Student: Chi, Yuan (齊元)
Thesis Title: OpenCL Kernel Attribute Prediction for Operation Mode Selection in SIMT/MIMD Dual-mode Architecture
Advisor: Chen, Chung-Ho (陳中和)
Degree: Master
Department: College of Electrical Engineering & Computer Science - Institute of Computer & Communication Engineering
Year of Publication: 2016
Graduation Academic Year: 104 (ROC calendar)
Language: English
Number of Pages: 72
Chinese Keywords (translated): Compiler-Assisted Analysis, MIMD, OpenCL Kernel Attributes, Operation Mode Prediction, SIMT
English Keywords: Compiler Analysis, MIMD, OpenCL Kernel Attribute, Operation Mode Prediction, SIMT
  • Chinese Abstract (translated): In modern high-performance computing, parallel computing has become increasingly important. The OpenCL framework allows programmers to implement parallel programs on heterogeneous platforms quickly and conveniently. Both MIMD and SIMT platforms can run OpenCL programs as long as they support the OpenCL framework, but the two execute these programs differently. By observing our dual-mode platform, we found that certain kernel attributes severely affect execution performance, so running each program in the operation mode that suits it is critical. This thesis therefore proposes a compiler-assisted analysis framework consisting of two parts: a kernel attribute analyzer and a performance prediction model. The framework analyzes kernel attributes and predicts each kernel's best execution mode with 95% accuracy. Owing to this high accuracy, running kernels in the correct mode yields an average 1.5x speedup over the unsuitable mode across 70 benchmarks.

    In the field of high-performance computing, parallel computing has become increasingly important and broadly used. Since the OpenCL framework provides a friendly and convenient environment, programmers can implement parallel programs on any heterogeneous platform that supports the OpenCL model. Both SIMT and MIMD architectures can execute OpenCL applications on their own, provided they support the OpenCL framework, but each executes those applications with different characteristics. Through observations on our dual-mode SIMT/MIMD platform, we find that some kernel features have a dramatic impact on performance; thus, running each OpenCL application in the proper operation mode is an important issue. In this thesis, we design a compiler analysis framework comprising code-feature-analysis tools and prediction training models that predict the most suitable execution mode for each kernel, so that every kernel can be executed in its best mode. The framework achieves 95% prediction accuracy and, across 70 benchmarks, yields an average speedup of 1.5x compared with running in the improper mode.
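    The two-part framework in the abstract (an attribute analyzer feeding a mode predictor) can be illustrated with a minimal Python sketch. The four feature names follow the factors listed in Chapter 5 (synchronization frequency, data locality, initialization overhead, branch divergence), but the decision rules, thresholds, and example values below are purely hypothetical illustrations, not the thesis's trained decision-tree or SVM models.

    ```python
    # Illustrative sketch only: hand-rolled decision rules standing in for
    # the thesis's trained models. All thresholds are invented for the example.

    def predict_mode(features):
        """Return 'SIMT' or 'MIMD' from static kernel attributes (hypothetical rules)."""
        # Frequent barriers are cheap under lock-step SIMT execution.
        if features["sync_freq"] > 0.10:
            return "SIMT"
        # Heavy branch divergence serializes SIMT lanes, so prefer MIMD.
        if features["branch_divergence"] > 0.30:
            return "MIMD"
        # Strong data locality suits SIMT's grouped memory accesses.
        if features["data_locality"] > 0.50:
            return "SIMT"
        # Otherwise, high per-kernel initialization overhead favors MIMD.
        return "MIMD" if features["init_overhead"] > 0.20 else "SIMT"

    # Two hypothetical kernels with analyzer-produced feature values.
    kernels = {
        "matmul": {"sync_freq": 0.15, "branch_divergence": 0.05,
                   "data_locality": 0.80, "init_overhead": 0.02},
        "bfs":    {"sync_freq": 0.02, "branch_divergence": 0.45,
                   "data_locality": 0.20, "init_overhead": 0.10},
    }
    for name, feats in kernels.items():
        print(name, predict_mode(feats))  # matmul SIMT, bfs MIMD
    ```

    In the actual framework these rules would be replaced by the trained models of Section 5.3 (decision trees and grouped SVMs with credit voting), with features extracted by the compiler and assembler analysis of Section 5.2.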

    Abstract (Chinese) I
    Abstract II
    List of Tables VII
    List of Figures VIII
    Chapter 1 Introduction 1
      1.1 Motivation 1
      1.2 Contribution 1
      1.3 Organization 2
    Chapter 2 Background 3
      2.1 OpenCL Framework 3
        2.1.1 OpenCL Platform Model 3
        2.1.2 OpenCL Execution Model 5
        2.1.3 OpenCL Memory Hierarchy 6
        2.1.4 OpenCL Runtime System Architecture 7
      2.2 LLVM Framework 8
        2.2.1 LLVM Front-End 9
        2.2.2 LLVM Intermediate Representation 11
        2.2.3 LLVM Back-End 11
        2.2.4 Clang Plugins 12
      2.3 Machine Learning Training Algorithms 12
        2.3.1 Decision Tree Learning 13
        2.3.2 Support Vector Machine 13
    Chapter 3 Related Work 15
      3.1 OpenCL Applications in MIMD Mode 15
      3.2 OpenCL Applications in SIMT-Mode GPGPU 16
      3.3 Scheduling OpenCL Applications on Heterogeneous Platforms 16
    Chapter 4 SIMT/MIMD Dual-Mode Processor Architecture 17
      4.1 Architecture Overview 17
      4.2 SIMT Mode Operation 18
      4.3 MIMD Mode Operation 19
      4.4 Dual-Mode Performance Discussion 20
    Chapter 5 OpenCL Kernel Code Attribute Prediction 22
      5.1 Performance Prediction Factors 22
        5.1.1 Static Synchronization Frequency 22
        5.1.2 Data Locality 25
        5.1.3 Initialization Overhead 29
        5.1.4 Branch Divergence 31
      5.2 Compiler and Assembler Analysis 33
        5.2.1 Overview of the Compiler Analysis Framework 33
        5.2.2 Static Kernel Code Analysis 34
        5.2.3 Assembly Code Analysis 38
      5.3 Prediction Training Models 41
        5.3.1 Decision Tree Learning and Weighted Decision Tree Learning 43
        5.3.2 Support Vector Machine and Grouped Support Vector Machine 44
        5.3.3 Grouped Support Vector Machine with Credit Voting Mechanism 46
    Chapter 6 Experiment Results 48
      6.1 Training Results 52
        6.1.1 Decision Tree Learning 52
        6.1.2 Weighted Decision Tree 54
        6.1.3 Support Vector Machine 56
        6.1.4 Grouped Support Vector Machine 58
        6.1.5 Weighted Decision Tree with Voting Grouped Support Vector Machine 60
        6.1.6 Discussion of Prediction Accuracy of the Training Models 64
      6.2 Hybrid Mode Operation Result 65
    Chapter 7 Conclusion 67
    References 68

    [1] Khronos Group Inc., “The OpenCL Specification, Version 2.0,” 2014. [Online]
    Available: https://www.khronos.org/registry/cl/specs/opencl-2.0.pdf
    [2] The LLVM Compiler Infrastructure. [Online] Available: http://llvm.org/
    [3] clang: a C language family frontend for LLVM. [Online]
    Available: http://clang.llvm.org/
    [4] clang: Driver Design & Internals. [Online]
    Available: http://clang.llvm.org/docs/DriverInternals.html
    [5] clang: C Interface to Clang. [Online]
    Available: http://clang.llvm.org/doxygen/group__CINDEX.html
    [6] Decision tree learning. [Online]
    Available: https://en.wikipedia.org/wiki/Decision_tree_learning
    [7] Support vector machine. [Online]
    Available: https://en.wikipedia.org/wiki/Support_vector_machine
    [8] H-S. Kim, I. Hajj, J. Stratton, S. Lumetta, and W-M. Hwu, “Locality-Centric Thread Scheduling for Bulk-Synchronous Programming Models on CPU Architectures,” in Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’15), pp. 257-268, Feb. 2015.
    [9] J. Stratton, S. Stone, and W-M. Hwu, “Efficient Compilation of Fine-grained SPMD-threaded Programs for Multicore CPUs,” in Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’10), pp. 111-119, Apr. 2010.
    [10] J. Kim, S. Seo, J. Lee, J. Nah, G. Jo, and J. Lee, “SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters,” in Proceedings of the 26th ACM International Conference on Supercomputing (ICS), pp. 341-352, Jun. 2012.
    [11] G. Jo, W. Jeon, W. Jung, G. Taft, and J. Lee, “OpenCL framework for ARM processors with NEON support,” in the Proceedings of the 2014 Workshop on Programming Models for SIMD/Vector Processing, pp. 33-40, 2014.
    [12] P. Jääskeläinen, C. de La Lama, E. Schnetter, K. Raiskila, J. Takala, and H. Berg, “pocl: A Performance-Portable OpenCL Implementation,” International Journal of Parallel Programming, 43(5), pp. 752-785, 2014.
    [13] J. Gummaraju, L. Morichetti, M. Houston, B. Sander, B. Gaster, and B. Zheng, “Twin Peaks: a Software Platform for Heterogeneous Computing on General-Purpose and Graphics Processors,” in International Conference on Parallel Architecture and Compilation Techniques (PACT’10), pp. 205-216, Sep. 2010.
    [14] J. Lee, J. Kim, S. Seo, S. Kim, J. Park, and H. Kim, “An OpenCL Framework for Heterogeneous Multicores with Local Memory,” in International Conference on Parallel Architecture and Compilation Techniques (PACT’10), pp. 193-204, Sep. 2010.
    [15] G. Diamos, B. Ashbaugh, S. Maiyuran, A. Kerr, H. Wu, and S. Yalamanchili, “SIMD Re-Convergence At Thread Frontiers,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 477-488, Dec. 2011.
    [16] N. Brunie, S. Collange, and G. Diamos, “Simultaneous Branch and Warp Interweaving for Sustained GPU Performance,” in Proceedings of 39th Annual International Symposium on Computer Architecture (ISCA), pp. 49-60, Jun. 2012.
    [17] W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” in ACM Transactions on Architecture and Code Optimization (TACO), Vol. 6, No. 2, Article 7, 2009.
    [18] W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow,” in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’07), pp. 407-420, Dec. 2007.
    [19] W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” in Proceedings of the 17th International Symposium on High Performance Computer Architecture (HPCA’11), pp. 25-36, Feb. 2011.
    [20] M. Rhu and M. Erez, “The Dual-Path Execution Model for Efficient GPU Control Flow,” in Proceedings of the 19th International Symposium on High Performance Computer Architecture (HPCA’13), pp. 591-602, Feb. 2013.
    [21] Y. Wen, Z. Wang, and M. O’Boyle, “Smart Multi-Task Scheduling for OpenCL Programs on CPU/GPU Heterogeneous Platforms,” in Proceedings of the 21st International Conference on High Performance Computing (HiPC), pp. 1-10, Dec. 2014.
    [22] V. Ravi, M. Becchi, W. Jiang, G. Agrawal, and S. Chakradhar, “Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes,” in International Symposium on Cluster, Cloud and Grid Computing (CCGRID’12), pp. 140-147, May 2012.
    [23] P. Phothilimthana, J. Ansel, J. Ragan-Kelley, and S. Amarasinghe, “Portable Performance on Heterogeneous Architectures,” in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’13), pp. 431-444, Mar. 2013.
    [24] D. Grewe, Z. Wang, and M. O’Boyle, “OpenCL Task Partitioning in the Presence of GPU Contention,” in Languages and Compilers for Parallel Computing (LCPC), Lecture Notes in Computer Science, vol. 8664, pp. 87-101, Oct. 2014.
    [25] M. Steuwer, “Performance Prediction of OpenCL Applications in Parallel Heterogeneous Systems,” HPC-EUROPA2 project (project number: 228398)
    [26] K-C. Chen, Y. Chi, C-W. Lin, and C-H. Chen, “Spatiotemporal SIMT Design on Multiprocessor for Efficient Data-Parallel Processing,” submitted to IEEE Transactions on Computers, July 2016.
    [27] NVidia OpenCL Benchmarks [Online]
    Available: https://github.com/ashwinraghav/Grafight/tree/master/gpgpu-sim/benchmarks/OpenCL
    [28] Rodinia OpenCL Benchmarks [Online]
    Available: http://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php/Rodinia:Accelerating_Compute-Intensive_Applications_with_Accelerators
    [29] AMD OpenCL Benchmarks [Online]
    Available: https://developer.nvidia.com/cuda-toolkit
    [30] Parboil OpenCL Benchmarks [Online]
    Available: https://github.com/abduld/Parboil

    Full-text availability: on campus 2021-08-31; off campus 2021-08-31.