National Cheng Kung University Electronic Theses and Dissertations System

Author:	黃昀棨 Huang, Yun-Chi
Thesis title:	Dynamic SIMD Re-convergence with Paired-Path Comparison (使用對徑比較法之動態單指令多數據流收斂)
Advisor:	陳中和 Chen, Chung-Ho
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of publication:	2015
Academic year of graduation:	103 (ROC calendar)
Language:	English
Pages:	61
Keywords:	GPGPU, OpenCL, SIMD Control Divergence

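In current GPGPU (General-Purpose Graphics Processing Unit) architectures, SIMD divergence is one of the main causes of degraded parallel computing performance. We evaluate a GPU simulator based on the HSAIL instruction set and run OpenCL kernels on it to observe GPU performance and results. The smallest unit of execution in SIMD is the wavefront, which is analogous to a thread in SISD. When a wavefront executes a conditional branch and its work-items evaluate the branch condition differently, work-items in the same wavefront must execute different instructions; this situation is called control divergence. Once control divergence occurs, an auxiliary mechanism is needed so that a wavefront can let different work-items execute different instructions in turn. Handling control divergence in this way requires cooperation between the compiler and the GPU, and the choice of handling algorithm affects GPU performance under control divergence. This thesis proposes a new stack-based re-convergence mechanism that lets a wavefront re-converge on its own during execution. The mechanism can operate with or without the re-convergence hint instructions generated by the finalizer; operating without them removes the need for compiler support and avoids executing extra instructions. With this dynamic re-convergence method, the GPU gains an average 13.36% improvement in activity factor when running programs with irregular control flow. The re-convergence method that does not rely on hint instructions improves overall execution performance by saving the time otherwise spent executing extra instructions.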
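As a rough illustration of what a stack-based re-convergence mechanism does, the sketch below simulates one wavefront executing an if/else. It is not the paired-path scheme proposed in the thesis; it is the conventional scheme in which the compiler supplies a re-convergence PC with each branch, which is the kind of hint the thesis makes optional. The toy instruction format, the names `Entry`, `run`, and `PROGRAM`, and the 4-lane wavefront are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    pc: int    # next instruction for this group of lanes
    mask: int  # bit i set means lane i is active
    rpc: int   # PC at which this group re-converges (-1: none)


def run(program, cond_mask, width):
    """Execute a toy program for one wavefront; return (label, mask) per op."""
    stack = [Entry(0, (1 << width) - 1, -1)]
    trace = []
    while stack:
        top = stack[-1]
        if top.pc == top.rpc:          # path reached its re-convergence point
            stack.pop()
            continue
        kind, *args = program[top.pc]
        if kind == "exit":
            stack.pop()
        elif kind == "op":
            trace.append((args[0], top.mask))
            top.pc += 1
        elif kind == "jump":
            top.pc = args[0]
        elif kind == "branch":
            target, rpc = args
            fallthrough = top.pc + 1
            taken = top.mask & cond_mask
            not_taken = top.mask & ~cond_mask
            if taken == 0:             # uniform: nobody takes the branch
                top.pc = fallthrough
            elif not_taken == 0:       # uniform: everybody takes the branch
                top.pc = target
            else:                      # divergence: split into two paths
                top.pc = rpc           # current entry waits at the join point
                stack.append(Entry(fallthrough, not_taken, rpc))
                stack.append(Entry(target, taken, rpc))
    return trace


# Hypothetical program: lanes whose condition bit is set run the "then" side.
PROGRAM = [
    ("op", "setup"),        # 0
    ("branch", 4, 6),       # 1: taken lanes go to 4, both paths rejoin at 6
    ("op", "else_body"),    # 2
    ("jump", 6),            # 3
    ("op", "then_body_1"),  # 4
    ("op", "then_body_2"),  # 5
    ("op", "join"),         # 6
    ("exit",),              # 7
]

if __name__ == "__main__":
    for label, mask in run(PROGRAM, cond_mask=0b0011, width=4):
        print(f"{label:12s} active lanes = {mask:04b}")
```

With `cond_mask=0b0011`, the two branch paths are issued one after the other with partial masks (`0011`, then `1100`), and the full mask `1111` is restored at `join`. A dynamic scheme such as the one the thesis describes aims to detect that paths have met without always relying on the `rpc` hint.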

SIMD divergence is one of the critical causes of reduced parallel computing efficiency in contemporary GPGPU (General-Purpose Graphics Processing Unit) architectures. In this thesis, we evaluate a cycle-accurate GPU simulator platform based on HSAIL under the OpenCL framework by offloading kernel programs to the simulator. A wavefront (“wavefront” in AMD terminology, “warp” in NVIDIA terminology) is a group of threads that execute the same instruction in SIMD fashion. When a wavefront executes a conditional branch instruction, its threads may proceed to distinct PCs if they have different branch targets; this is called SIMD control divergence. Re-convergence mechanisms help a divergent wavefront execute its instructions correctly. We develop a new dynamic stack-based re-convergence scheme that can be implemented with or without finalizer-generated re-convergence instructions. With the proposed scheme, a divergent warp re-converges dynamically and gains a 13.36% activity factor improvement on average from opportunistic early re-convergence in unstructured control flow. Performance is better when warps re-converge without finalizer-generated hint instructions.
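To make the activity factor metric concrete, the following sketch computes it from a list of per-instruction active masks. It assumes the common definition (the fraction of SIMD lanes active, averaged over issued instructions); the thesis may define it with additional detail. The function name `activity_factor` and both mask sequences are made up for illustration and are not data from the thesis.

```python
def activity_factor(masks, width):
    """Average fraction of active lanes over all issued instructions."""
    if not masks:
        raise ValueError("need at least one issued instruction")
    active = sum(bin(m).count("1") for m in masks)
    return active / (len(masks) * width)


# Hypothetical 4-lane wavefront whose two divergent paths share a common block.
# Late re-convergence: the shared block is issued once per path (0011, 1100).
late = [0b1111, 0b0011, 0b1100, 0b0011, 0b1100, 0b1111]
# Early re-convergence: the paths merge before the shared block (1111 once).
early = [0b1111, 0b0011, 0b1100, 0b1111, 0b1111]

if __name__ == "__main__":
    print(round(activity_factor(late, 4), 3))   # 0.667
    print(round(activity_factor(early, 4), 3))  # 0.8
```

In this toy case, merging the paths before the shared block removes one partially filled issue slot, which raises the activity factor from about 0.667 to 0.8. This is the kind of effect that opportunistic early re-convergence targets in unstructured control flow.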

Chapter 1	Introduction
1	Motivation
2	Contribution
3	Organization
Chapter 2	Background
1	OpenCL Programming
1.1	OpenCL Platform and Execution Model
1.2	OpenCL Memory Model
1.3	OpenCL Framework
2	Heterogeneous System Architecture (HSA)
2.1	HSA Feature
2.2	HSAIL
3	General Purpose Computing on Graphics Processing Units (GPGPU)
3.1	Workitems of a Kernel mapping to a SM
3.2	Streaming Multiprocessors
3.3	Warp Scheduling
3.4	SIMD Divergence and Re-convergence Schemes
Chapter 3	Related Work
1	Dual-Path Execution Model
1.1	Execution Example
2	Implicit Stack-less Re-convergence
2.1	Re-convergence Mechanism
2.2	Divergent Control Flow Traversal
3	Unstructured Control Flow
Chapter 4	Dynamic Re-convergence in Dual-Path Stack
1	Observation
2	Re-convergence with Dynamic Paired-Path Comparison
2.1	Re-convergence Schemes Algorithm
2.2	Divergent Control Flow Traversal
2.3	Re-convergence Detection Methods
2.4	Behavior with Synchronization Barrier
2.5	Divergence Stack Implementation
Chapter 5	GPU Simulation Platform
1	Overview of HSAIL GPU Simulation Platform
2	Streaming Multiprocessor Pipeline
3	Finalizer
4	Configuration
Chapter 6	Benchmarks and Evaluation
1	Benchmarks
2	Evaluation
2.1	Activity Factor
2.2	LD/ST Unit Idle Ratio
2.3	SIMD Unit Utilization
2.4	Dynamic Instruction Counts
2.5	Overall Performance
Chapter 7	Conclusion
Reference

                                    

On campus: publicly available from 2016-08-18
Off campus: publicly available from 2016-08-18
