Graduate student: 黃昀棨 (Huang, Yun-Chi)
Thesis title: Dynamic SIMD Re-convergence with Paired-Path Comparison (使用對徑比較法之動態單指令多數據流收斂)
Advisor: 陳中和 (Chen, Chung-Ho)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of publication: 2015
Academic year of graduation: 103 (ROC calendar)
Language: English
Number of pages: 61
Keywords: GPGPU, OpenCL, SIMD Control Divergence
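  • In current GPGPU (General Purpose Graphic Processor Unit) architectures, SIMD divergence is one of the main causes of reduced parallel computing performance. We evaluate a GPU simulator based on the HSAIL instruction set and run OpenCL kernels on it to observe GPU performance and results. The smallest execution unit in SIMD is the wavefront, which corresponds to a thread in SISD. When a wavefront executes a conditional branch and its workitems evaluate the branch condition differently, workitems in the same wavefront must execute different instructions; this situation is called control divergence. Once control divergence occurs, an auxiliary mechanism is needed so that the wavefront can let different workitems execute different instructions in turn. Handling control divergence this way requires cooperation between the compiler and the GPU, and the choice of handling algorithm affects GPU performance under control divergence. This thesis proposes a new stack-based re-convergence mechanism that lets a wavefront re-converge on its own during execution. The mechanism can work with or without the re-convergence hint instructions generated by the finalizer; without them, it needs no compiler support and executes no redundant instructions. With this dynamic re-convergence method, the GPU gains an average 13.36% improvement in activity factor when running programs with irregular control flow. The variant that does not rely on hint instructions improves overall execution performance by saving the time spent executing redundant instructions.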
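To make the activity factor figure more concrete, here is a minimal sketch. It assumes the common definition of activity factor as the fraction of SIMD lanes that are active, averaged over the instructions a wavefront issues. The function name `activity_factor`, the 8-lane width, and the example masks are illustrative assumptions and are not taken from the thesis.

```python
from typing import Sequence


def activity_factor(active_masks: Sequence[int], width: int) -> float:
    """Average fraction of active lanes over the issued instructions.

    Each entry of active_masks is a bitmask with one bit per lane
    (bit i set means lane i executes that instruction).
    """
    if not active_masks:
        raise ValueError("no instructions were issued")
    lane_bits = (1 << width) - 1  # ignore any bits beyond the wavefront width
    active = sum(bin(mask & lane_bits).count("1") for mask in active_masks)
    return active / (len(active_masks) * width)


# Hypothetical 8-lane wavefront: one converged instruction, the two sides
# of a divergent branch issued one after the other, then one converged
# instruction after re-convergence.
trace = [0b11111111, 0b00001111, 0b11110000, 0b11111111]
print(activity_factor(trace, width=8))  # 24 active lane slots out of 32
```

Under this definition, re-converging earlier raises the value because fewer instructions are issued with a partial mask.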

    SIMD divergence is one of the critical causes of reduced parallel computing efficiency in contemporary GPGPU (General Purpose Graphic Processor Unit) architectures. In this thesis, we evaluate a cycle-accurate GPU simulation platform based on HSAIL under the OpenCL framework by offloading kernel programs to the simulator. A wavefront (AMD terminology; "warp" in NVIDIA terminology) is a group of threads that execute the same instruction in SIMD fashion. When a wavefront executes a conditional branch instruction, its threads may go to distinct PCs if they have different branch targets; this is called SIMD control divergence. Re-convergence mechanisms help a divergent wavefront execute its instructions correctly. We develop a new dynamic stack-based re-convergence scheme that can be implemented with or without finalizer-generated re-convergence instructions. With the proposed scheme, a divergent wavefront re-converges dynamically and gains a 13.36% activity factor improvement on average from opportunistic early re-convergence in unstructured control flow. Performance is better when the wavefront re-converges without finalizer-generated hint instructions.
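As background for the abstract, the following is a minimal sketch of a conventional stack-based re-convergence scheme, in which each branch carries an explicit re-convergence PC (comparable to a compiler-generated hint). It is not the paired-path comparison scheme the thesis proposes. The instruction encoding, the function `run_wavefront`, and the 4-lane example program are all hypothetical.

```python
def run_wavefront(program, width):
    """Run one wavefront and return the issued (pc, active_mask) pairs.

    Instructions are tuples: ("op",), ("jmp", target) or
    ("br", cond, target, reconv), where cond(lane) decides per lane
    whether the branch is taken and reconv is the re-convergence PC.
    """
    full = (1 << width) - 1
    # Each stack entry is [next pc, re-convergence pc, active mask].
    stack = [[0, None, full]]
    trace = []
    while stack:
        pc, rpc, mask = stack[-1]
        if pc == rpc or pc >= len(program):
            stack.pop()  # path reached its re-convergence point or the end
            continue
        inst = program[pc]
        trace.append((pc, mask))
        if inst[0] == "op":
            stack[-1][0] = pc + 1
        elif inst[0] == "jmp":
            stack[-1][0] = inst[1]
        else:  # "br"
            _, cond, target, reconv = inst
            taken = 0
            for lane in range(width):
                if (mask >> lane) & 1 and cond(lane):
                    taken |= 1 << lane
            not_taken = mask & ~taken
            if taken and not_taken:
                # Divergence: the current entry waits at reconv with its
                # original mask while the two paths run one after the other.
                stack[-1][0] = reconv
                stack.append([pc + 1, reconv, not_taken])
                stack.append([target, reconv, taken])
            else:
                stack[-1][0] = target if taken else pc + 1
    return trace


# Hypothetical if/else: even lanes take the branch to pc 4, odd lanes fall
# through to pc 2, and both paths re-converge at pc 5.
program = [
    ("op",),
    ("br", lambda lane: lane % 2 == 0, 4, 5),
    ("op",),
    ("jmp", 5),
    ("op",),
    ("op",),
]
trace = run_wavefront(program, width=4)
for pc, mask in trace:
    print(pc, format(mask, "04b"))
```

In the printed trace, the instructions at PCs 2 to 4 are issued with only half of the lanes active, and the full mask is restored at PC 5. A dynamic scheme of the kind the abstract describes aims to detect such merge points during execution rather than relying only on a fixed re-convergence PC.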

    Chapter 1 Introduction
      1.1 Motivation
      1.2 Contribution
      1.3 Organization
    Chapter 2 Background
      2.1 OpenCL Programming
        2.1.1 OpenCL Platform and Execution Model
        2.1.2 OpenCL Memory Model
        2.1.3 OpenCL Framework
      2.2 Heterogeneous System Architecture (HSA)
        2.2.1 HSA Feature
        2.2.2 HSAIL
      2.3 General Purpose Computing on Graphics Processing Units (GPGPU)
        2.3.1 Workitems of a Kernel mapping to a SM
        2.3.2 Streaming Multiprocessors
        2.3.3 Warp Scheduling
        2.3.4 SIMD Divergence and Re-convergence Schemes
    Chapter 3 Related Work
      3.1 Dual-Path Execution Model
        3.1.1 Execution Example
      3.2 Implicit Stack-less Re-convergence
        3.2.1 Re-convergence Mechanism
        3.2.2 Divergent Control Flow Traversal
      3.3 Unstructured Control Flow
    Chapter 4 Dynamic Re-convergence in Dual-Path Stack
      4.1 Observation
      4.2 Re-convergence with Dynamic Paired-Path Comparison
        4.2.1 Re-convergence Schemes Algorithm
        4.2.2 Divergent Control Flow Traversal
        4.2.3 Re-convergence Detection Methods
        4.2.4 Behavior with Synchronization Barrier
        4.2.5 Divergence Stack Implementation
    Chapter 5 GPU Simulation Platform
      5.1 Overview of HSAIL GPU Simulation Platform
      5.2 Streaming Multiprocessor Pipeline
      5.3 Finalizer
      5.4 Configuration
    Chapter 6 Benchmarks and Evaluation
      6.1 Benchmarks
      6.2 Evaluation
        6.2.1 Activity Factor
        6.2.2 LD/ST Unit Idle Ratio
        6.2.3 SIMD Unit Utilization
        6.2.4 Dynamic Instruction Counts
        6.2.5 Overall Performance
    Chapter 7 Conclusion
    Reference


    Full text availability: on campus, public from 2016-08-18
    off campus, public from 2016-08-18