
Graduate student: Tsai, Sen-Chih (蔡森至)
Thesis title: Optimization of Workgroup Scheduling on CASLAB-GPUSIM (original title: 繪圖處理器之執行緒區塊排程優化與其在CASLAB-GPUSIM上之實現)
Advisor: Chen, Chung-Ho (陳中和)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Institute of Computer & Communication Engineering
Year of publication: 2017
Academic year of graduation: 105 (ROC calendar)
Language: Chinese
Number of pages: 60
Keywords: Graphics processing units, multikernel, multiprogramming, scheduling, thread-level parallelism
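  • Applications of general-purpose graphics processing units (GPGPUs) are receiving growing attention. Our laboratory has built CASLAB-GPUSIM, a GPGPU simulation platform based on the Single Instruction Multiple Thread architecture and written in the high-level language SystemC. The platform also includes a memory subsystem and a software programming interface, and it has passed verification programs taken from Rodinia, AMD, and NVIDIA.
    This thesis examines the performance of workgroup (thread block) scheduling on GPGPUs. It proposes the Kernel Aware Warp Scheduler (KWS) to ease the imbalance in how kernel workloads use hardware resources. KWS must be paired with Mixed Concurrent Kernel Execution in the workgroup scheduler, so that different kernels run on the same streaming multiprocessor; it then adjusts warp priorities according to kernel and instruction category, raising hardware utilization and thereby improving performance. The thesis also proposes the Profiling Based Workgroup Scheduler (PBWS) to ease the imbalance between kernel demands and memory subsystem resources. PBWS first uses static analysis to set an initial limit on the number of workgroups, then uses dynamic analysis to adjust the workgroup limit inside each streaming multiprocessor step by step. Finally, these mechanisms are implemented on the CASLAB-GPUSIM platform and evaluated experimentally for improvements in hardware utilization, cache hit rate, and performance.
    In summary, when the GPU concurrently executes one arithmetic-intensive kernel and one memory-intensive kernel, KWS improves performance by about 20%; when the GPU executes only a single kernel, PBWS improves performance by about 11%.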
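    The KWS idea of ranking warps by kernel and by instruction category can be illustrated with a minimal sketch. The abstract does not give the actual priority rules, so everything below is an assumption made for illustration: the `Warp` fields, the "compute"/"memory" and "alu"/"mem" labels, the `mem_pipeline_busy` signal, and the rule that favors ALU work while the memory pipeline is busy are all hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Warp:
    warp_id: int
    kernel_class: str  # hypothetical label: "compute" or "memory"
    next_instr: str    # hypothetical label: "alu" or "mem"


def pick_warp(ready_warps: List[Warp], mem_pipeline_busy: bool) -> Optional[Warp]:
    """Pick the next warp to issue under an assumed kernel-aware policy.

    Assumed rule (not from the thesis): while the memory pipeline is busy,
    prefer ALU instructions, ideally from the compute-heavy kernel;
    otherwise prefer memory instructions, ideally from the memory-heavy
    kernel, so that their latency starts being overlapped early.
    """
    if not ready_warps:
        return None

    preferred_instr = "alu" if mem_pipeline_busy else "mem"
    preferred_kernel = "compute" if mem_pipeline_busy else "memory"

    def rank(w: Warp):
        # False sorts before True, so matching warps come first;
        # warp_id breaks ties deterministically.
        return (w.next_instr != preferred_instr,
                w.kernel_class != preferred_kernel,
                w.warp_id)

    return min(ready_warps, key=rank)
```

    With warps from two kernels resident on the same streaming multiprocessor, this sketch issues from the compute-heavy kernel while memory requests are outstanding and switches back to the memory-heavy kernel once the memory pipeline frees up.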

    General-purpose graphics processing units (GPGPUs) have become increasingly important in recent years. We developed CASLAB-GPUSIM, a GPGPU simulation platform written in SystemC and based on the single instruction, multiple thread (SIMT) architecture. The platform also includes the memory subsystem and the software toolchain, and is verified with benchmarks from Rodinia, AMD, and NVIDIA.
    This thesis explores the performance problems of workgroup scheduling and warp scheduling on CASLAB-GPUSIM and proposes two methods. The first is KWS, a kernel-aware warp scheduler, which has to be used with mixed concurrent kernel execution. KWS prioritizes warps by the kernel they belong to and by instruction type, easing the imbalance between kernel workload and hardware resources. The second is PBWS, a profiling-based workgroup scheduler, which restricts the maximum number of workgroups allocated to each streaming multiprocessor. PBWS mitigates the imbalance between the memory requests issued by a kernel and the capacity of the memory subsystem. Both mechanisms are implemented in CASLAB-GPUSIM and evaluated with the benchmarks. KWS with mixed concurrent kernel execution yields a 20% speedup over traditional concurrent kernel execution with a Loose Round-Robin warp scheduler. PBWS yields an 11% speedup over a Round-Robin workgroup scheduler.
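    The two-phase throttling in PBWS (a static initial limit followed by stepwise dynamic adjustment per streaming multiprocessor) can be sketched as follows. This is only a sketch under stated assumptions: the abstract does not say which quantities the static or dynamic analysis uses, so the cache-footprint estimate, the cache hit rate signal, the thresholds `low` and `high`, and both function names are hypothetical.

```python
def initial_limit(hw_max_workgroups: int,
                  cache_bytes: int,
                  footprint_bytes_per_workgroup: int) -> int:
    """Static phase (assumed): cap workgroups so their data fits in cache."""
    if footprint_bytes_per_workgroup <= 0:
        return hw_max_workgroups
    fit = cache_bytes // footprint_bytes_per_workgroup
    # Always allow at least one workgroup, never more than the hardware limit.
    return max(1, min(hw_max_workgroups, fit))


def adjust_limit(limit: int,
                 hit_rate: float,
                 hw_max_workgroups: int,
                 low: float = 0.4,
                 high: float = 0.8) -> int:
    """Dynamic phase (assumed): move the per-SM limit one step per sample.

    A low cache hit rate is read as contention, so the limit drops by one;
    a high hit rate is read as headroom, so the limit rises by one.
    The thresholds are illustrative values, not figures from the thesis.
    """
    if hit_rate < low:
        limit -= 1
    elif hit_rate > high:
        limit += 1
    return max(1, min(hw_max_workgroups, limit))
```

    In this sketch each streaming multiprocessor would call `adjust_limit` once per sampling window with its own hit rate, so the limits can diverge across multiprocessors while staying within the hardware maximum.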

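    Abstract; Acknowledgements; Table of Contents; List of Tables; List of Figures
    Chapter 1 Introduction
        1.1 Motivation
        1.2 Contributions
        1.3 Organization
    Chapter 2 Background and Related Work
        2.1 General-Purpose Graphics Processing Units
            2.1.1 Single Instruction, Multiple Threads
            2.1.2 Kernel Execution
            2.1.3 Workgroup Scheduling
            2.1.4 Warp Scheduling
            2.1.5 Control Flow and Divergence
        2.2 Related Work
    Chapter 3 Optimization of Workgroup Scheduling
        3.1 Mixed Workload Mechanism
            3.1.1 Mixed Concurrent Kernel Execution
            3.1.2 KWS: Kernel Aware Warp Scheduler
        3.2 Thread Throttling Mechanism
            3.2.1 Static Kernel Analysis
            3.2.2 Dynamic Kernel Analysis
    Chapter 4 CASLAB-GPUSIM Software-Layer Simulation and Implementation of the Workgroup Scheduler
        4.1 CASLAB-GPUSIM Platform Overview
            4.1.1 GPU Instruction Set
        4.2 Runtime System
            4.2.1 OpenCL Runtime
            4.2.2 HSA Runtime
        4.3 Driver Layer
        4.4 Hardware Design of the Experimental Platform
            4.4.1 Task Dispatch Unit
            4.4.2 Streaming MultiProcessor
            4.4.3 Memory Subsystem
    Chapter 5 Experimental Evaluation
        5.1 Experimental Environment
            5.1.1 Configuration Settings
            5.1.2 Benchmark Programs
        5.2 Evaluation Results
            5.2.1 Experimental Evaluation of Mixed Workload
            5.2.2 Analysis of Mixed Workload
            5.2.3 Experimental Evaluation of Thread Throttling
    Chapter 6 Conclusion
        6.1 Discussion of Experimental Results
    References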


    Full-text availability: on campus, immediately public; off campus, immediately public