| Author: | 蔡森至 Tsai, Sen-Chih |
|---|---|
| Thesis title: | 繪圖處理器之執行緒區塊排程優化與其在CASLAB-GPUSIM上之實現 Optimization of Workgroup Scheduling on CASLAB-GPUSIM |
| Advisor: | 陳中和 Chen, Chung-Ho |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering |
| Year of publication: | 2017 |
| Academic year of graduation: | 105 (ROC calendar) |
| Language: | Chinese |
| Pages: | 60 |
| Keywords (Chinese): | 繪圖處理器, multikernel, 多元程式, 排程, 執行緒級平行處理 |
| Keywords (English): | Graphics processing units, multikernel, multiprogramming, scheduling, thread-level parallelism |
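General-purpose graphics processing units (GPGPUs) are attracting growing attention. Our laboratory has built CASLAB-GPUSIM, a GPGPU simulation platform based on the Single Instruction Multiple Thread (SIMT) architecture and written in the high-level language SystemC. The platform also includes a memory subsystem and a software programming interface, and it passes verification programs taken from Rodinia, AMD, NVIDIA, and other sources.
This thesis examines the performance of workgroup (thread block) scheduling on GPGPUs. It proposes the Kernel Aware Warp Scheduler (KWS) to ease the imbalance in how kernel workloads use hardware resources. KWS must be paired with Mixed Concurrent Kernel Execution in the workgroup scheduler, which lets different kernels run on the same streaming multiprocessor. KWS then adjusts warp priority according to the kernel and the instruction type, raising hardware utilization and thereby performance. The thesis also proposes the Profiling Based Workgroup Scheduler (PBWS) to ease the imbalance between kernel demand and memory subsystem resources. PBWS first uses static analysis to set an initial limit on the number of workgroups, then uses dynamic analysis to adjust the workgroup limit inside each streaming multiprocessor step by step. Both mechanisms are implemented on CASLAB-GPUSIM, and experiments evaluate the improvement in hardware utilization or cache hit rate as well as the resulting performance gain.
In summary, when the GPU concurrently runs one arithmetic-intensive kernel and one memory-intensive kernel, KWS improves performance by about 20%. When the GPU runs a single kernel, PBWS improves performance by about 11%.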
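The two-phase limit selection described for PBWS (a static initial limit, then step-by-step dynamic adjustment per streaming multiprocessor) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the cache-footprint heuristic for the static step, the use of cache hit rate as the profiling signal, and the `low`/`high` thresholds are all assumptions made for illustration.

```python
def initial_limit(hw_max_workgroups, cache_bytes, footprint_bytes_per_workgroup):
    """Static step (assumed heuristic): cap the workgroup count so that the
    estimated working set of the resident workgroups fits in the cache."""
    fit = cache_bytes // max(1, footprint_bytes_per_workgroup)
    return max(1, min(hw_max_workgroups, fit))


def adjust_limit(limit, hit_rate, hw_max_workgroups, low=0.3, high=0.7):
    """Dynamic step (assumed policy): after each sampling period, move the
    per-SM limit by one workgroup based on the observed cache hit rate."""
    if hit_rate < low:
        limit -= 1      # cache is thrashing: admit fewer workgroups
    elif hit_rate > high:
        limit += 1      # cache has headroom: admit more workgroups
    # Keep the limit inside the range the hardware can actually hold.
    return max(1, min(hw_max_workgroups, limit))


# Example: one SM, 16 KiB cache, 4 KiB estimated footprint per workgroup.
limit = initial_limit(hw_max_workgroups=8, cache_bytes=16384,
                      footprint_bytes_per_workgroup=4096)
for observed_hit_rate in (0.2, 0.5, 0.8):
    limit = adjust_limit(limit, observed_hit_rate, hw_max_workgroups=8)
```

Moving the limit by one workgroup per sampling period mirrors the "step by step" wording of the abstract. A real scheduler would also need a rule for when sampling periods begin and end, which the abstract does not specify.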
General-purpose graphics processing units (GPGPUs) have become increasingly important in recent years. We develop CASLAB-GPUSIM, a GPGPU simulation platform written in SystemC and based on the single instruction multiple thread (SIMT) architecture. The platform also includes the memory subsystem and the software toolchain, and is verified with benchmarks from Rodinia, AMD, and NVIDIA.
This thesis explores the performance problems of workgroup scheduling and warp scheduling on CASLAB-GPUSIM and proposes two methods. The first is KWS, a kernel-aware warp scheduler, which must be used with mixed concurrent kernel execution. KWS prioritizes warps by the kernel they belong to and by instruction type, easing the imbalance between kernel workload and hardware resources. The second is PBWS, a profiling-based workgroup scheduler, which restricts the maximum number of workgroups allocated to each streaming multiprocessor. PBWS mitigates the imbalance between the memory requests issued by a kernel and the capacity of the memory subsystem. Both mechanisms are implemented in CASLAB-GPUSIM and evaluated with the benchmarks. KWS with mixed concurrent kernel execution yields a 20% speedup over traditional concurrent kernel execution with a Loose Round-Robin warp scheduler. PBWS yields an 11% speedup over a Round-Robin workgroup scheduler.
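The idea of ranking warps by kernel and instruction type, as KWS does, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the actual priority rule, so the `PRIORITY` table, the `Warp` fields, and the tie-break on warp ID are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical priority table (lower value is issued first). The abstract only
# says that priority depends on the kernel and the instruction type; these
# particular values are illustrative assumptions.
PRIORITY = {
    ("memory", "mem"): 0,       # memory op from the memory-intensive kernel
    ("arithmetic", "alu"): 1,   # ALU op from the arithmetic-intensive kernel
    ("memory", "alu"): 2,
    ("arithmetic", "mem"): 3,
}


@dataclass
class Warp:
    warp_id: int
    kernel_class: str   # "arithmetic" or "memory"
    next_instr: str     # "alu" or "mem"
    ready: bool = True


def pick_warp(warps):
    """Return the ready warp with the best (kernel, instruction) priority,
    breaking ties by the lowest warp ID, or None if no warp is ready."""
    ready = [w for w in warps if w.ready]
    if not ready:
        return None
    return min(ready,
               key=lambda w: (PRIORITY[(w.kernel_class, w.next_instr)], w.warp_id))


# Two kernels share one streaming multiprocessor (mixed concurrent execution).
warps = [
    Warp(0, "arithmetic", "mem"),
    Warp(1, "arithmetic", "alu"),
    Warp(2, "memory", "mem"),
]
chosen = pick_warp(warps)
```

Because warps from both kernels sit in the same candidate pool, a table like this lets the scheduler keep the arithmetic units and the memory pipeline busy at the same time, which is the utilization benefit the abstract attributes to KWS.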