| Student | 邱健鳴 (Chiu, Chien-Ming) |
|---|---|
| Thesis title | GPU Warp Scheduling Using Memory Stall Sampling on CASLAB-GPUSIM (使用記憶體延遲取樣之繪圖處理器執行緒排程機制與其在CASLAB-GPUSIM上之實現) |
| Advisor | 陳中和 (Chen, Chung-Ho) |
| Degree | Master |
| Department | Institute of Computer & Communication Engineering, College of Electrical Engineering and Computer Science |
| Year of publication | 2017 |
| Academic year | 105 |
| Language | English |
| Pages | 75 |
| Keywords (Chinese) | 繪圖處理器, 執行緒排程機制, 記憶體競爭 |
| Keywords (English) | GPU, warp scheduling, memory contention |
In recent years, as applications that process large amounts of parallel data, such as data mining, machine learning, and image recognition, have become increasingly popular, graphics processing units (GPUs) have been widely adopted to accelerate these non-graphics workloads. Modern GPUs rely on massive numbers of concurrent threads and fine-grained multithreading to hide computation and pipeline latencies. However, recent studies have shown that memory contention is one of the most serious bottlenecks preventing modern GPUs from reaching peak performance. As the number of concurrent threads grows, memory contention worsens because the memory system becomes overloaded, while too few concurrent threads weaken the ability to hide computation and pipeline latencies. We propose Memory-Contention Aware Warp Scheduling to find a balance between memory-system resources and workload. The scheme uses dynamic sampling to accurately identify the severity of memory contention and to provide the most suitable degree of thread concurrency for each situation. Our experimental results show that, for cache-sensitive workloads, the proposed scheme achieves a geometric-mean speedup of up to 96.4% over the baseline loose round-robin scheduler on GPGPU-Sim. In addition, it achieves an overall performance improvement of 17.4% on CASLAB-GPUSIM.
In recent years, Graphics Processing Units (GPUs), well known for parallel computing, have been widely adopted to accelerate non-graphics workloads such as data mining, machine learning, and image recognition. Modern GPUs utilize a huge number of concurrent threads and fine-grained multithreading techniques to overlap operation latencies. However, recent research has shown that memory contention is one of the most important bottlenecks preventing modern GPUs from achieving peak performance. Memory contention becomes even more serious as the degree of multithreading rises, because the memory system is overloaded; conversely, latency-hiding ability is poor at a low degree of multithreading. We propose Memory-Contention Aware Warp Scheduling (MAWS) to strike a balance between memory workloads and memory resources. This scheme uses dynamic sampling to accurately recognize the severity of memory contention and to provide an appropriate degree of thread concurrency in response. Our experiments show that MAWS achieves a geometric-mean speedup of 96.4% over the baseline loose round-robin scheduler for cache-sensitive workloads on GPGPU-Sim. MAWS also achieves an overall speedup of 17.4% on CASLAB-GPUSIM.
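The sample-then-throttle idea in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's actual mechanism: the class name, sampling window, and stall-ratio thresholds are all assumptions introduced here for exposition. A sampler counts cycles in which warp issue is blocked on memory over a fixed window; if the stall ratio is high, the scheduler reduces the number of warps it may issue from (easing pressure on the caches and memory system), and if the ratio is low, it restores concurrency to preserve latency hiding.

```python
# Illustrative sketch of memory-stall sampling driving warp throttling.
# All names and threshold values are hypothetical, chosen for clarity.

class MemoryStallSampler:
    """Samples memory-stall cycles over a window and adjusts the number of
    warps the scheduler may issue from, balancing latency hiding against
    memory contention."""

    def __init__(self, max_warps, window=1000, high=0.5, low=0.2):
        self.max_warps = max_warps      # hardware warp limit per core
        self.active_warps = max_warps   # warps currently eligible to issue
        self.window = window            # sampling period, in cycles
        self.high = high                # stall ratio above which we throttle
        self.low = low                  # stall ratio below which we release
        self._cycles = 0
        self._stall_cycles = 0

    def tick(self, memory_stalled):
        """Call once per cycle; memory_stalled is True when issue is
        blocked on an outstanding memory access. Returns the number of
        warps the scheduler may currently choose from."""
        self._cycles += 1
        if memory_stalled:
            self._stall_cycles += 1
        if self._cycles == self.window:          # end of sampling window
            ratio = self._stall_cycles / self.window
            if ratio > self.high and self.active_warps > 1:
                self.active_warps -= 1           # contention: throttle down
            elif ratio < self.low and self.active_warps < self.max_warps:
                self.active_warps += 1           # headroom: throttle up
            self._cycles = 0
            self._stall_cycles = 0
        return self.active_warps
```

In a cycle-level simulator such as GPGPU-Sim or CASLAB-GPUSIM, a component like this would sit beside the warp scheduler, and the scheduler's candidate set would be capped at `active_warps` each cycle.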