| Graduate student: | 許冠傑 Hsu, Kuan-Chieh |
|---|---|
| Thesis title: | HSA繪圖處理器之效能預測模型 Performance Prediction Model on HSA-Compatible General-Purpose GPU System |
| Advisor: | 陳中和 Chen, Chung-Ho |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering |
| Year of publication: | 2016 |
| Academic year: | 104 |
| Language: | English |
| Number of pages: | 66 |
| Keywords (Chinese): | 記憶體系統, 多核模擬平台, 預測模型 |
| Keywords (English): | Memory system, Multicore simulation, Prediction model |
First, this thesis presents a fully implemented memory subsystem for the custom general-purpose GPU (GPGPU) architecture previously developed in our laboratory. To enable rapid chip realization, we extended the earlier C++ simulator with timing behavior at the early design stage, keeping the memory traffic modeling complete and detailed while preserving fast simulation time, since memory traffic accounts for most of the program simulation time. For example, a level-one cache access request sent onto the network-on-chip (NoC) experiences a non-constant latency, and the choices of cache coherence protocol and memory scheduler also affect the latency seen by the architecture's cores. For memory space allocation, we examine both coarse-grain and fine-grain distribution methods. In the NoC module, we discuss why the mesh topology is geometrically robust and adopt this topology.

The other contribution of this thesis is a machine-learning performance prediction model. Using k-means and SVM models, we analyze, for each benchmark, which parameter values yield the best performance across all possible hardware configurations, and we can further predict the hardware parameters under which an untested program will achieve its peak performance. The k-means algorithm clusters the performance results of all benchmarks into groups with similar characteristics, which serve as reference templates for the prediction model. The SVM model is then trained on memory-system measurements to produce the final predictor. Since we claim that program performance is dominated by the memory subsystem, analyzing only such features achieves, under an eight-cluster setting, at least 46.48% of test points falling within 10% error; when the number of clusters is varied, up to 57.97% of test points fall within 10% error. Finally, we find that the best performance does not necessarily occur with the maximum hardware resources, because congestion in the NoC traffic often interferes.

Combining the memory subsystem development and the prediction model, we aim to provide a reliable and accurate early-stage development platform, so that future chip implementations can draw on these performance studies and be completed quickly.
In this thesis, we present the memory subsystem of a customized general-purpose GPU architecture. For fast development, the C++ simulated architecture must be kept lightweight while remaining timing-accurate, since most of the benchmark simulation time comes from memory-subsystem-related latencies. For example, a level-one cache miss triggers network-on-chip (NoC) traffic, and the cache coherence protocol and memory controller scheduling policy also affect the latency seen by the streaming multiprocessors in this GPGPU architecture. We also discuss memory space partitioning methods in a following section, covering both coarse-grain and fine-grain partitioning. For the NoC module, we adopt previous research from our group and discuss the geometric features of the chosen topology, the mesh structure, which we select for its robustness.
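The coarse-grain versus fine-grain partitioning contrast can be illustrated with a small address-interleaving sketch. This is not the thesis implementation; the channel count and both granularities are assumed values chosen only to show how interleaving granularity changes which memory channel consecutive cache lines land on.

```python
# Illustrative sketch (assumed parameters, not the thesis's actual design):
# map a physical address to one of N memory channels under two granularities.

NUM_CHANNELS = 8
FINE_GRAIN = 64      # interleave at cache-line size (bytes)
COARSE_GRAIN = 4096  # interleave at page size (bytes)

def channel_of(addr: int, grain: int, channels: int = NUM_CHANNELS) -> int:
    """Select a memory channel by interleaving the address space at `grain` bytes."""
    return (addr // grain) % channels

# Fine-grain interleaving spreads consecutive cache lines across all channels,
# while coarse-grain interleaving keeps a whole page on a single channel.
addrs = [i * 64 for i in range(16)]               # 16 consecutive cache lines
fine = [channel_of(a, FINE_GRAIN) for a in addrs]
coarse = [channel_of(a, COARSE_GRAIN) for a in addrs]
```

With these assumed numbers, the fine-grain mapping cycles through every channel, while all sixteen lines of the coarse-grain mapping stay on channel 0: fine-grain partitioning balances bandwidth across channels, whereas coarse-grain partitioning keeps a page's traffic local to one channel.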
Another contribution of this work is the use of two machine learning models to predict architecture performance and depict the performance trend across a large set of hardware configurations. We aim to estimate a reasonable summit value on the performance surface with the following procedure. First, the k-means algorithm clusters the training benchmarks into a chosen number of clusters. A multi-class support vector machine (SVM) model is then trained on memory-related features only. During the validation phase, the summit performance values of the testing benchmarks are predicted from the training results. Under an eight-cluster setting, 46.48% of the predicted cycle counts across all tested benchmarks are within 10% error of the real performance values; by varying the number of clusters, up to 57.97% of the points fall within 10% error. We also show that the summit performance does not necessarily occur at the maximum hardware resources: our discussion points out memory traffic issues that significantly slow down certain access patterns in the benchmarks.
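The two-stage flow above can be sketched as follows. This is a minimal illustration using scikit-learn in place of the thesis's tooling; the benchmark counts, feature dimensions, and random data are synthetic, and only the shape of the pipeline (cluster performance curves, then classify unseen benchmarks from memory features alone) reflects the described method.

```python
# Sketch of the kmeans-then-SVM prediction flow; all data here is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stage 1: cluster training benchmarks by their performance results across
# hardware configurations, grouping benchmarks with similar trends.
perf_curves = rng.random((40, 12))   # 40 benchmarks x 12 hardware configs
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(perf_curves)

# Stage 2: train a multi-class SVM to predict a benchmark's cluster from
# memory-subsystem features only (e.g. miss rates, NoC traffic counts).
mem_features = rng.random((40, 5))   # 5 memory-related features per benchmark
svm = SVC(kernel="rbf").fit(mem_features, kmeans.labels_)

# For an unseen benchmark, predict its cluster; the cluster's known best
# hardware configuration then serves as the predicted optimum.
new_bench = rng.random((1, 5))
cluster = int(svm.predict(new_bench)[0])
```

The design choice this sketch captures is that the expensive full sweep of hardware configurations is only run for the training benchmarks; a new program needs just its cheap memory-feature measurements to be matched to a cluster whose performance surface is already known.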
Combining these contributions, we aim to provide a reliable and accurate early-stage simulation platform that enables efficient future IC implementation.