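| Graduate student: | 鄭基漢 Jheng, Jhi-Han |
|---|---|
| Thesis title: | 時序精確SIMT核心設計與實作 Design of Cycle-accurate SIMT Core and Implementation |
| Advisor: | 陳中和 Chen, Chung-Ho |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering |
| Year of publication: | 2018 |
| Academic year of graduation: | 107 |
| Language: | Chinese |
| Pages: | 62 |
| Chinese keywords: | 通用繪圖處理器 (general-purpose graphics processing unit), 時序精確模組 (cycle-accurate model) |
| English keywords: | GPGPU, Cycle-accurate Model |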
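In today's high-performance computing, GPUs are used to accelerate non-graphics applications. Parallel algorithms and deep learning applications alike rely on GPUs for acceleration, so GPU design and implementation play an important role in the development of computing systems. Developing a GPU computing system is a complex process, however: the hardware and the software must both be in place before the whole platform can be verified. The TLM (transaction-level modeling) methodology overcomes this obstacle. Its incremental development flow starts from highly abstract hardware modules, builds a prototype of the software system, and integrates and verifies hardware and software together early in development, then progressively realizes more concrete, lower-abstraction hardware and software.
Cycle-accurate is one of the abstraction levels defined by TLM; it requires describing the behavior of a hardware module at every clock edge. With a cycle-accurate specification, designers can develop lower-abstraction hardware modules from the functional model of the hardware. This thesis examines a design method for cycle-accurate models and applies it to describe in detail the design of the cycle-accurate SIMT core inside a GPU. It discusses how to specify a basic cycle-accurate model, analyzes and lists the functional requirements of the SIMT core architecture, and presents the microarchitecture-level hardware module design and its performance metrics. Finally, integration tests on the CASLAB-GPUSim platform are used to analyze the performance metrics, investigate performance bottlenecks, and compare performance against other low-end computing systems. Test programs with good parallelism achieve a 4.7 to 20.1 times speedup, and when the GPU clock is raised to 1.2 GHz, GEMM achieves a 52.6 times speedup.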
Developing a GPU computing platform requires both software and hardware development. To manage this complexity, the TLM methodology builds the system through an incremental development process, which makes verification and validation possible at an early development stage. The cycle-accurate model, the most detailed functional model in TLM, describes the behavior of a module at each clock edge and is used to implement hardware modules that can be carried over to RTL. We develop a cycle-accurate SIMT core using a basic cycle-accurate modeling approach and evaluate its performance on the CASLAB-GPUSim co-simulation platform. A performance comparison between a low-end GPU and an embedded CPU running at 1.2 GHz shows that the low-end GPU achieves a 4.7 to 20.1 times speedup on test cases with good parallelism. When the low-end GPU is clocked at 1.2 GHz, it achieves a 52.6 times speedup on GEMM, the most time-consuming operation in deep learning applications.
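To make "describing the behavior of a module at each clock edge" concrete, the following is a minimal C++ sketch of one common way to write a cycle-accurate model: every register holds a current and a next value, logic reads only current values during a cycle, and all registers commit together at the clock edge. This is an illustration under stated assumptions, not the thesis's SIMT core or its modeling code; the names `Reg`, `TwoStagePipeline`, `tick`, and `cycles_until_output` are hypothetical, and the "execute" stage is a stand-in that adds one.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Illustrative only: a hypothetical two-stage pipeline, not the thesis's SIMT core.

// A clocked register: logic reads `cur` during a cycle and writes `next`;
// `next` becomes visible only when the clock edge commits it.
template <typename T>
struct Reg {
    T cur{};
    T next{};
    void commit() { cur = next; }
};

struct TwoStagePipeline {
    Reg<std::optional<int>> fetch_out;    // latch between stage 1 and stage 2
    Reg<std::optional<int>> execute_out;  // latch at the pipeline output
    std::uint64_t cycle = 0;

    // Advance the model by exactly one clock cycle.
    void tick(std::optional<int> input) {
        // Evaluate phase: every stage reads only `cur` values, so the
        // order in which the stages are evaluated does not matter.
        fetch_out.next = input;
        if (fetch_out.cur.has_value()) {
            execute_out.next = *fetch_out.cur + 1;  // stand-in "execute" work
        } else {
            execute_out.next = std::nullopt;        // pipeline bubble
        }
        // Clock edge: all registers update together.
        fetch_out.commit();
        execute_out.commit();
        ++cycle;
    }
};

// Feed one value, then bubbles, and return the cycle count at which the
// result first appears at the pipeline output.
inline std::uint64_t cycles_until_output(int value) {
    TwoStagePipeline p;
    p.tick(value);
    while (!p.execute_out.cur.has_value()) {
        p.tick(std::nullopt);
        assert(p.cycle < 16);  // guard against a runaway loop in the sketch
    }
    return p.cycle;
}
```

Because each stage writes only `next` and reads only `cur`, the latency of the model is fixed by the number of registers on the path rather than by the order of statements inside `tick`, which is the property that lets such a model be compared cycle by cycle against a lower-level implementation.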