National Cheng Kung University Theses and Dissertations System


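Graduate student:	Huang, Kuan-Lin (黃冠霖)
Thesis title:	Architecture Support for Shared Virtual Address Space on CASLAB-GPU (支援共享虛擬位址空間之架構於CASLAB-GPU上實現)
Advisor:	Chen, Chung-Ho (陳中和)
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of publication:	2018
Academic year of graduation:	106 (ROC calendar)
Language:	Chinese
Number of pages:	69
Chinese keywords (translated):	graphics processing unit, memory management unit, shared virtual memory system
English keywords:	GPU, Memory Management Unit, Shared Virtual Address Space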

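The main force behind the rapid growth of artificial intelligence is the computing power of general-purpose GPUs. In deep learning applications, data movement between the CPU and the GPU is a major performance and power bottleneck. If the two share physical memory and the GPU includes a memory management unit (MMU) compatible with the CPU's, the performance and power cost of data movement can be reduced effectively, so the benefit of designing an MMU for the GPU is clear.
This thesis designs and explores the architecture of a GPU MMU. A non-blocking translation lookaside buffer (TLB) is added to reduce the serialized wait time caused by the many concurrent requests in a GPU. Page table walkers are provided as multiple units shared among the streaming multiprocessors, which raises thread-level parallelism. A page cache is also added and, based on how the page table is accessed, split into a leaf and a non-leaf cache: the non-leaf page cache uses a smaller line size to reduce memory access bandwidth, while the leaf cache uses a larger line size to exploit spatial locality. These three techniques are evaluated on the gem5-gpu experimental platform with a CASLAB-GPU parameter configuration, and the MMU architecture is integrated into the CASLAB-GPU simulation platform, giving the lab's platform a shared virtual memory address space.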

Data migration between the CPU and the GPU is a major source of performance and power overhead. The concept of a shared virtual address space has been proposed to address this problem. Implementing a unified address space requires support from both hardware and software. For software, we implement the OpenCL 2.0 SVM runtime API on CASLAB-GPUSIM. For hardware, we propose three memory management unit optimization techniques to improve GPGPU memory subsystem performance. The non-blocking TLB technique adds MSHRs to reduce the wait time of multiple concurrent requests. The shared coalesced page table walker technique likewise aims to minimize the cost of multiple concurrent requests. We also propose a mechanism that divides the page cache into a non-leaf page cache and a leaf page cache, which reduces the bandwidth of requests to the bus. Finally, we implement these three optimization techniques on CASLAB-GPUSIM. The performance overhead compared to the original architecture is around 6.6%. If the performance overhead of the copy engine in the original architecture is taken into account, there is a 4.31% performance improvement.
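
The non-blocking TLB idea in the summary above can be illustrated with a minimal sketch. It is not the thesis implementation: the class name, the 4 KiB page size, the MSHR count, and the return labels are all assumptions chosen for illustration. A miss allocates a miss status holding register (MSHR) and starts one page table walk, and later misses to the same page are merged into that MSHR instead of stalling or starting another walk.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


class NonBlockingTLB:
    """Toy TLB that tracks outstanding misses in MSHRs (hypothetical model)."""

    def __init__(self, num_mshrs=4):
        self.entries = {}        # virtual page number -> physical frame number
        self.mshrs = {}          # virtual page number -> waiting request ids
        self.num_mshrs = num_mshrs
        self.walks_started = 0   # number of page table walks issued

    def access(self, req_id, vaddr):
        """Classify one request as 'hit', 'merged', 'miss' or 'stall'."""
        vpn = vaddr >> PAGE_SHIFT
        if vpn in self.entries:
            return "hit"
        if vpn in self.mshrs:
            # A walk for this page is already in flight: just wait on it.
            self.mshrs[vpn].append(req_id)
            return "merged"
        if len(self.mshrs) >= self.num_mshrs:
            # No free MSHR, so this request has to be retried later.
            return "stall"
        self.mshrs[vpn] = [req_id]
        self.walks_started += 1
        return "miss"

    def walk_done(self, vpn, pfn):
        """Fill the TLB and return every request that waited on this page."""
        self.entries[vpn] = pfn
        return self.mshrs.pop(vpn)
```

In this sketch, two requests to addresses 0x1000 and 0x1008 fall in the same page, so only the first one triggers a walk; calling `walk_done` for that page releases both at once.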

Abstract (Chinese)	I
Summary	II
Acknowledgements	V
Table of Contents	VI
List of Figures	X
Chapter 1 Introduction	1
1 Motivation	1
2 Contribution	2
3 Organization	2
Chapter 2 Background	3
1 GPU Architecture	3
1.1 GPU Microarchitecture	3
1.2 GPU Memory Subsystem Introduction	5
2 MMU Introduction	6
2.1 Virtual Memory	6
2.2 Address Translation	7
3 Shared Virtual Address Space Introduction	9
3.1 Unified Virtual Addressing	10
3.2 Heterogeneous Uniform Memory Access (hUMA)	10
3.3 OpenCL-2.0 Shared Virtual Memory	11
Chapter 3 Related Work on GPGPU Memory Management Units	14
1 CPU MMU Design	14
1.1 Hardware MMU	14
1.2 RISC-V CPU MMU	15
1.3 Summary	17
2 GPU MMU Design	17
2.1 Translation Lookaside Buffer	18
2.2 Page Table Walker	18
2.3 Page Fault Handling	19
2.4 Other GPU Micro-architecture	20
2.5 Summary	20
Chapter 4 GPU Memory Management Unit Optimization and Design	21
1 Observation	21
1.1 Concurrent Page Table Walking Observation	21
1.2 Page Cache Access Observation	22
2 GPGPU Memory Management Unit Architecture Design and Exploration	23
2.1 Non-blocking TLB	23
2.2 Shared Coalesced Multiple Page Table Walkers	25
2.3 Non-leaf/Leaf Page Cache	25
2.4 Hardware Cost Evaluation	27
3 Experiment Platform Environment	28
4 Experiment Result	29
4.1 Non-blocking TLB	29
4.2 Shared Coalesced Multiple Page Table Walkers	31
4.3 Page Cache	33
4.4 GPU MMU Experimental Analysis	34
Chapter 5 Implementation of the CASLAB-GPU Memory Management Unit	38
1 Platform Introduction	38
1.1 Heterogeneous System Architecture (HSA)	39
1.2 TensorFlow Application to OpenCL API	39
1.3 OpenCL & HSA Runtime	40
1.4 Device Driver	40
1.5 Streaming Multiprocessor (SM)	41
2 Implementation	42
2.1 OpenCL SVM Runtime API	42
2.2 RISC-V CPU Page Table Emulation	43
2.3 GPU MMU Implementation	45
3 Experiment Evaluation	47
3.1 Simulation Environment and Benchmarks	47
3.2 GPU MMU Performance	50
3.3 Performance, Power and Area	61
4 Limitations and Suggestions	64
Chapter 6 Conclusion	66
References	67
                                    


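Off campus: not available. Neither the electronic nor the print copy of this thesis has been authorized for public release.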
