
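Graduate student: 黃冠霖
Huang, Kuan-Lin
Thesis title: 支援共享虛擬位址空間之架構於CASLAB-GPU上實現
Architecture Support for Shared Virtual Address Space on CASLAB-GPU
Advisor: 陳中和
Chen, Chung-Ho
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of publication: 2018
Academic year of graduation: 106
Language: Chinese
Number of pages: 69
Keywords (Chinese): 繪圖處理器, 記憶體管理單元, 共享虛擬記憶體系統 (GPU, memory management unit, shared virtual memory system)
Keywords (English): GPU, Memory Management Unit, Shared Virtual Address Space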
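    The main driver behind the rapid growth of artificial intelligence is the computing power of general-purpose GPUs. In deep learning applications, data movement between the CPU and the GPU is a major bottleneck for both performance and power consumption. If the two share physical memory and the GPU includes a memory management unit (MMU) compatible with the CPU's, the performance and power overhead of data movement can be reduced effectively, so the benefit of designing an MMU for the GPU is clear.
    This thesis designs and explores an MMU architecture for the GPU. It adds a non-blocking translation lookaside buffer (TLB) to reduce the serialized wait time caused by the large number of concurrent requests in the GPU. The page table walker is provided as multiple units shared among several streaming multiprocessors, which raises thread-level parallelism. A page cache is also added and, based on the characteristics of page table accesses, is split into a top-level cache and a non-top-level cache: the non-top-level page cache uses a smaller line size to reduce memory access bandwidth, and the top-level cache uses a larger line size to exploit spatial locality. The three techniques are evaluated on the Gem5-GPU experimental platform with the CASLAB-GPU parameter configuration, and the resulting MMU architecture is integrated into the CASLAB-GPU simulation platform, giving the laboratory's platform a shared virtual address space.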
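    As a rough illustration of why caching upper-level page table entries reduces memory traffic during a page walk, the sketch below counts memory reads for a three-level walk. The three-level layout with 9 index bits per level and 4 KiB pages follows RISC-V Sv39 and is an assumption here, as are the names `PageWalker` and `vpn_indices` and the single LRU cache. It does not reproduce the split page cache with different line sizes described above.

```python
from collections import OrderedDict

PAGE_SHIFT = 12   # 4 KiB pages (assumed)
VPN_BITS = 9      # index bits per level, as in RISC-V Sv39 (assumed)
LEVELS = 3        # three-level page table (assumed)


def vpn_indices(vaddr):
    """Split a virtual address into per-level table indices, root level first."""
    vpn = vaddr >> PAGE_SHIFT
    mask = (1 << VPN_BITS) - 1
    return [(vpn >> (VPN_BITS * lvl)) & mask for lvl in reversed(range(LEVELS))]


class PageWalker:
    """Hypothetical walker that counts memory reads; upper levels are cached."""

    def __init__(self, cache_entries=16):
        self.cache = OrderedDict()       # (level, index prefix) -> True, LRU order
        self.cache_entries = cache_entries
        self.memory_reads = 0

    def walk(self, vaddr):
        idx = vpn_indices(vaddr)
        for level in range(LEVELS):
            key = (level, tuple(idx[:level + 1]))
            is_last = level == LEVELS - 1
            if not is_last and key in self.cache:
                self.cache.move_to_end(key)      # cache hit: no memory read
                continue
            self.memory_reads += 1               # read this entry from memory
            if not is_last:
                self.cache[key] = True
                if len(self.cache) > self.cache_entries:
                    self.cache.popitem(last=False)   # evict least recently used
```

    With this model, walking two addresses in neighbouring pages costs three reads for the first walk and one for the second, because the two upper-level entries are reused from the cache.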

    Data migration between the CPU and the GPU is a major source of performance and power overhead. The concept of a shared virtual address space has been proposed to address this problem. Implementing a unified address space requires considering both hardware and software. On the software side, we implement the OpenCL 2.0 SVM runtime API on CASLAB-GPUSIM. On the hardware side, we propose three memory management unit optimization techniques to improve GPGPU memory subsystem performance. The non-blocking TLB technique adds MSHRs to reduce the wait time of multiple concurrent requests. The shared coalesced page table walkers technique also targets multiple concurrent requests and aims to minimize them. We also propose a mechanism that divides the page cache into a non-leaf page cache and a leaf page cache, which reduces the bandwidth of requests to the bus. Finally, we implement these three optimization techniques on CASLAB-GPUSIM. The performance overhead compared to the original architecture is around 6.6%. If the performance overhead of the copy engine in the original architecture is taken into account, there is a 4.31% performance improvement.
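    The idea of merging concurrent misses in MSHRs can be sketched as follows. This is a minimal model under stated assumptions, not the thesis implementation: the class `NonBlockingTLB`, its methods, and the unbounded dictionary-based storage are all hypothetical, and timing is not modelled.

```python
class NonBlockingTLB:
    """Hypothetical TLB model: misses to the same page share one MSHR entry."""

    def __init__(self):
        self.entries = {}        # vpn -> ppn (translations already filled)
        self.mshr = {}           # vpn -> ids of requests waiting on that page
        self.walks_issued = 0    # page walks actually sent to the walker

    def lookup(self, req_id, vpn):
        """Return the ppn on a hit, or None if the request must wait."""
        if vpn in self.entries:
            return self.entries[vpn]
        if vpn in self.mshr:
            # A walk for this page is already outstanding: merge, no new walk.
            self.mshr[vpn].append(req_id)
        else:
            # First miss on this page: allocate an MSHR entry and start a walk.
            self.mshr[vpn] = [req_id]
            self.walks_issued += 1
        return None

    def fill(self, vpn, ppn):
        """Install a finished walk and return the requests it wakes up."""
        self.entries[vpn] = ppn
        return self.mshr.pop(vpn, [])
```

    In this model, the TLB keeps accepting lookups while walks are outstanding, and several requests that miss on the same page trigger only one page walk.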

    Abstract (Chinese)
    Summary
    Acknowledgements
    Table of Contents
    List of Figures
    Chapter 1 Introduction
      1.1 Motivation
      1.2 Contribution
      1.3 Organization
    Chapter 2 Background
      2.1 GPU Architecture
        2.1.1 GPU Microarchitecture
        2.1.2 GPU Memory Subsystem Introduction
      2.2 MMU Introduction
        2.2.1 Virtual Memory
        2.2.2 Address Translation
      2.3 Shared Virtual Address Space Introduction
        2.3.1 Unified Virtual Addressing
        2.3.2 Heterogeneous Uniform Memory Access (hUMA)
        2.3.3 OpenCL-2.0 Shared Virtual Memory
    Chapter 3 Related Work on GPGPU Memory Management Units
      3.1 CPU MMU Design
        3.1.1 Hardware MMU
        3.1.2 RISC-V CPU MMU
        3.1.3 Summary
      3.2 GPU MMU Design
        3.2.1 Translation Lookaside Buffer
        3.2.2 Page Table Walker
        3.2.3 Page Fault Handling
        3.2.4 Other GPU Micro-architecture
        3.2.5 Summary
    Chapter 4 GPU Memory Management Unit Optimization and Design
      4.1 Observation
        4.1.1 Concurrent Page Table Walking Observation
        4.1.2 Page Cache Access Observation
      4.2 GPGPU Memory Management Unit Architecture Design and Exploration
        4.2.1 Non-blocking TLB
        4.2.2 Shared Coalesced Multiple Page Table Walkers
        4.2.3 Non-leaf/Leaf Page Cache
        4.2.4 Hardware Cost Evaluation
      4.3 Experiment Platform Environment
      4.4 Experiment Result
        4.4.1 Non-blocking TLB
        4.4.2 Shared Coalesced Multiple Page Table Walkers
        4.4.3 Page Cache
        4.4.4 GPU MMU Experimental Analysis
    Chapter 5 Implementation of the CASLAB-GPU Memory Management Unit
      5.1 Platform Introduction
        5.1.1 Heterogeneous System Architecture (HSA)
        5.1.2 TensorFlow Application to OpenCL API
        5.1.3 OpenCL & HSA Runtime
        5.1.4 Device Driver
        5.1.5 Streaming Multiprocessor (SM)
      5.2 Implementation
        5.2.1 OpenCL SVM Runtime API
        5.2.2 RISC-V CPU Page Table Emulation
        5.2.3 GPU MMU Implementation
      5.3 Experiment Evaluation
        5.3.1 Simulation Environment and Benchmarks
        5.3.2 GPU MMU Performance
        5.3.3 Performance, Power and Area
      5.4 Research Limitations and Suggestions
    Chapter 6 Conclusion
    References


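    Full text, on campus: open from 2023-09-01
    Full text, off campus: not open
    The electronic thesis has not been authorized for public release; for the print copy, please consult the library catalog.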