
Graduate Student: Wang, Yu-Hsiang
Thesis Title: CASLAB-GPU Verification on FPGA and Optimization of Warp Scheduling and Memory Subsystem
Advisor: Chen, Chung-Ho
Degree: Master
Department: College of Electrical Engineering & Computer Science, Institute of Computer & Communication Engineering
Publication Year: 2021
Academic Year: 109
Language: Chinese
Pages: 87
Keywords: GPGPU, FPGA, Cache policy, Warp scheduling
    In recent years, artificial intelligence, machine learning, and image recognition have become increasingly active fields, and object detection and object tracking are widely applied in daily life. With the rise of the IoT, lightweight yet computationally fast edge-computing devices are needed, so general-purpose GPUs (GPGPUs) have come into wide use: their highly parallel computation delivers the speed that AI-related workloads demand. Our laboratory has developed the electronic-system-level CASLAB-GPU, targeting edge computing and conforming to the OpenCL/TensorFlow API specifications, as a complete software and hardware system.
    Taking the CASLAB-GPU as the reference model, this thesis develops the corresponding GPU Register-Transfer Level (RTL) design and verifies it with the Universal Multi-Resource Bus (UMRBus) and the Verilog Programming Language Interface (PLI) provided by the FPGA vendor. The UMRBus mechanism bridges communication between software and hardware, while the PLI provides the tooling that turns the RTL into a simulator for executing the hardware. In the end, OpenCL programs can run on the GPU RTL simulation environment, providing an important reference for future deployment on an actual FPGA board.
    In addition, this thesis implements on the current CASLAB-GPU platform two optimization mechanisms previously developed by our laboratory on GPGPU-Sim: the Write Pseudo Allocate Cache Policy (WPAP) and Memory-Contention Aware Warp Scheduling (MAWS). WPAP designs the cache policy around the characteristics of GPU memory read/write addresses, while MAWS dynamically samples memory contention and adjusts the degree of warp parallelism accordingly. Experimental results show that WPAP improves performance by 66% and reduces the cache miss rate by 20%, and that MAWS improves performance by 50% and reduces the cache miss rate by about 11%, confirming that these optimizations remain effective and improve overall performance even at the cycle-accurate level.

    This thesis is divided into two parts: CASLAB-GPU verification on FPGA, and optimization of warp scheduling and the memory subsystem. In the verification part, we design our GPU RTL program and use the Universal Multi-Resource Bus (UMRBus) provided by the FPGA vendor to build the communication system between the OpenCL application and the GPU hardware. Finally, we use the Programming Language Interface (PLI) to simulate the UMRBus in a software-only environment.

    In the optimization part, we implement the Write Pseudo Allocate Policy (WPAP) and Memory-Contention Aware Warp Scheduling (MAWS). The WPAP technique focuses on eliminating useless data reads from external memory to minimize bus traffic. The MAWS mechanism strikes a balance between memory workload and memory resources.
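    The core idea behind a write-pseudo-allocate policy can be sketched as follows. This is a simplified illustration only, with hypothetical class and field names (not the thesis implementation): on a write miss, the cache line is allocated without fetching the block from external memory, and a per-byte valid mask records which bytes actually hold written data, so the useless external read is avoided.

```python
# Sketch of a "write pseudo allocate" style cache line (hypothetical names;
# not the thesis RTL). A write miss allocates the line locally instead of
# fetching the block from external memory; a per-byte valid mask tracks
# which bytes contain real data.

LINE_SIZE = 64  # bytes per cache line (assumed)

class PseudoAllocLine:
    def __init__(self, tag):
        self.tag = tag
        self.data = bytearray(LINE_SIZE)
        self.valid_mask = [False] * LINE_SIZE  # per-byte validity
        self.dirty = False

    def write(self, offset, value_bytes):
        # Writes fill the line locally -- no external memory read is issued.
        for i, b in enumerate(value_bytes):
            self.data[offset + i] = b
            self.valid_mask[offset + i] = True
        self.dirty = True

    def read(self, offset, length):
        # A read hits only if every requested byte was previously written;
        # otherwise the block must still be fetched from external memory.
        if all(self.valid_mask[offset:offset + length]):
            return bytes(self.data[offset:offset + length])
        return None  # caller fetches from memory and merges with valid bytes

line = PseudoAllocLine(tag=0x1A)
line.write(0, b"\x11\x22\x33\x44")
hit = line.read(0, 4)    # locally written bytes: served without a memory read
miss = line.read(4, 4)   # untouched bytes: None, so memory must be accessed
```

    The valid mask is what makes the allocation "pseudo": the line exists in the cache, but only the written bytes are trusted, so write-dominated access patterns never pay for fetching data they are about to overwrite.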

    Our results on the PolyBench benchmarks on the CASLAB-GPU platform show that WPAP with the hash-function technique yields a 66% speedup and a 20% lower miss rate compared to the write-back, write-allocate cache policy, and that the MAWS technique yields a 50% speedup compared to Loose Round Robin (LRR) warp scheduling.
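    The MAWS mechanism described above is essentially a feedback loop. The sketch below is a deliberate simplification with invented names and thresholds (not the thesis's algorithm): periodically sample a memory-contention metric, such as the fraction of cycles the load/store path stalls, and lower or raise the number of warps eligible for scheduling.

```python
# Simplified feedback loop in the spirit of memory-contention aware warp
# throttling (hypothetical thresholds and names; not the thesis algorithm).

MAX_WARPS = 32          # hardware warp slots (assumed)
HIGH_CONTENTION = 0.6   # sampled stall ratio above which we throttle
LOW_CONTENTION = 0.2    # stall ratio below which we allow more warps

def adjust_active_warps(active_warps, stall_ratio):
    """Return the new number of warps eligible for scheduling, given the
    memory-stall ratio observed over the last sampling window."""
    if stall_ratio > HIGH_CONTENTION and active_warps > 1:
        return active_warps - 1      # memory saturated: reduce parallelism
    if stall_ratio < LOW_CONTENTION and active_warps < MAX_WARPS:
        return active_warps + 1      # memory has headroom: add a warp
    return active_warps              # within the comfort band: hold steady

# Example: contention spikes, then subsides.
warps = MAX_WARPS
for ratio in [0.8, 0.7, 0.3, 0.1]:
    warps = adjust_active_warps(warps, ratio)
print(warps)  # -> 31
```

    Throttling warps when memory is saturated reduces cache thrashing and queueing in the memory subsystem, which is why fewer active warps can paradoxically run faster on memory-bound kernels.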

    Table of Contents
    Abstract (Chinese) I
    Summary II
    Acknowledgments IX
    Table of Contents X
    List of Tables XIII
    List of Figures XIV
    Chapter 1 Introduction 1
      1.1 Motivation 2
      1.2 Contributions 3
      1.3 Thesis Organization 3
    Chapter 2 Background and Related Work 4
      2.1 Runtime System 4
        2.1.1 OpenCL Runtime 4
        2.1.2 OpenCL Runtime API 5
        2.1.3 HSA Runtime 6
      2.2 CASLAB-GPU Internal Architecture 8
        2.2.1 Streaming Multiprocessor (SM) 10
        2.2.2 Workgroup Initializer (WG) 10
        2.2.3 Instruction Fetch (IF) 11
        2.2.4 Instruction Cache (IC) 11
        2.2.5 Instruction Buffer (IB) 12
        2.2.6 Dependency Check Unit (DCU) 12
        2.2.7 Warp Scheduler (WS) 13
        2.2.8 Register Arbitrator (RA) 13
        2.2.9 Register Bank (RB) 13
        2.2.10 Register Crossbar (RC) 14
        2.2.11 Operand Collect Unit (OCU) 14
        2.2.12 Divergence Stack (DS) 15
        2.2.13 Warp Dispatcher (WD) 16
        2.2.14 Execution Unit (EXE) 16
        2.2.15 Load/Store Unit (LSU) 17
        2.2.16 Local Memory (LM) 17
        2.2.17 Write-back Unit (WB) 18
        2.2.18 Data Cache (DC) 18
      2.3 Cache Architecture 19
      2.4 Warp Scheduling 21
        2.4.1 Loose Round Robin (LRR) 21
        2.4.2 Greedy-Then-Oldest (GTO) 22
        2.4.3 Two-Level (TL) 23
    Chapter 3 Warp Scheduling and Memory Subsystem Optimization 24
      3.1 Platform Comparison 24
      3.2 Profiler 28
        3.2.1 Stall Factor 29
      3.3 Memory-Contention Aware Warp Scheduling (MAWS) 31
        3.3.1 Application Observations 31
        3.3.2 Impact of Memory Contention on Scheduling 33
        3.3.3 MAWS-α 34
        3.3.4 MAWS-β 38
        3.3.5 MAWS Performance Across Platforms 39
      3.4 Write Pseudo Allocate Cache Policy 43
        3.4.1 Application Observations 43
        3.4.2 Causes of Poor Cache Read Performance 44
        3.4.3 Cache Optimization Techniques 46
    Chapter 4 CASLAB-GPU Verification on the FPGA Board 49
      4.1 Platform Introduction 49
      4.2 ESL Full System Design Methodology 50
      4.3 FPGA GPU Hardware Verification 51
      4.4 UMRBus Communication System 56
      4.5 UMRBus PLI Interface 65
    Chapter 5 Experimental Results and Performance Evaluation 70
      5.1 Experimental Platform 70
      5.2 Results and Analysis 71
        5.2.1 CASLAB-GPU Different Configurations 71
        5.2.2 Cache: Write Pseudo Allocate Policy 76
        5.2.3 Memory-Contention Aware Warp Scheduling 79
    Chapter 6 Conclusion 84
    References 85


    Available on campus: 2026-01-29
    Available off campus: 2026-01-30
    The electronic thesis has not yet been authorized for public release; for the print copy, please consult the library catalog.