| Author: | Wang, Yu-Hsiang (王昱翔) |
|---|---|
| Thesis title: | CASLAB-GPU在FPGA開發板上之驗證與其執行緒排程和子記憶體架構優化 (CASLAB-GPU Verification on FPGA and Optimization of Warp Scheduling and Memory Subsystem) |
| Advisor: | Chen, Chung-Ho (陳中和) |
| Degree: | Master |
| Department: | Institute of Computer & Communication Engineering, College of Electrical Engineering and Computer Science |
| Publication year: | 2021 |
| Academic year: | 109 |
| Language: | Chinese |
| Pages: | 87 |
| Keywords: | GPGPU, FPGA, Cache policy, Warp scheduling |
In recent years, fields such as artificial intelligence, machine learning, and image recognition have grown increasingly active, and object detection and tracking are widely used in everyday life. With the rise of the IoT, small yet computationally fast edge-computing devices are needed, so general-purpose graphics processing units (GPGPUs), which gain speed through highly parallel computation, have come into wide use and suit AI-related workloads. Our laboratory has developed the electronic-system-level CASLAB-GPU, which targets edge computing, conforms to the OpenCL/TensorFlow API specifications, and forms a complete software and hardware system.
Taking the CASLAB-GPU as its reference design, this thesis develops the corresponding GPU register-transfer-level (RTL) model and verifies it with the Universal Multi-Resource Bus (UMRBus) and Verilog Programming Language Interface (PLI) provided by the FPGA vendor. The UMRBus mechanism bridges communication between software and hardware, while the PLI provides the means to run the RTL design inside a simulator. In the end, OpenCL programs can execute in the GPU RTL simulation environment, providing an important reference for eventual deployment on an actual FPGA board.
In addition, this thesis ports two optimization mechanisms that senior lab members developed on GPGPU-Sim, the Write Pseudo Allocate Cache Policy (WPAP) and Memory-Contention Aware Warp Scheduling (MAWS), to the current CASLAB-GPU platform. WPAP designs the cache policy around the characteristics of GPU memory read and write addresses, while MAWS observes memory contention through dynamic sampling and adjusts the degree of warp parallelism accordingly. Experimental results show that WPAP improves performance by 66% and lowers the cache miss rate by 20%, while MAWS improves performance by 50% and lowers the cache miss rate by about 11%, demonstrating that these optimizations remain effective even at the cycle-accurate level and raise overall performance.
This thesis is divided into two parts: CASLAB-GPU verification on FPGA, and optimization of warp scheduling and the memory subsystem. In the verification part, we design our GPU RTL program and use the Universal Multi-Resource Bus (UMRBus) provided by the FPGA vendor to build the communication system between the OpenCL application and the GPU hardware. Finally, we use the Programming Language Interface (PLI) to simulate the UMRBus in a software-only environment.
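The vendor's UMRBus PLI itself is proprietary, but the bridging idea can be sketched with the standard Verilog VPI: a C/C++ routine is registered as a system task that the simulated RTL can invoke, giving host software a window into the running hardware model. This is a minimal sketch; the task name `$umr_read` and the surrounding function names are illustrative assumptions, not the vendor's API.

```cpp
// Minimal VPI sketch of a software/RTL bridge. "$umr_read" is a
// hypothetical task name standing in for the proprietary UMRBus PLI.
#include <vpi_user.h>

// Called whenever the RTL testbench executes $umr_read(some_reg):
// it samples the register's value into software.
static PLI_INT32 umr_read_calltf(PLI_BYTE8* /*user_data*/) {
    vpiHandle call = vpi_handle(vpiSysTfCall, NULL);
    vpiHandle args = vpi_iterate(vpiArgument, call);
    vpiHandle reg  = vpi_scan(args);      // first argument: RTL signal

    s_vpi_value val;
    val.format = vpiIntVal;
    vpi_get_value(reg, &val);             // pull the RTL value into software
    vpi_printf((PLI_BYTE8*)"umr_read: 0x%x\n", val.value.integer);
    return 0;
}

static void register_umr_tasks() {
    s_vpi_systf_data tf = {};             // zero all fields first
    tf.type   = vpiSysTask;
    tf.tfname = (PLI_BYTE8*)"$umr_read";
    tf.calltf = umr_read_calltf;
    vpi_register_systf(&tf);
}

// Simulators scan this table at startup to register user PLI tasks.
void (*vlog_startup_routines[])() = { register_umr_tasks, nullptr };
```

In the real flow, the same kind of hook would carry whole OpenCL command and data transactions rather than single register reads.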
In the optimization part, we implement the Write Pseudo Allocate Policy (WPAP) and Memory-Contention Aware Warp Scheduling (MAWS). WPAP focuses on eliminating useless data reads from external memory to minimize bus traffic, while MAWS strikes a balance between memory workload and available memory resources.
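As a rough illustration of how a write pseudo-allocate policy avoids those useless reads, consider the following minimal sketch: on a write miss the line is claimed and a per-byte valid mask records what was written, so nothing is fetched from DRAM; only a later read of uncovered bytes forces a fill. The 64-byte line, mask layout, and names are our assumptions for illustration, not the thesis's RTL.

```cpp
// Sketch of write pseudo-allocate for one cache line.
// Illustrative only: structure names and the 64-byte line are assumptions.
#include <array>
#include <cstdint>

constexpr int kLineBytes = 64;

struct CacheLine {
    uint64_t tag   = 0;
    bool     valid = false;
    bool     dirty = false;
    std::array<uint8_t, kLineBytes> data{};
    uint64_t byte_mask = 0;   // bit i set => byte i holds written data
};

// Write path: a write miss claims the line but fetches nothing from DRAM.
void write_byte(CacheLine& line, uint64_t tag, int offset, uint8_t v) {
    if (!line.valid || line.tag != tag) {  // write miss
        line.tag = tag;                    // pseudo-allocate: no DRAM fill
        line.valid = true;
        line.byte_mask = 0;
    }
    line.data[offset] = v;
    line.byte_mask |= 1ull << offset;
    line.dirty = true;
}

// Read path: only mask-covered bytes can be served; anything else must
// be filled from DRAM and merged around the already-written bytes.
bool read_byte(const CacheLine& line, uint64_t tag, int offset, uint8_t* out) {
    if (line.valid && line.tag == tag && ((line.byte_mask >> offset) & 1)) {
        *out = line.data[offset];
        return true;    // hit: no external-memory traffic
    }
    return false;       // caller fetches the line and merges
}
```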
Our results on the PolyBench benchmarks on the CASLAB-GPU platform show that WPAP combined with a hash-function technique yields a 66% speedup and a 20% lower miss rate than a write-back, write-allocate cache policy, and that MAWS yields a 50% speedup over Loose Round-Robin (LRR) warp scheduling.
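The contention-driven throttling behind MAWS can be sketched in a similarly hedged way: sample a memory-contention signal over a fixed window, then raise or lower the cap on concurrently schedulable warps. The window length, thresholds, and step size below are illustrative assumptions, not the thesis's tuned values.

```cpp
// Sketch of sampling-based warp throttling in the spirit of MAWS.
// All constants are illustrative, not taken from the thesis.
#include <algorithm>

struct WarpThrottle {
    int  max_warps;          // hardware warp limit per core
    int  active_warps;       // current cap the scheduler may issue from
    long stall_cycles  = 0;  // cycles the memory pipeline was stalled
    long window_cycles = 0;  // cycles elapsed in this sampling window

    explicit WarpThrottle(int hw_max)
        : max_warps(hw_max), active_warps(hw_max) {}

    // Call once per cycle with the observed memory-stall signal.
    void tick(bool memory_stalled) {
        if (memory_stalled) ++stall_cycles;
        if (++window_cycles < 10000) return;   // window not finished yet

        double stall_ratio = double(stall_cycles) / double(window_cycles);
        if (stall_ratio > 0.5)                 // heavy contention: throttle
            active_warps = std::max(1, active_warps - 2);
        else if (stall_ratio < 0.2)            // resources idle: widen
            active_warps = std::min(max_warps, active_warps + 2);

        stall_cycles = window_cycles = 0;      // start a new window
    }
};
```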
On-campus access: available from 2026-01-29.