
Graduate Student: Tsou, Tsung-Han (鄒宗翰)
Thesis Title: Optimization of Stride Prefetching Mechanism and Dependent Warp Scheduling on GPGPU
Advisor: Chen, Chung-Ho (陳中和)
Degree: Master
Department: Institute of Computer & Communication Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2019
Graduation Academic Year: 108
Language: Chinese
Pages: 93
Keywords: Data Prefetching, GPGPU, Warp Scheduling
Abstract:

    As research fields such as artificial intelligence and image recognition grow increasingly popular, general-purpose graphics processing units (GPGPUs) are being widely applied to these non-traditional graphics workloads. Our laboratory platform, CASLAB-GPUSIM, is a GPGPU platform that supports frameworks such as TensorFlow and OpenCL, but it does not yet include data prefetching as a memory optimization. Since the overhead caused by memory accesses has long been one of the bottlenecks of GPGPUs, this thesis investigates and explores this direction.
    This thesis first experiments on CASLAB-GPUSIM with common data prefetching methods such as Next-Line (NL) and Spatial Locality Detection (SLD) prefetching. Based on the phenomena observed in these experiments, we propose History-Awoken Stride (HAS) prefetching, built on stride prefetching, together with a companion Prefetched-Then-Executed (PTE) scheduling mechanism. HAS identifies regular patterns from the preceding memory access requests and prefetches the data that will be needed later; through the design of a history table and a per-warp status, it reuses past access information and controls the prefetching progress of each warp. PTE then uses the prefetching progress reported by HAS to preferentially dispatch warps whose prefetched data are ready, adjusting dispatch timing to match the prefetching progress more precisely and making the prefetching mechanism more efficient. Experimental results for 12 applications on CASLAB-GPUSIM show that HAS combined with PTE achieves an average performance improvement of 10.4% and reduces the data cache miss rate by 7.8%. In terms of prefetching performance, the accuracy reaches 67.7%, and the proportion of prefetch requests dispatched at the appropriate time reaches 48.2%, far exceeding the NL and SLD methods mentioned above.
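    To make the described mechanism concrete, the following is a minimal C++ sketch of stride detection with a history table and per-warp prefetch throttling, in the spirit of HAS as summarized above. All names and parameters (StridePrefetcher, HistoryEntry, PREFETCH_DEGREE, and so on) are illustrative assumptions, not the thesis's actual implementation.

```cpp
// Sketch of a stride prefetcher with a history table and per-warp throttling.
// A stride is trusted only after it repeats, and each warp may have at most
// PREFETCH_DEGREE prefetches outstanding, so no warp runs too far ahead.
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr int PREFETCH_DEGREE = 2;  // assumed per-warp outstanding-prefetch cap

struct HistoryEntry {
    uint64_t last_addr = 0;     // last demand address seen for this (pc, warp)
    int64_t  stride    = 0;     // last observed stride
    bool     confirmed = false; // stride observed twice in a row
};

struct WarpState {
    int inflight = 0;           // prefetches issued but not yet consumed
};

class StridePrefetcher {
public:
    // Called on every demand load; returns prefetch addresses to issue.
    std::vector<uint64_t> onAccess(uint64_t pc, int warp_id, uint64_t addr) {
        std::vector<uint64_t> prefetches;
        HistoryEntry &e = table_[key(pc, warp_id)];
        WarpState &w = warps_[warp_id];

        int64_t stride = static_cast<int64_t>(addr) -
                         static_cast<int64_t>(e.last_addr);
        if (e.last_addr != 0 && stride != 0) {
            e.confirmed = (stride == e.stride);  // same stride twice: trust it
            e.stride = stride;
        }
        e.last_addr = addr;

        // A demand access consumes one previously issued prefetch.
        if (w.inflight > 0) --w.inflight;

        // Run ahead only while this warp's prefetch budget allows.
        while (e.confirmed && w.inflight < PREFETCH_DEGREE) {
            addr += e.stride;
            prefetches.push_back(addr);
            ++w.inflight;
        }
        return prefetches;
    }

private:
    static uint64_t key(uint64_t pc, int warp_id) {
        return (pc << 8) ^ static_cast<uint64_t>(warp_id);
    }
    std::unordered_map<uint64_t, HistoryEntry> table_;
    std::unordered_map<int, WarpState> warps_;
};
```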

Summary:

    Based on stride prefetching, this thesis proposes History-Awoken Stride (HAS) prefetching, optimized with a warp scheduling strategy called Prefetched-Then-Executed (PTE). HAS finds the strides across loop iterations, warps, and workgroups from previous memory access requests, and pre-accesses the data that may be used in the future. Through the history table mechanism, subsequent requests have the opportunity to reuse past access information and quickly start prefetching operations. A per-warp status is used to control the prefetching progress of each warp, avoiding sending too many or premature prefetch requests. PTE uses the warp status reported by HAS to assign warps to four priority levels that accurately match the prefetching progress, so that dispatch timing meets the time at which the data are needed.
    The experimental results of LeNet-5 inference and 11 PolyBench benchmarks on CASLAB-GPUSIM show that the proposed mechanism achieves an average performance improvement of 10.4% and a 7.8% reduction in data cache miss rate. The prefetch accuracy reaches 67.7%, and the proportion of prefetch requests sent at the appropriate time reaches 48.2%.
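    The four-level priority pick that PTE performs each cycle can be sketched as follows. The concrete priority classes are an assumption based on the summary above: warps whose prefetched data have already arrived dispatch first, and warps that would outrun the prefetcher dispatch last. All names (PrefetchStatus, pickWarp, Warp) are illustrative.

```cpp
// Sketch of a four-level-priority warp pick in the spirit of PTE.
// Lower enum value = higher dispatch priority.
#include <cstdint>
#include <vector>

enum class PrefetchStatus : int {
    DataReady  = 0,  // prefetched data already in cache: dispatch first
    NoPrefetch = 1,  // no prefetch associated with this warp: neutral
    Issued     = 2,  // prefetch sent but not yet arrived: hold back
    TooEarly   = 3,  // dispatching now would outrun the prefetcher: last
};

struct Warp {
    int id;
    bool ready;             // has a decoded, dependency-free instruction
    uint64_t age;           // cycles since the warp was launched
    PrefetchStatus status;  // reported by the prefetcher (HAS in the thesis)
};

// Pick the ready warp in the highest priority class; oldest first within
// a class. Returns nullptr when no warp can dispatch this cycle.
const Warp *pickWarp(const std::vector<Warp> &warps) {
    const Warp *best = nullptr;
    for (const Warp &w : warps) {
        if (!w.ready) continue;
        if (!best || w.status < best->status ||
            (w.status == best->status && w.age > best->age)) {
            best = &w;
        }
    }
    return best;
}
```

    Breaking ties by age within a class mirrors GTO-style oldest-first dispatch; the dual-mode fallback the thesis describes (Section 3.3.3) is not modeled in this sketch.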

Table of Contents:

    Abstract (Chinese)
    Summary (English)
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1  Introduction
      1.1 Motivation
      1.2 Contributions
      1.3 Thesis Organization
    Chapter 2  Background and Related Work
      2.1 CASLAB-GPUSIM Internal Architecture
        2.1.1 Streaming Multiprocessor (SM)
        2.1.2 Instruction Buffer (IB)
        2.1.3 Dependency Check Unit (DCU)
        2.1.4 Warp Scheduler (WS)
        2.1.5 Execution Unit (EXE) & Load/Store Unit (LSU)
      2.2 Warp Scheduling
        2.2.1 Loose Round-Robin (LRR) Warp Scheduling
        2.2.2 Greedy-Then-Oldest (GTO) Warp Scheduling
        2.2.3 Two-Level (TL) Warp Scheduling
        2.2.4 Prefetch-Aware (PA) Warp Scheduling
      2.3 Data Prefetching
        2.3.1 Next-Line (NL) Prefetching
        2.3.2 Stride (STR) Prefetching
        2.3.3 Spatial Locality Detection (SLD) Prefetching
        2.3.4 Impact of Warp Scheduling
      2.4 CASLAB-GPUSIM Data Cache
    Chapter 3  Design Methodology
      3.1 Application Observations
        3.1.1 Non-contiguous Access Behavior of Applications
        3.1.2 Impact of Workgroup Scheduling
        3.1.3 Address Access Patterns of Memory Instructions
      3.2 History-Awoken Stride (HAS) Prefetching
        3.2.1 Stride Prefetching
        3.2.2 Warp Status
        3.2.3 Prefetch Status
        3.2.4 History Table
      3.3 Prefetched-Then-Executed (PTE) Scheduling
        3.3.1 Oldest Load Warp Status
        3.3.2 Four-Level Priority
        3.3.3 Dual-Mode Scheduling
      3.4 Cost Evaluation
    Chapter 4  Experimental Results and Performance Evaluation
      4.1 Experimental Platform
      4.2 Results and Analysis
        4.2.1 Instruction-Per-Cycle
        4.2.2 Miss Rate
        4.2.3 Prefetching Performance
    Chapter 5  Conclusion
    References

Full-text availability: on campus from 2024-11-01; off campus from 2024-11-01