
Graduate Student: Chou, Pei-Chen (周沛辰)
Thesis Title: L2-Cache Optimizations for Convolutional Neural Network on CASLab-GPGPU
Advisor: Chen, Chung-Ho (陳中和)
Degree: Master
Department: Institute of Computer & Communication Engineering, College of Electrical Engineering & Computer Science
Year of Publication: 2021
Graduation Academic Year: 109
Language: Chinese
Number of Pages: 69
Keywords (Chinese): general-purpose graphics processor, convolutional neural network, data prefetching
Keywords (English): GPGPU, CNN, Data Prefetching
    In recent years, fields such as artificial intelligence, machine learning, and image recognition have matured considerably, drawing more and more researchers and making heavy use of graphics processors for these applications. The CASLab-GPGPU platform developed by our laboratory targets model computation in exactly these fields, and image recognition commonly uses convolutional neural networks (CNNs) as its models. This thesis therefore analyzes and evaluates the execution of CNNs on CASLab-GPGPU, identifies the performance bottlenecks, and optimizes them.
    By examining how CASLab-GPGPU executes convolutional neural networks, this thesis finds that the fully connected layers perform poorly. Building on Next-Line Prefetching, we design Block-Base Prefetching to improve fully-connected-layer performance and implement it in the L2 cache of CASLab-GPGPU. Next-Line Prefetching can fetch duplicate data when its prefetch degree is raised, so in practice it can only prefetch the next cache line. The proposed Block-Base Prefetching instead divides memory into blocks and records the access progress of each block in a Memory Block History Table; with this record, the prefetch degree can be raised while consulting the table to avoid prefetching data that has already been fetched. In fully-connected-layer tests, Block-Base Prefetching reaches up to 2.6x the original performance and raises the L2 cache hit rate by nearly 45%. Across the tested convolutional neural networks, Block-Base Prefetching also brings an average 70% performance gain, a marked improvement over Next-Line Prefetching.
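The duplicate-prefetch weakness of next-line prefetching described above can be seen in a minimal Python sketch. The 128-byte cache-line size, function name, and degree parameter are assumptions for illustration, not the thesis's RTL design:

```python
# Hypothetical model of next-line prefetching with a configurable
# prefetch degree (cache-line size assumed to be 128 bytes).
CACHE_LINE = 128

def next_line_prefetch(miss_addr, degree=1):
    """Return the addresses of the next `degree` cache lines after a miss."""
    base = (miss_addr // CACHE_LINE) * CACHE_LINE
    return [base + CACHE_LINE * (i + 1) for i in range(degree)]

# With degree > 1, consecutive misses issue overlapping requests:
# a miss on line 0 prefetches lines 1 and 2, and the following miss
# on line 1 prefetches lines 2 and 3 -- line 2 is requested twice.
first = next_line_prefetch(0, degree=2)           # lines 1 and 2
second = next_line_prefetch(CACHE_LINE, degree=2)  # lines 2 and 3
duplicates = set(first) & set(second)              # line 2 fetched twice
```

This overlap is why plain next-line prefetching is normally restricted to degree 1, which motivates tracking per-block access history instead.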

    The target application of CASLab-GPGPU is AI (Artificial Intelligence), and the CNN (Convolutional Neural Network) is a popular network type in image recognition. We analyze the execution of the classic CNN LeNet-5 on CASLab-GPGPU. The fully connected layers take a large share of LeNet-5's total execution time, and their L2-cache hit rate is much lower than that of the other layers, so a great deal of time is spent accessing DRAM. We propose a Block-Base prefetching scheme that increases the L2-cache hit rate to improve fully-connected-layer performance; it can prefetch more than one cache line at a time while avoiding prefetching cache lines that are already in the cache. Every access, including each prefetch access, is recorded in a memory block history table, which the prefetcher consults to avoid prefetching cache lines accessed before. Experimental results for the fully connected layer show that Block-Base prefetching improves performance by 160%, and results across different CNNs show over 60% performance improvement. These results show that Block-Base prefetching makes CASLab-GPGPU more efficient when executing CNN models.
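The memory-block-history idea in the abstract can be sketched as a small Python model. The block size, table layout, and all names here are assumptions for illustration; the thesis implements this in L2-cache hardware:

```python
# Illustrative model of Block-Base prefetching: memory is divided into
# fixed-size blocks, and a history table records which cache lines in
# each block have been accessed (demand accesses and prefetches alike),
# so raising the prefetch degree never re-fetches a line.
CACHE_LINE = 128        # assumed cache-line size in bytes
LINES_PER_BLOCK = 16    # assumed memory-block size: 16 cache lines

class MemoryBlockHistoryTable:
    def __init__(self):
        self.table = {}  # block id -> set of accessed line indices

    def record(self, addr):
        """Record any access (demand or prefetch) in the history table."""
        block, line = divmod(addr // CACHE_LINE, LINES_PER_BLOCK)
        self.table.setdefault(block, set()).add(line)

    def prefetch(self, miss_addr, degree):
        """Return up to `degree` not-yet-accessed line addresses after a miss."""
        self.record(miss_addr)
        block, line = divmod(miss_addr // CACHE_LINE, LINES_PER_BLOCK)
        out = []
        for nxt in range(line + 1, LINES_PER_BLOCK):
            if len(out) == degree:
                break
            if nxt not in self.table[block]:
                self.table[block].add(nxt)  # prefetches are recorded too
                out.append((block * LINES_PER_BLOCK + nxt) * CACHE_LINE)
        return out
```

For example, a miss on line 0 with degree 2 prefetches lines 1 and 2; a subsequent miss on line 1 then skips the already-recorded line 2 and prefetches lines 3 and 4 instead, avoiding the duplicate request that next-line prefetching would issue.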

    Table of Contents
    Abstract / Acknowledgments / Contents / List of Tables / List of Figures
    Chapter 1  Introduction
      1.1 Motivation
      1.2 Contributions
      1.3 Thesis Organization
    Chapter 2  Background and Related Work
      2.1 CASLab-GPGPU Platform System Architecture
        2.1.1 CASLab-GPGPU Architecture
        2.1.2 Streaming Multiprocessor (SM)
        2.1.3 L1 Instruction Cache (IC)
        2.1.4 Warp Scheduler (WS)
        2.1.5 Load/Store Unit (LSU)
        2.1.6 L1 Data Cache (DC)
      2.2 Convolutional Neural Networks
        2.2.1 LeNet-5
        2.2.2 Convolutional Layer
        2.2.3 Pooling Layer
        2.2.4 Fully Connected Layer
        2.2.5 Activation Function
      2.3 Cache Data Prefetching
        2.3.1 Next-Line Prefetching
        2.3.2 Stride Prefetching
        2.3.3 Global History Buffer Prefetching
      2.4 Non-Blocking Cache and MSHR
        2.4.1 MSHR (Miss Status Holding Register)
    Chapter 3  Design Methodology
      3.1 L2-Cache Design
        3.1.1 Hardware Block Diagram and Operation Flow
        3.1.2 Configuration and Analysis
      3.2 Application Observations
        3.2.1 LeNet-5 Execution Analysis
        3.2.2 Fully Connected Layer Analysis and Access Behavior
      3.3 Block-Base Prefetching
        3.3.1 Memory Block History Table
        3.3.2 Prefetching Pointer
        3.3.3 Operation Flow Chart
      3.4 Added Hardware Cost Estimation
    Chapter 4  Experimental Results and Performance Evaluation
      4.1 Experimental Platform Hardware Specifications
      4.2 Results and Analysis
        4.2.1 Hit Prefetch Threshold
        4.2.2 Fully Connected Layer
        4.2.3 Convolutional Layer
        4.2.4 CNN Models
    Chapter 5  Conclusion
    References


    Availability (on campus): public as of 2023-10-08
    Availability (off campus): public as of 2023-10-08