
Author: Hong, Hao-Chen (洪浩宸)
Title: Design and Performance Analysis of a Shape-Reconfigurable Systolic Array Accelerator (形狀可重構式脈動陣列加速器之設計與效能分析)
Advisor: Kuo, Chih-Hung (郭致宏)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2026
Graduation Academic Year: 114 (ROC calendar)
Language: Chinese
Pages: 96
Keywords: systolic array, AI accelerator, deep learning
Abstract:

    Systolic arrays have been widely adopted for efficient matrix multiplication in deep neural networks. However, a fixed array shape cannot efficiently accommodate diverse matrix dimensions: when the array shape does not match the matrix dimensions, processing-element utilization decreases or data movement increases, which in turn lengthens overall processing latency. This work proposes a Shape-Reconfigurable Systolic Array Accelerator (SRSAA) that dynamically switches among array shapes at runtime, so that each matrix multiplication can use a better-matched shape and thereby improve performance. Experimental results show that, compared with related works, the SRSAA achieves up to a 4× improvement in hardware efficiency.
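The utilization argument in the abstract can be illustrated with a toy tiling model. The sketch below is an assumption-laden simplification, not the thesis's actual latency-estimation method: it supposes an R×C array computes an M×N output in ceil(M/R) × ceil(N/C) passes, so edge tiles leave PEs idle. The function name `utilization` and the example shapes are hypothetical.

```python
import math

def utilization(M, N, R, C):
    """Fraction of PE-cycles doing useful work when an MxN output
    is tiled onto an RxC systolic array (simplified model:
    partially filled edge tiles leave PEs idle)."""
    passes = math.ceil(M / R) * math.ceil(N / C)
    return (M * N) / (passes * R * C)

# A tall-skinny result (M=256, N=8) on a square 32x32 array:
# only 8 of 32 columns carry data, so utilization is low.
fixed = utilization(256, 8, 32, 32)      # 0.25

# The same 1024 PEs reshaped into a 128x8 array match
# the matrix dimensions, and no PE-cycles are wasted.
reshaped = utilization(256, 8, 128, 8)   # 1.0

print(f"32x32 array:  {fixed:.0%} utilization")
print(f"128x8 array: {reshaped:.0%} utilization")
```

Under this model, reshaping the same PE budget from 32×32 to 128×8 raises utilization from 25% to 100% for this workload, which is the kind of gap the SRSAA's per-matmul shape selection targets.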

Table of Contents:

    Chinese Abstract
    Extended English Abstract
    Acknowledgments
    Chapter 1: Introduction
      1-1 Preface
      1-2 Research Motivation
      1-3 Research Contributions
      1-4 Thesis Organization
    Chapter 2: Background
      2-1 Deep Neural Networks
      2-2 Neural Network Operations
      2-3 Systolic Arrays
      2-4 Common Accelerator Dataflows
    Chapter 3: Literature Review of Neural Network Accelerators
      3-1 Fixed-Architecture Accelerator Designs
        3-1-1 Eyeriss
        3-1-2 Tensor Processing Unit (TPU)
        3-1-3 Design of FPGA-Based Reconfigurable Hardware Acceleration System
      3-2 Reconfigurable and Flexible Accelerator Architectures
        3-2-1 ReSA
        3-2-2 SAVector
        3-2-3 Flex-TPU
      3-3 Comparison of Related Methods
    Chapter 4: Design of the Shape-Reconfigurable Systolic Array Accelerator and Shape Performance Estimation
      4-1 Systolic Array Shape Performance Estimation
        4-1-1 Mapping Convolution onto a Two-Dimensional Systolic Array
        4-1-2 Tiling Method
        4-1-3 Latency Estimation Method
      4-2 Accelerator Design
        4-2-1 Shape-Reconfigurable Systolic Array
        4-2-2 Data Setup Module
          4-2-2-1 Buffer Storage and Arrangement
          4-2-2-2 Control Logic
        4-2-3 Post-Processing Module
        4-2-4 Pooling Module
        4-2-5 Upsample Module
      4-3 Accelerator Operation Flow
      4-4 Instruction Set Architecture of the Accelerator
        4-4-1 Instruction Set Architecture (ISA)
      4-5 Performance Analysis with an Added Global Buffer
    Chapter 5: Experimental Environment and Data Analysis
      5-1 Performance Analysis of Different Systolic Array Shapes
        5-1-1 Analysis under Non-Ideal External Memory Bandwidth and Different Operating Frequencies
        5-1-2 Impact of the Added Global Buffer on Performance
      5-2 Hardware Environment and Accelerator Specifications
      5-3 Comparison of Fixed and Dynamically Reconfigured Array Shapes
      5-4 Comparison with Other Works
      5-5 Comparison with Other Reconfigurable and Flexible Accelerators
    Chapter 6: Conclusion and Future Work
      6-1 Conclusion
      6-2 Future Work
    References

