Author: Zhan, Shi-An (詹世安)
Title: Design of Systolic Array Accelerator and Data Setup Module for Convolutional Neural Networks (卷積神經網路之脈動陣列加速器與數據設置模組設計)
Advisor: Kuo, Chih-Hung (郭致宏)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2022
Graduation Academic Year: 110 (2021-2022)
Language: Chinese
Pages: 49
Keywords: systolic array, deep learning, convolutional neural network
Chinese Abstract (translated):

The 2D systolic array is regarded as an effective architecture for performing convolution operations. By converting the input feature map into an input matrix through a data setup procedure, a 2D systolic array can efficiently compute the matrix multiplication equivalent to a convolutional layer. However, the input matrix produced by a typical data setup procedure requires a large amount of memory to store the data duplicated between overlapping convolution sliding windows. This thesis presents a systolic array accelerator for neural network inference that supports convolutional layers, pooling layers, and zero-padding. To reduce the memory needed to store the duplicated data, we propose a data setup module that reuses data effectively. Compared with previous systolic array accelerators, the proposed design improves performance by 1.43× on YOLO v4-Tiny and 1.61× on VGG-16, and shrinks the area of the data setup module to 1/3.35 of the original.

English Abstract:

To speed up the computation of Convolutional Neural Networks (CNNs), 2D systolic arrays are regarded as an effective architecture for performing convolution operations. However, converting a convolution into a matrix multiplication requires the image-to-column (im2col) transform, which needs a large local buffer to store the duplicated data from overlapping sliding windows. In this work, we propose the Row Buffers with Multiplexers (RBM) module to select and reuse the repeated data. Compared with previous systolic array accelerators, the proposed accelerator improves performance by up to 1.43× and 1.61× on the YOLO v4-Tiny and VGG-16 models, respectively, and reduces the area of the data setup component by up to 3.35×.
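As a rough illustration of why the data setup step is memory-hungry, below is a minimal Python/NumPy sketch of the im2col transform for a single-channel feature map; the function name and array layout are illustrative assumptions, not the accelerator's actual data path. With a k×k kernel at stride 1, adjacent columns of the resulting matrix share k(k-1) of their k^2 entries; this is the duplication the proposed RBM module avoids materializing by buffering feature-map rows and selecting the repeated entries with multiplexers.

    import numpy as np

    def im2col(x, k, stride=1):
        """Unfold a single-channel H x W feature map into the input matrix
        fed to the systolic array: one column per sliding-window position,
        one row per element of the k x k kernel."""
        H, W = x.shape
        out_h = (H - k) // stride + 1
        out_w = (W - k) // stride + 1
        cols = np.empty((k * k, out_h * out_w), dtype=x.dtype)
        for i in range(out_h):
            for j in range(out_w):
                patch = x[i*stride:i*stride + k, j*stride:j*stride + k]
                cols[:, i * out_w + j] = patch.ravel()  # flatten the window
        return cols

    x = np.arange(16, dtype=np.int32).reshape(4, 4)  # 4x4 feature map
    cols = im2col(x, k=3)                            # 9x4 input matrix
    # The naive input matrix stores 36 values for a 16-value feature map,
    # because overlapping 3x3 windows at stride 1 repeat most entries.
    print(x.size, cols.size)  # 16 36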

Table of Contents

Chinese Abstract  I
Table of Contents  XI
List of Figures  XIII
List of Tables  XV
Chapter 1  Introduction  1
  1-1 Foreword  1
  1-2 Research Motivation  1
  1-3 Research Contributions  2
  1-4 Thesis Organization  3
Chapter 2  Background  4
  2-1 Deep Learning and Neural Networks  4
  2-2 Convolutional Neural Networks  5
  2-3 Quantized Convolutional Neural Networks  10
  2-4 Systolic Arrays  12
Chapter 3  Literature Review of CNN Accelerators  14
  3-1 Neural Network Accelerator Architectures  14
    3-1-1 The DianNao Series  14
    3-1-2 The Eyeriss Series  16
    3-1-3 Systolic Array Accelerators  18
      3-1-3-1 Tensor Processing Unit (TPU)  18
      3-1-3-2 VWA  19
      3-1-3-3 Systolic Array Accelerator for Quantized Convolutional Neural Network (SAQCNN)  20
      3-1-3-4 Systolic Array+ Structure (SAS)  21
      3-1-3-5 SPOTS  22
  3-2 Comparison of Related Methods  23
Chapter 4  Design of the Systolic Array Neural Network Accelerator  25
  4-1 Accelerator Computation Flow  26
    4-1-1 Tiling Method  27
    4-1-2 Mapping Convolution onto the 2D Systolic Array  28
    4-1-3 Pre-Processing Unit  30
    4-1-4 Post-Processing Unit  31
  4-2 Data Setup Module (Row Buffers with Multiplexers, RBM)  34
  4-3 Memory Architecture Design  37
  4-4 Control Circuit Instructions  38
Chapter 5  Experimental Environment and Data Analysis  40
  5-1 Effect of the Number of Row Buffers in the RBM  40
  5-2 Area and Power of the RBM  41
  5-3 Accelerator Performance and Comparison  41
  5-4 ESL Virtual Platform  43
Chapter 6  Conclusion and Future Work  45
  6-1 Conclusion  45
  6-2 Future Work  45
References  46


Full-text availability: on campus from 2024-09-05; off campus from 2024-09-05