| Graduate Student: | 詹世安 Zhan, Shi-An |
|---|---|
| Thesis Title: | 卷積神經網路之脈動陣列加速器與數據設置模組設計 (Design of Systolic Array Accelerator and Data Setup Module for Convolutional Neural Networks) |
| Advisor: | 郭致宏 Kuo, Chih-Hung |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Electrical Engineering |
| Year of Publication: | 2022 |
| Academic Year: | 110 |
| Language: | Chinese |
| Number of Pages: | 49 |
| Keywords (Chinese): | 脈動陣列, 深度學習, 卷積神經網路 |
| Keywords (English): | systolic array, deep learning, convolutional neural network |
The two-dimensional systolic array is regarded as an effective architecture for performing convolution operations. By converting the input feature map into an input matrix through a data setup procedure, a 2D systolic array can efficiently compute the matrix multiplication equivalent to a convolutional layer. However, the input matrix produced by a conventional data setup procedure requires a large amount of memory to store the data duplicated across overlapping convolution sliding windows. This thesis designs a systolic array accelerator for neural network inference that supports convolutional layers, pooling layers, and zero padding. To reduce the memory used for storing duplicate data, we propose a data setup module that reuses data effectively. Compared with previous systolic array accelerators, the proposed design improves performance by 1.43× on YOLO v4-Tiny and 1.61× on VGG-16, and reduces the area of the data setup module to 1/3.35 of the original.
To speed up the computation of Convolutional Neural Networks (CNNs), 2D systolic arrays are regarded as an effective architecture for performing convolutional operations. However, converting a convolution into a matrix multiplication requires the image-to-column (im2col) transform, which needs a large local buffer to store the data duplicated across overlapping sliding windows. In this work, we propose the Row Buffers with Multiplexers (RBM) module to select and reuse the repeated data. Compared with previous systolic array accelerators, the proposed accelerator improves performance by up to 1.43× and 1.61× on the YOLO v4-Tiny and VGG-16 models, respectively, and reduces the area of the data setup component by up to 3.35×.
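As a rough illustration of why im2col-based data setup inflates the local buffer, the following minimal NumPy sketch (a single-channel toy example of the standard im2col transform, not the thesis's RBM hardware) unfolds an input feature map into the input matrix and computes the convolution as a matrix multiplication; for a 3×3 kernel at stride 1, each input pixel is copied roughly nine times.

```python
import numpy as np

def im2col(feature_map, kernel_size, stride=1, pad=0):
    """Unfold a single-channel feature map into an im2col matrix.

    Each row of the output holds one kernel-sized sliding window, so a
    K x K kernel at stride 1 copies most input pixels about K*K times,
    which is the duplication the abstracts above refer to.
    """
    # Zero padding, one of the operations the accelerator supports.
    x = np.pad(feature_map, pad, mode="constant")
    h, w = x.shape
    k = kernel_size
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1

    cols = np.empty((out_h * out_w, k * k), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + k, j * stride:j * stride + k]
            cols[i * out_w + j] = window.ravel()
    return cols, (out_h, out_w)

# Convolution as matrix multiplication: (out_h*out_w, K*K) @ (K*K,).
fmap = np.arange(36, dtype=np.float32).reshape(6, 6)   # toy 6x6 input
kernel = np.ones((3, 3), dtype=np.float32)             # toy 3x3 kernel
cols, out_shape = im2col(fmap, kernel_size=3, stride=1, pad=1)
conv_out = (cols @ kernel.ravel()).reshape(out_shape)

# The im2col matrix holds 36 * 9 = 324 values for a 36-pixel input,
# showing the roughly K*K-fold buffer growth a naive data setup needs.
print(cols.shape, conv_out.shape)
```

The RBM module described above targets exactly this duplication: it reuses the repeated data from overlapping windows instead of storing every copy in the input matrix.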