
Author: Liu, Chia-Ning (劉珈寧)
Title: Design of 2D Systolic Array Accelerator for Quantized Convolutional Neural Networks (量化卷積神經網路之二維脈動陣列加速器設計)
Advisor: Kuo, Chih-Hung (郭致宏)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2020
Graduation Academic Year: 109
Language: Chinese
Pages: 59
Keywords (Chinese): 深度學習、卷積神經網路、硬體加速器、脈動陣列
Keywords (English): deep learning, convolutional neural networks, hardware accelerator, systolic array
    Deep learning and artificial intelligence have recently become increasingly popular, with a wide range of applications. To improve accuracy, neural networks have grown deeper and larger, and their computation and parameter counts have increased accordingly. Network quantization and many dedicated hardware accelerators have therefore been proposed to reduce computation and speed up inference. This thesis designs an accelerator architecture for quantized convolutional neural networks in which both input activations and weights are quantized to 8-bit integers. The accelerator supports convolutional layers and fully-connected layers of various sizes in different network models. The computation core adopts a two-dimensional systolic array structure to reduce memory-access energy and increase throughput. To cooperate with the systolic array and minimize external memory access, a dedicated on-chip memory is designed, and a Fetcher architecture is proposed for input access. Compared with Eyeriss, the overall design reduces external memory access by 1.79x and 1.63x for the convolutional layers of VGG-16 and AlexNet, and reduces internal memory access by 17.48x and 7.31x.
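
    Both abstracts state that input activations and weights are quantized to 8-bit integers before they reach the accelerator. The quantization scheme itself is described in the thesis body (Section 2-3), not in this record, so the NumPy sketch below only illustrates one common convention, symmetric per-tensor int8 quantization with a floating-point scale and 32-bit accumulation; the function name and layer sizes are illustrative, not taken from the thesis.

        import numpy as np

        def quantize_int8(x):
            # Symmetric per-tensor quantization: pick a scale so the largest
            # magnitude maps near +/-127, then round and clamp to the int8 range.
            scale = max(float(np.max(np.abs(x))), 1e-8) / 127.0
            q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
            return q, scale

        # Toy float32 activations and weights for one small layer.
        a = np.random.randn(32, 64).astype(np.float32)
        w = np.random.randn(64, 16).astype(np.float32)

        a_q, a_s = quantize_int8(a)
        w_q, w_s = quantize_int8(w)

        # Integer multiply-accumulate in a 32-bit accumulator, as a fixed-point
        # datapath would do; one float multiply per output rescales the result.
        acc = a_q.astype(np.int32) @ w_q.astype(np.int32)
        approx = acc * (a_s * w_s)
        print("max abs error:", np.max(np.abs(approx - a @ w)))

    Running this prints a small reconstruction error, which is why an 8-bit integer datapath can replace most floating-point multiplies at a modest accuracy cost.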

    Deep learning and artificial intelligence (AI) have received a lot of attention recently. Through large data sets and dedicated algorithms, machines are being programmed to complete specific tasks. Most operations in these algorithms are multiplications and additions, a characteristic that makes them well suited to dedicated hardware accelerators. Quantized networks are now being studied to reduce the computing and memory requirements of deep neural networks. These networks quantize full-precision weights and activations to lower bit-width fixed-point or integer representations with only a small loss of accuracy, which is helpful for hardware with limited power and storage capacity. In this work, we propose a systolic-array-based architecture to accelerate convolutional networks with 8-bit integer data. The accelerator supports both the convolutional layers (CLs) and fully-connected layers (FCLs) of various neural network models. The computing unit uses systolic structures to balance computation with I/O and improve throughput. To cooperate with the systolic array and minimize external memory access, we also design a dedicated on-chip memory. Compared with Eyeriss, the external memory access of CLs is reduced by 1.63x and 1.79x in AlexNet and VGG-16, and the internal memory access is also reduced by 7.31x and 17.48x.
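
    The accelerator maps convolutional and fully-connected layers onto a two-dimensional systolic array (Sections 4-1-1 and 4-1-3), but this record does not reproduce the dataflow details. As a rough illustration of the general technique only, the sketch below simulates a textbook output-stationary systolic array performing an int8 matrix multiplication with operands injected along skewed diagonals; convolution can be lowered to such a GEMM as in [5]. It is an assumption-based model, not the design proposed in the thesis.

        import numpy as np

        def systolic_matmul_int8(A, B):
            # Cycle-by-cycle model of an output-stationary M x N systolic array
            # computing C = A @ B with int8 operands and int32 accumulators.
            M, K = A.shape
            _, N = B.shape
            acc = np.zeros((M, N), dtype=np.int32)    # one accumulator per PE
            a_reg = np.zeros((M, N), dtype=np.int32)  # A operand held in each PE
            b_reg = np.zeros((M, N), dtype=np.int32)  # B operand held in each PE
            for t in range(M + N + K - 2):            # compute plus fill/drain cycles
                a_reg = np.roll(a_reg, 1, axis=1)     # A values move one PE right
                b_reg = np.roll(b_reg, 1, axis=0)     # B values move one PE down
                for i in range(M):                    # row i of A enters skewed by i
                    k = t - i
                    a_reg[i, 0] = A[i, k] if 0 <= k < K else 0
                for j in range(N):                    # column j of B enters skewed by j
                    k = t - j
                    b_reg[0, j] = B[k, j] if 0 <= k < K else 0
                acc += a_reg * b_reg                  # every PE does one MAC per cycle
            return acc

        A = np.random.randint(-128, 128, (4, 6)).astype(np.int8)
        B = np.random.randint(-128, 128, (6, 5)).astype(np.int8)
        assert np.array_equal(systolic_matmul_int8(A, B),
                              A.astype(np.int32) @ B.astype(np.int32))

    The extra M + N - 2 cycles beyond the K compute steps are the fill and drain overhead of streaming operands through the array; the final assert checks that the skewed schedule reproduces the exact integer product.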

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1  Introduction
      1-1  Preface
      1-2  Research Motivation
      1-3  Research Contributions
      1-4  Thesis Organization
    Chapter 2  Background
      2-1  Deep Learning and Neural Networks
      2-2  Convolutional Neural Networks
      2-3  Quantization of Convolutional Neural Networks
      2-4  Systolic Arrays
      2-5  Tensor Processing Unit (TPU)
    Chapter 3  Review of Convolutional Network Hardware Accelerators
      3-1  CNN Acceleration Hardware Architectures
        3-1-1  The DianNao Series
        3-1-2  The Eyeriss Series
        3-1-3  Low-Precision Network Accelerators
          3-1-3-1  UNPU
          3-1-3-2  QUEST
        3-1-4  Systolic Array Accelerators
          3-1-4-1  MPNA
          3-1-4-2  VWA
      3-2  Comparison of Related Approaches
    Chapter 4  CNN Acceleration Hardware Design
      4-1  Computing Unit
        4-1-1  Systolic Architecture Design
        4-1-2  SCALE-Sim
        4-1-3  Fully-Connected Layer Computation
      4-2  Storage Architecture
        4-2-1  Fetcher Architecture
        4-2-2  Storage Size and Energy Analysis
    Chapter 5  Experimental Environment and Data Analysis
    Chapter 6  Conclusion and Future Work
      6-1  Conclusion
      6-2  Future Work
    References

    [1] Chen, Yu-Hsin, et al. "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks." IEEE Journal of Solid-State Circuits 52.1: 127-138, 2017.
    [2] Chen, Yu-Hsin, et al. "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices." IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9.2: 292-308, 2019.
    [3] Lee, Jinmook, et al. "UNPU: A 50.6 TOPS/W unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision." 2018 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2018.
    [4] Ueyoshi, Kodai, et al. "QUEST: Multi-purpose log-quantized DNN inference engine stacked on 96-MB 3-D SRAM using inductive coupling technology in 40-nm CMOS." IEEE Journal of Solid-State Circuits 54.1: 186-196, 2019.
    [5] Vasudevan, Aravind, Andrew Anderson, and David Gregg. "Parallel multi channel convolution using general matrix multiplication." 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, 2017.
    [6] Samajdar, Ananda, et al. "SCALE-sim: Systolic CNN accelerator." arXiv preprint arXiv:1811.02883, 2018.
    [7] Horowitz, Mark. "1.1 computing's energy problem (and what we can do about it)." 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). IEEE, 2014.
    [8] Krizhevsky, Alex. "One weird trick for parallelizing convolutional neural networks." arXiv preprint arXiv:1404.5997, 2014.
    [9] F. Rosenblatt, "The perceptron: A probabilistic model for information storage and organization in the brain," Psychological Review, vol. 65, no. 6, pp. 386-408, 1958.
    [10] LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11: 2278-2324, 1998.
    [11] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.
    [12] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556, 2014.
    [13] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition, 2015.
    [14] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
    [15] Redmon, Joseph, and Ali Farhadi. "Yolov3: An incremental improvement." arXiv preprint arXiv:1804.02767, 2018.
    [16] M. Courbariaux, Y. Bengio, and J.-P. David, "Binaryconnect: Training deep neural networks with binary weights during propagations," in Advances in neural information processing systems, pp. 3123-3131, 2015.
    [17] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "Xnor-net: Imagenet classification using binary convolutional neural networks," in European Conference on Computer Vision, pp. 525-542, 2016.
    [18] F. Li, B. Zhang, and B. Liu, "Ternary weight networks," arXiv preprint arXiv:1605.04711, 2016.
    [19] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv preprint arXiv:1606.06160, 2016.
    [20] Hubara, Itay, et al. "Quantized neural networks: Training neural networks with low precision weights and activations." The Journal of Machine Learning Research, vol. 18, no.1, pp. 6869-6898, 2017.
    [21] Kung, H.T., and Charles E. Leiserson. "Systolic arrays (for VLSI)." Sparse Matrix Proceedings 1978. Vol. 1. Society for Industrial and Applied Mathematics, 1979.
    [22] Kung, H. T. "Why systolic architectures?" Computer 15.1: 37-46, 1982.
    [23] Chen, Tianshi, et al. "Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning." ACM SIGARCH Computer Architecture News 42.1: 269-284, 2014.
    [24] Chen, Yunji, et al. "Dadiannao: A machine-learning supercomputer." 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 2014.
    [25] Du, Zidong, et al. "ShiDianNao: Shifting vision processing closer to the sensor." Proceedings of the 42nd Annual International Symposium on Computer Architecture. 2015.
    [26] Liu, Daofu, et al. "Pudiannao: A polyvalent machine learning accelerator." ACM SIGARCH Computer Architecture News 43.1: 369-381, 2015.
    [27] Liu, Shaoli, et al. "Cambricon: An instruction set architecture for neural networks." 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2016.
    [28] Zhang, Shijin, et al. "Cambricon-x: An accelerator for sparse neural networks." 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016.
    [29] Hanif, Muhammad Abdullah, et al. "MPNA: A massively-parallel neural array accelerator with dataflow optimization for convolutional neural networks." arXiv preprint arXiv:1810.12910, 2018.
    [30] K. Chang and T. Chang, "VWA: Hardware Efficient Vectorwise Accelerator for Convolutional Neural Network," in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 67, no. 1, pp. 145-154, Jan. 2020.
    [31] Jouppi, Norman P., et al. "In-datacenter performance analysis of a tensor processing unit." Proceedings of the 44th Annual International Symposium on Computer Architecture. 2017.
    [32] Google (2020). "Cloud TPU System Architecture." Retrieved from https://cloud.google.com/tpu/docs/system-architecture.
    [33] Parhi, Keshab K. "VLSI digital signal processing systems: design and implementation." John Wiley & Sons, 2007.
    [34] 吳庭嘉 (2020). Design of a one-dimensional convolution accelerator supporting data reuse and filter-size scalability and its electronic system level verification platform. Institute of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan.

    Full text released on campus: 2022-10-30
    Full text released off campus: 2022-10-30