
Author: Wu, Ting-Jia (吳庭嘉)
Thesis Title: Design of a One-Dimensional Convolution Accelerator Supporting Data Reuse and Filter-Size Scalability, and Its Electronic System-Level Verification Platform
A one-dimensional convolution accelerator supporting data reuse and multiple dimensional filters
Advisor: Chen, Chung-Ho (陳中和)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of Publication: 2020
Academic Year of Graduation: 109
Language: Chinese
Number of Pages: 65
Keywords (Chinese): convolution accelerator, data reuse, continuous dispatch, data mapping, electronic system-level design, edge device
Keywords (English): Convolution accelerator, Data reuse, ESL, Multiple dimensional filters
    With the rapid growth of machine learning and artificial intelligence applications, the associated neural network models have become increasingly complex. In recent years, the demand for fast on-device applications has also pushed neural network computation from the cloud toward edge devices. However, edge devices are constrained in hardware resources and compute capability, and the large number of parameters and convolution operations in the most widely used convolutional neural network models makes inference on such devices inefficient.
    To improve the efficiency of convolutional neural network inference on edge devices, this thesis designs a one-dimensional convolution hardware accelerator based on Micro Darknet For Inference (MDFI), a lightweight inference framework for edge devices previously developed in our laboratory. This thesis also proposes IPW Reuse, a method that combines three data reuse techniques: input feature map reuse, partial sum reuse, and weight reuse. During accelerator operation, input data already loaded into the accelerator are reused across computations, partial sums produced from earlier input batches are reused in accumulation to reduce memory requirements, and weights stored in the processing elements are reused as well, so that IPW Reuse lowers the accelerator's DRAM/SRAM accesses from several directions (a conceptual sketch follows this abstract). To let the one-dimensional convolution accelerator support a wide range of convolutional neural network models, this thesis further proposes optimizations for filter-size scalability, which also raise the MAC utilization of the hardware accelerator.
    This thesis also builds an ESL virtual platform that uses QEMU to emulate a RISC-V CPU execution environment running MDFI, so that the hardware design and computation flow of the convolution accelerator can be verified on the virtual platform and its performance can be evaluated. With the proposed design and optimizations, the accelerator achieves high MAC utilization: 100% on VGG16, and above 90% on models such as YOLOv3-tiny and AlexNet. When the accelerator executes the convolution layers alone at 300 MHz, it reaches 11.77 FPS on YOLOv3-tiny; when running together with an 814 MHz RISC-V CPU, it achieves 4.87 FPS.
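    The following is a minimal C sketch of the idea behind IPW Reuse, written only to make the reuse pattern concrete; the function and variable names are illustrative assumptions and are not taken from MDFI or the CU design. It builds one output row in local partial-sum storage across all input channels, keeps the filter row resident, and reuses each loaded input row for every filter tap, so the partial sums leave local storage only once.

/*
 * Conceptual sketch of IPW Reuse for one output row of a convolution layer
 * (hypothetical names; not the actual CU hardware or MDFI code).
 *   - Input reuse:       each loaded input row is reused by every filter tap.
 *   - Partial sum reuse: psum[] stays in local (SRAM-like) storage across
 *                        input channels instead of round-tripping to DRAM.
 *   - Weight reuse:      the 1D filter row stays resident while it slides
 *                        over the whole input row.
 */
#include <stddef.h>

#define K 3  /* filter width, e.g. one row of a 3x3 filter */

/* Accumulate one 1D convolution row (one input channel) into psum. */
static void conv1d_row_accumulate(const float *in_row,   /* in_w samples   */
                                  const float *w_row,    /* K weights      */
                                  float *psum,           /* out_w partials */
                                  size_t out_w)
{
    for (size_t x = 0; x < out_w; ++x) {
        float acc = psum[x];                  /* partial sum reuse */
        for (int k = 0; k < K; ++k)
            acc += in_row[x + k] * w_row[k];  /* input + weight reuse */
        psum[x] = acc;
    }
}

/* One output row = sum over input channels; psum is written back to the
 * output only once, after the last channel has been accumulated. */
static void conv_output_row(const float *in,  /* [channels][in_w]     */
                            const float *w,   /* [channels][K]        */
                            float *out_row,   /* out_w final outputs  */
                            float *psum,      /* out_w local partials */
                            size_t channels, size_t in_w, size_t out_w)
{
    for (size_t x = 0; x < out_w; ++x)
        psum[x] = 0.0f;
    for (size_t c = 0; c < channels; ++c)
        conv1d_row_accumulate(in + c * in_w, w + c * K, psum, out_w);
    for (size_t x = 0; x < out_w; ++x)
        out_row[x] = psum[x];                 /* single write-back */
}

    In the accelerator itself, this pattern corresponds to the 1D pipelined convolution PE array and the on-chip Input/Weight/Output SRAMs described in Chapter 3, rather than to software loops.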

    Convolution is a critical operation in neural networks. This thesis proposes a one-dimensional convolution accelerator called CU, which supports data reuse of input feature maps, partial sums, and kernel weights to greatly reduce memory accesses. To reduce execution time and increase MAC utilization, the CU is not only designed for the most common 3x3 filter size, but also supports oversize filters through row-major order mapping onto the same PE with batched weight reuse, and is optimized for the 1x1 filter size through multi-channel input data aggregation.
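    As a rough illustration of the row-major order mapping for oversize filters described above, the C sketch below decomposes a K x K filter into K one-dimensional row convolutions that are issued to the same PE one after another and accumulated into a single partial-sum row; the names and data layout are assumptions made for illustration, not the CU's actual dataflow or RTL.

/*
 * Conceptual sketch: mapping an oversize (K x K, K > 3) filter onto a
 * one-dimensional PE in row-major order (hypothetical names, not the CU RTL).
 * Each filter row is issued to the same PE in turn; all K row results are
 * accumulated into one partial-sum row before moving to the next output row.
 */
#include <stddef.h>

/* 1D convolution of one input row with one filter row, accumulated into psum. */
static void pe_conv1d(const float *in_row, const float *w_row,
                      float *psum, size_t out_w, size_t k)
{
    for (size_t x = 0; x < out_w; ++x)
        for (size_t i = 0; i < k; ++i)
            psum[x] += in_row[x + i] * w_row[i];
}

/* One K x K output row, built from K row-major passes over the same PE
 * (stride 1, no padding assumed for simplicity). */
static void oversize_filter_output_row(const float *in,   /* [in_h][in_w] */
                                       const float *w,    /* [K][K]       */
                                       float *psum,       /* out_w        */
                                       size_t y,          /* output row   */
                                       size_t in_w, size_t out_w, size_t k)
{
    for (size_t x = 0; x < out_w; ++x)
        psum[x] = 0.0f;
    for (size_t r = 0; r < k; ++r)                /* row-major order   */
        pe_conv1d(in + (y + r) * in_w,            /* r-th input row    */
                  w + r * k,                      /* r-th filter row   */
                  psum, out_w, k);
}

    The 1x1 optimization mentioned above instead works along the channel dimension: inputs from multiple channels are aggregated so the MACs stay busy even though each filter row has only a single tap.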
    To verify the CU design and evaluate its compute performance, this thesis builds an ESL virtual platform that explores the accelerator micro-architecture design through representative neural network models described in the Micro Darknet for Inference (MDFI) C code. With an 814 MHz RISC-V CPU, a 300 MHz CU with 288 MACs is estimated to achieve about 4.87 FPS on our target application, YOLOv3-tiny, with 416x416 input maps. Comparisons with existing designs are also made.
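    To suggest how MDFI might hand work to the CU on the QEMU-based virtual platform, here is a hypothetical C sketch of a driver programming one convolution layer into the accelerator through memory-mapped configuration registers and polling for completion. The base address, register offsets, and field packing are invented for illustration and do not reflect the thesis's actual register map or driver code.

/*
 * Hypothetical sketch of an MMIO driver for the CU on the virtual platform.
 * All addresses, offsets, and field layouts below are illustrative
 * assumptions, not the thesis's actual configuration register map.
 */
#include <stdint.h>

#define CU_BASE         0x40000000u         /* assumed MMIO base address     */
#define CU_REG(off)     (*(volatile uint32_t *)(uintptr_t)(CU_BASE + (off)))

#define CU_REG_CTRL     0x00u               /* bit 0: start                  */
#define CU_REG_STATUS   0x04u               /* bit 0: done                   */
#define CU_REG_IN_ADDR  0x08u               /* input feature map address     */
#define CU_REG_W_ADDR   0x0Cu               /* weight address                */
#define CU_REG_OUT_ADDR 0x10u               /* output feature map address    */
#define CU_REG_SHAPE    0x14u               /* packed layer shape fields     */

static void cu_run_conv_layer(uint32_t in_addr, uint32_t w_addr,
                              uint32_t out_addr, uint32_t shape)
{
    CU_REG(CU_REG_IN_ADDR)  = in_addr;      /* program source/destination    */
    CU_REG(CU_REG_W_ADDR)   = w_addr;
    CU_REG(CU_REG_OUT_ADDR) = out_addr;
    CU_REG(CU_REG_SHAPE)    = shape;

    CU_REG(CU_REG_CTRL) = 1u;               /* start the accelerator         */
    while ((CU_REG(CU_REG_STATUS) & 1u) == 0u)
        ;                                   /* busy-wait until the CU is done */
}

    On the virtual platform, such register accesses would be handled by the CU model attached to QEMU, which is how the hardware design and computation flow can be checked against the MDFI reference results.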

    Table of Contents
    Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1  Introduction
      1.1 Motivation
      1.2 Contributions
      1.3 Thesis Organization
    Chapter 2  Background and Related Work
      2.1 Neural Networks
        2.1.1 Convolutional Neural Networks
      2.2 Micro Darknet for Inference (MDFI)
        2.2.1 Convolution Layer of MDFI
      2.3 DNN Hardware Architectures and Representative Hardware Accelerators
        2.3.1 DNN Hardware Architectures
        2.3.2 Representative Hardware Accelerators
      2.4 Data Reuse Methods
        2.4.1 Input Feature Map Reuse
        2.4.2 Partial Sum Reuse (Output Reuse)
        2.4.3 Weight Reuse (Filter Reuse)
        2.4.4 Layer Reuse
      2.5 Electronic System-Level Design (ESL Design)
      2.6 QEMU
    Chapter 3  Hardware Design and Data Reuse of the One-Dimensional Convolution Accelerator
      3.1 Data Reuse
        3.1.1 IPW Reuse
        3.1.2 Tile Overlap
      3.2 Accelerator Hardware Design
        3.2.1 1D Pipelined Convolution PE Array
        3.2.2 Input SRAM
        3.2.3 Weight SRAM
        3.2.4 Output SRAM
        3.2.5 Input/Weight/Output Buffer
        3.2.6 Configuration Register
        3.2.7 Controller
      3.3 Accelerator Optimizations
        3.3.1 Row-major Order Mapping on the Same PE for Oversize Filters
        3.3.2 Multi-channel Data Aggregation for 1x1 Filter Size
        3.3.3 Ping-pong Input SRAM
        3.3.4 Controller
      3.4 Dataflow Comparison between the One-Dimensional and Two-Dimensional Convolution Accelerators
    Chapter 4  Virtual Platform Experimental Environment
      4.1 ESL Virtual Platform
      4.2 MDFI Modifications and CU Driver
      4.3 Get icount Hardware
    Chapter 5  Experimental Results and Performance Evaluation
    Chapter 6  Conclusion and Future Work
      6.1 Conclusion
      6.2 Future Work
    References

    [1] W. Tseng, "Layer-wise Fixed Point Quantization for Deep Convolutional Neural Networks and Implementation of YOLOv3 Inference Engine," National Cheng Kung University (NCKU), 2019.
    [2] M. Ji, "Optimization of YOLOv3 Inference Engine for Edge Device," National Cheng Kung University (NCKU), 2018.
    [3] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779-788, 2016.
    [4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances in Neural Information Processing Systems (NIPS), pp. 1097-1105, 2012.
    [5] J. Redmon, "Darknet: Open Source Neural Networks in C," 2013. [Online]. Available: http://pjreddie.com/darknet/.
    [6] Y. Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding," in Proceedings of the 22nd ACM International Conference on Multimedia, pp. 675-678, 2014.
    [7] M. Abadi et al., "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems," in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI), pp. 265-283, 2016.
    [8] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," IEEE Journal of Solid-State Circuits (JSSC), vol. 52, no. 1, pp. 127-138, Jan. 2017.
    [9] N. P. Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit," in Proc. International Symposium on Computer Architecture (ISCA), 2017.
    [10] Y. Chen et al., "DaDianNao: A Machine-Learning Supercomputer," in Proc. IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 609-622, 2014.
    [11] Z. Du et al., "ShiDianNao: Shifting Vision Processing Closer to the Sensor," in Proc. International Symposium on Computer Architecture (ISCA), pp. 92-104, 2015.
    [12] S. Zhang et al., "Cambricon-X: An Accelerator for Sparse Neural Networks," in Proc. IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016.
    [13] K. Guo et al., "Angel-Eye: A Complete Design Flow for Mapping CNN onto Embedded FPGA," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2017.
    [14] S. Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network," in Proc. 43rd International Symposium on Computer Architecture (ISCA), pp. 243-254, 2016.
    [15] Kneron, "KL520 AI SoC," 2019. [Online]. Available: https://www.kneron.com/tw/
    [16] M. Alwani, H. Chen, M. Ferdman, and P. Milder, "Fused-Layer CNN Accelerators," in Proc. IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016.
    [17] A. Gerstlauer, C. Haubelt, A. D. Pimentel, T. P. Stefanov, D. D. Gajski, and J. Teich, "Electronic System-Level Synthesis Methodologies," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, no. 10, pp. 1517-1530, 2009.
    [18] G. Schirner, A. Gerstlauer, and R. Domer, "Fast and Accurate Processor Models for Efficient MPSoC Design," ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 15, no. 2, Article 10, Feb. 2010.
    [19] F. Bellard, "QEMU, a Fast and Portable Dynamic Translator," in Proceedings of the USENIX Annual Technical Conference, pp. 41-46, 2005.
    [20] W. He, Z. Huang, Z. Wei, C. Li, and B. Guo, "TF-YOLO: An Improved Incremental Network for Real-Time Object Detection," Applied Sciences, vol. 9, no. 16, Aug. 2019.
    [21] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," in ICLR, 2015.
    [22] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.
    [23] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," 2018. [Online]. Available: https://pjreddie.com/publications/
    [24] C. Liu, "Design of 2D Systolic Array Accelerator for Quantized Convolutional Neural Networks," National Cheng Kung University (NCKU), 2020.
    [25] B. Lin, "An ESL (Electronic System Level) Virtual Platform for Convolution Accelerator Design and Verification," National Cheng Kung University (NCKU), 2019.

    Full text available on campus: 2025-12-04; off campus: 2025-12-04