
Author: Chen, Wei-Cheng (陳韋呈)
Title: On FPGA Design of CNN Accelerators Using Resource Limited Platform
Advisor: Shieh, Ming-Der (謝明得)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2022
Graduation academic year: 110 (ROC calendar)
Language: English
Pages: 57
Keywords: FPGA, CNN accelerator, row-stationary dataflow, design space exploration, resource-limited platform

    Many FPGA-based CNN accelerators have been proposed in the literature because FPGA design offers several advantages over conventional ASIC design, such as reconfigurability, high performance, and a short development cycle. However, FPGA-based platforms impose design constraints such as limited hardware resources and limited off-chip memory bandwidth. To deal with these constraints, most previous works adopted design space exploration to find an optimal solution. However, they explore only the computational capability of the DSP modules and ignore the possibility of using look-up tables (LUTs) to implement arithmetic operations, so part of the design space may be missed. Moreover, when hardware constraints are considered, only DSPs and block RAMs (BRAMs) are discussed, and the role of LUTs is ignored.
    To address the above-mentioned problems, this thesis incorporates LUT-based computing units into the design space exploration. To further increase the efficiency of LUT usage, the multipliers are implemented as truncated multipliers. An improved compensation method for the truncated multiplier and an L1-norm-based kernel selection method are also proposed to reduce the overall inference error. The row-stationary dataflow is selected as the target architecture because of its high bandwidth efficiency. The design space exploration is carried out under the hardware constraints and characteristics of the FPGA, and, based on the exploration results, a CNN accelerator is implemented on the Xilinx PYNQ-Z2 board with the xc7z020 FPGA. The average performance reaches 57.02 GOPS at 125 MHz for the convolutional layers of VGG-16. Experimental results show that the resource efficiency of this implementation is better than that of previous works.
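The abstract does not spell out the thesis's improved compensation method, but the classic form of a truncated multiplier with a correction constant (in the style of Schulte and Swartzlander's scheme) can be sketched as follows. The bit width, truncation depth, and the equal-probability assumption on partial-product bits are illustrative choices, not numbers taken from the thesis:

```python
def truncated_mul(a, b, n=8, k=4):
    """Unsigned n-bit truncated multiply: partial-product bits in the k
    least-significant columns are discarded, so the result is scaled by 2^-k.
    A correction constant compensates the expected value of the dropped bits."""
    acc = 0
    for i in range(n):
        if (b >> i) & 1:
            acc += (a << i) >> k  # drop this partial product's bits below column k

    # Expected value of the discarded bits, assuming each partial-product bit
    # is 1 with probability 1/2: column c (c < k) holds c + 1 such bits.
    comp = round(sum((c + 1) * 0.5 * 2**c for c in range(k)) / 2**k)
    return acc + comp

# Example: compare against the exact product, right-shifted by k.
exact = (255 * 255) >> 4
approx = truncated_mul(255, 255)
print(exact, approx)  # prints 4064 4063
```

On hardware, dropping the low columns removes part of the partial-product reduction tree, which is what makes a LUT-based multiplier cheaper; the thesis's improved compensation and the L1-norm-based kernel selection then work to keep the accumulated error across a convolutional layer small.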

    Table of contents:
      Abstract (Chinese) · Abstract · Acknowledgements · Content · List of Tables · List of Figures
      Chapter 1  Introduction
        1.1 Motivation
        1.2 Related work
        1.3 Thesis organization
      Chapter 2  Background
        2.1 Row-stationary dataflow
          2.1.1 Eyeriss
          2.1.2 CNN shape parameter definition
          2.1.3 Dataflow
          2.1.4 Latency estimation
        2.2 Characteristic and architecture of FPGA
          2.2.1 Look-up table
          2.2.2 Block RAM
          2.2.3 DSP
          2.2.4 Area equivalence
        2.3 Roofline model
        2.4 Truncated multiplier
      Chapter 3  Proposed Work
        3.1 Truncated multiplier with prior probability compensation
          3.1.1 Truncated multiplier with prior probability compensation
          3.1.2 Kernel selection
        3.2 Target accelerator architecture
          3.2.1 Support kernel size restriction and dataflow modification
          3.2.2 Target architecture
        3.3 Details of design space exploration and implementation flow
          3.3.1 On-chip resource selection
          3.3.2 Constraint setup
          3.3.3 Design space exploration
          3.3.4 Overall system implementation flow
      Chapter 4  Experimental Evaluation and Results Comparison
        4.1 Experiment setup
        4.2 Truncated multiplier
          4.2.1 Precision evaluation of prior probability compensation
          4.2.2 Evaluation of kernel selection
          4.2.3 Accuracy evaluation
        4.3 Design space exploration result
        4.4 FPGA implementation
      Chapter 5  Conclusion and Future Work
        5.1 Conclusion
        5.2 Future work
      References
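As a toy illustration of the resource-aware exploration described in the abstract (Section 3.3 of the outline), the sketch below assigns multipliers to DSP slices first and fills the remaining LUT budget with LUT-based multipliers. The xc7z020 totals (220 DSP48E1 slices, 53,200 LUTs) are real device figures; the per-PE LUT costs and the control-logic overhead are invented placeholders, not numbers from the thesis:

```python
def explore(dsp_budget=220, lut_budget=53_200,
            lut_per_dsp_pe=100,    # assumed LUT overhead of a DSP-based PE
            lut_per_lut_pe=600,    # assumed cost of a LUT-based multiplier PE
            control_luts=20_000):  # assumed LUTs reserved for control/datapath
    """Enumerate how many processing elements (PEs) use DSP multipliers and
    how many use LUT multipliers, maximizing the total PE count."""
    best = (0, 0, 0)  # (total_pes, dsp_pes, lut_pes)
    for dsp_pes in range(dsp_budget + 1):
        luts_left = lut_budget - control_luts - dsp_pes * lut_per_dsp_pe
        if luts_left < 0:
            break
        lut_pes = luts_left // lut_per_lut_pe
        total = dsp_pes + lut_pes
        if total > best[0]:
            best = (total, dsp_pes, lut_pes)
    return best

print(explore())  # with these placeholder costs: (238, 220, 18)
```

The thesis's actual exploration is richer than this one-dimensional loop: it also models BRAM capacity, off-chip bandwidth via the roofline model, and the shape of the row-stationary PE array, but the basic idea of counting LUT-based multipliers alongside DSP-based ones is the same.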

    Full-text availability: on campus from 2025-08-09; off campus from 2025-08-09.