
Student: Yeh, Yu-Chen (葉育辰)
Title: Research and Design of Convolutional Neural Network Architecture (捲積神經網路架構之研究與設計)
Advisor: Jou, Jer-Min (周哲民)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2022
Graduation Academic Year: 110
Language: Chinese
Number of Pages: 63
Chinese Keywords: Machine learning, Hardware accelerator, HLS, CNN
English Keywords: CNN, HLS, Machine learning, Hardware accelerator
Views: 192; Downloads: 42
  • This thesis implements a convolutional neural network (CNN) hardware accelerator. Artificial intelligence has become one of today's most active research fields, and the CNN is the main driving force in the development of deep neural networks (DNNs); it nonetheless faces throughput and energy-consumption challenges. Our accelerator design addresses these by exploiting data reuse in the convolution operation, performing design space exploration, and pipelining across hardware modules.
    The CNN is a machine learning model developed on the foundation of the traditional neural network. Because a CNN involves a very large number of multiply-accumulate operations, many different data-reuse schemes have been proposed to speed it up, and the hardware for each reuse scheme traditionally requires lengthy analysis and repeated redesign. We therefore use a high-level synthesis (HLS) tool to translate a C implementation into a hardware description language quickly, rapidly generating a usable CNN hardware accelerator.
    For the accelerator design itself, we substitute parameters from real CNNs to analyze the candidate sizes of each local buffer. After enumerating the combinations, we postulate and derive design formulas from the number of memory accesses, the number of compute-unit executions, and the overall execution time; substituting an objective function into these formulas yields the optimal local-buffer size. On top of that buffer, we analyze each module's processing time, find the best pipeline initiation interval (II) for each module, and implement hardware pipelining within the modules. At the architecture level, we adopt a double-buffer design together with the HLS-generated hardware architecture to optimize the overall dataflow, finally realizing a task-level dataflow pipeline.
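The HLS flow described above starts from plain C source. As an illustrative sketch only (not the thesis's actual code, and with hypothetical tile and kernel sizes), a single-channel convolution tile with a Vivado HLS pipeline directive might look like the following; a plain C compiler simply ignores the pragma.

```c
#define K 3          /* hypothetical kernel size */
#define IN_H 8       /* hypothetical input-tile height */
#define IN_W 8       /* hypothetical input-tile width */
#define OUT_H (IN_H - K + 1)
#define OUT_W (IN_W - K + 1)

/* Convolve one single-channel tile. In an HLS flow the pragma asks
 * the tool to pipeline the output loop with an initiation interval
 * (II) of 1; outside HLS the pragma has no effect. */
void conv_tile(float in[IN_H][IN_W],
               float w[K][K],
               float out[OUT_H][OUT_W])
{
    for (int oy = 0; oy < OUT_H; oy++) {
        for (int ox = 0; ox < OUT_W; ox++) {
#pragma HLS PIPELINE II=1
            float acc = 0.0f;
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    acc += in[oy + ky][ox + kx] * w[ky][kx];
            /* output-stationary: acc stays local across the window */
            out[oy][ox] = acc;
        }
    }
}
```

With the multiply-accumulate loops fully unrolled by the tool, one output pixel can be produced per cycle once the pipeline fills.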

    This thesis implements a convolutional neural network (CNN) hardware accelerator design. Artificial intelligence has become a popular research field, and the convolutional neural network is the main driving force in the deep neural network (DNN) field, but it faces throughput and energy-consumption challenges. Our hardware accelerator design addresses these by exploiting the reusability of convolution operations, design space exploration, and pipelining between hardware modules.
    Building on the traditional neural network, the convolutional neural network is one of the machine learning models developed from it. Because a CNN contains a large number of multiply-accumulate operations, many different reuse schemes exist to speed up the computation, and the hardware design for each reuse scheme must be analyzed and re-analyzed at length. Therefore, a high-level synthesis (HLS) tool is used to quickly convert a C implementation into a hardware description language, rapidly generating a usable CNN hardware accelerator.
    In designing the accelerator, we analyze the possible sizes of each local buffer by substituting parameters from actual CNNs. After enumerating the combinations, design formulas are postulated and derived from the number of memory accesses, the number of compute-unit executions, and the overall execution time; substituting the objective into these formulas yields the optimal local-buffer size. Based on that buffer, the processing time of each module is analyzed to find the best pipeline initiation interval (II), and hardware pipelining is implemented within each module. For the overall architecture, a double-buffer design is used together with the HLS-generated hardware architecture to optimize the overall dataflow, finally achieving a task-level dataflow pipeline design.
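The buffer-sizing exploration described in the abstract can be sketched as a brute-force search: enumerate candidate output-tile sizes, discard those whose buffers exceed on-chip capacity, model the off-chip access count for the rest, and keep the minimum. The layer dimensions, buffer capacity, and cost model below are generic assumptions for illustration, not the thesis's derived design formulas.

```c
#include <limits.h>

/* Hypothetical layer: H x W output plane, C input channels, KSZ x KSZ kernel. */
#define H 32
#define W 32
#define C 16
#define KSZ 3
#define BUF_CAP 2048  /* hypothetical on-chip buffer capacity (words) */

/* Crude cost model: with a t x t output tile, the input halo region
 * of every channel is streamed once per tile, so off-chip reads grow
 * with the number of tiles; each output is written exactly once. */
static long cost_for_tile(int t)
{
    long tiles     = (long)((H + t - 1) / t) * ((W + t - 1) / t);
    long in_reads  = tiles * (long)(t + KSZ - 1) * (t + KSZ - 1) * C;
    long w_reads   = tiles * (long)KSZ * KSZ * C;
    long out_wr    = (long)H * W;
    return in_reads + w_reads + out_wr;
}

/* Exhaustively pick the tile size whose local buffers fit in BUF_CAP
 * and whose modeled access count is minimal. */
int best_tile(void)
{
    int best = -1;
    long best_cost = LONG_MAX;
    for (int t = 1; t <= H; t++) {
        long buf_words = (long)(t + KSZ - 1) * (t + KSZ - 1)  /* input tile  */
                       + (long)t * t;                          /* output tile */
        if (buf_words > BUF_CAP) continue;                     /* infeasible  */
        long c = cost_for_tile(t);
        if (c < best_cost) { best_cost = c; best = t; }
    }
    return best;
}
```

Under this toy model the search settles on a tile that evenly divides the output plane while staying under the capacity limit, which is the same trade-off the design formulas in Chapter 4 capture analytically.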
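The task-level double-buffer scheme can be modeled in software as a ping-pong loop: while the compute stage consumes one buffer, the load stage fills the other, so in hardware the two stages overlap. The sketch below is a sequential software model with assumed tile sizes and a stand-in compute kernel; in an HLS design, a dataflow directive is what actually lets the tasks run concurrently.

```c
#include <string.h>

#define TILE   4
#define NTILES 8

/* Fill one ping-pong buffer with tile t of the source stream. */
static void load_tile(const int *src, int t, int buf[TILE])
{
    memcpy(buf, src + t * TILE, TILE * sizeof(int));
}

/* Stand-in for the convolution kernel: reduce the tile to one value. */
static int compute_tile(const int buf[TILE])
{
    int s = 0;
    for (int i = 0; i < TILE; i++) s += buf[i];
    return s;
}

int run_double_buffered(const int *src, int out[NTILES])
{
    int buf[2][TILE];
    int ping = 0;
    load_tile(src, 0, buf[ping]);                 /* prologue: fill buffer 0 */
    for (int t = 0; t < NTILES; t++) {
        if (t + 1 < NTILES)
            load_tile(src, t + 1, buf[1 - ping]); /* prefetch the next tile  */
        out[t] = compute_tile(buf[ping]);         /* consume the current one */
        ping = 1 - ping;                          /* swap buffer roles       */
    }
    return 0;
}
```

Because the prefetch of tile t+1 never touches the buffer being computed on, the two stages have no data hazard, which is exactly what permits the task-level overlap.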

    Abstract I
    Extended Abstract (in English) II
      SUMMARY II
      OUR PROPOSED DESIGN II
      EXPERIMENTS IV
    Acknowledgements VI
    Table of Contents VII
    List of Tables IX
    List of Figures X
    Chapter 1 Introduction 1
      1.1 Research Background 1
      1.2 Motivation and Objectives 2
      1.3 Thesis Organization 3
    Chapter 2 Background and Related Work 4
      2.1 Machine Learning 4
        2.1.1 Supervised Learning 5
        2.1.2 Unsupervised Learning 6
        2.1.3 Reinforcement Learning (RL) 7
        2.1.4 Neural Networks (NN) 8
      2.2 Convolutional Neural Network (CNN) Architecture 10
        2.2.1 Convolutional Layer (CONV) 11
        2.2.2 Activation Function 12
        2.2.3 Pooling Layer 12
        2.2.4 Fully Connected Layer 13
      2.3 High-Level Synthesis (HLS) Tools 14
        2.3.1 HLS Design Flow 14
        2.3.2 HLS Operation and Synthesis Behavior 15
        2.3.3 Common HLS Optimization Directives 16
        2.3.4 HLS Design Space Discussion 18
    Chapter 3 Analysis of the CNN Convolutional-Layer Algorithm 19
      3.1 Data-Sharing Characteristics of the Convolutional Layer 19
      3.2 CNN Hardware System Architecture 20
      3.3 Convolutional-Layer Output Reuse 21
        3.3.1 Output-Reuse Diagram 21
        3.3.2 PE Architecture for Output Reuse 22
      3.4 Convolutional-Layer Weight Reuse 23
        3.4.1 Weight-Reuse Diagram 24
        3.4.2 PE Architecture for Weight Reuse 25
      3.5 Convolutional-Layer Input Reuse 26
        3.5.1 Input-Reuse Diagram 27
        3.5.2 PE Architecture for Input Reuse 28
      3.6 Discussion of Data Reuse in CNN Algorithms 29
    Chapter 4 CNN Design Space Exploration 30
      4.1 C Code Design 30
        4.1.1 Data-Read Loop Design 31
        4.1.2 Data-Compute Loop Design 32
        4.1.3 Data-Accumulate Loop Design 32
        4.1.4 Data-Output Loop Design 33
      4.2 Local Buffer Design Space 33
        4.2.1 Local Buffer Optimization 35
      4.3 Task-Level Dataflow Optimization 37
        4.3.1 Task-Level Dataflow Pipelining 37
        4.3.2 Task-Level Dataflow Double-Buffer Design 38
        4.3.3 Optimized Pipelined Dataflow Design 39
    Chapter 5 CNN Hardware Design 43
      5.1 Partitioned Local-Buffer Data Reads 43
      5.2 Hardware Dataflow Scheduling Analysis of the Read Stage 45
      5.3 Hardware Dataflow Scheduling Analysis of the Compute Stage 49
      5.4 Hardware Dataflow Scheduling Analysis of the Accumulate Stage 50
      5.5 Hardware Dataflow Scheduling Analysis of the Output Stage 51
      5.6 Overall Architecture State Diagram and Dataflow Scheduling Analysis 51
    Chapter 6 Experimental Environment and Data Analysis 54
      6.1 Experimental Environment and Input Configuration 54
      6.2 Experimental Method 55
      6.3 Experimental Results 56
        6.3.1 Execution-Time Comparison of the Synthesized Hardware 56
        6.3.2 Hardware-Resource Comparison of the Synthesized Hardware 57
        6.3.3 Trade-off Between Hardware Count and Execution Time 57
        6.3.4 Optimization Comparison of Overall Execution Time 58
    Chapter 7 Conclusion and Future Work 60
    References 61


    Full-text availability: On campus: open access; Off campus: open access