| Graduate Student: | Yeh, Yu-Chen (葉育辰) |
|---|---|
| Thesis Title: | Research and Design of Convolutional Neural Network Architecture (捲積神經網路架構之研究與設計) |
| Advisor: | Jou, Jer-Min (周哲民) |
| Degree: | Master |
| Department: | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2022 |
| Graduating Academic Year: | 110 |
| Language: | Chinese |
| Pages: | 63 |
| Keywords: | Machine learning, Hardware accelerator, HLS, CNN |
| Access Counts: | Views: 192, Downloads: 42 |
This thesis presents the design of a convolutional neural network (CNN) hardware accelerator. Artificial intelligence has become one of the most active research fields of our time, and CNNs are the driving force behind the development of deep neural networks (DNNs), yet they face challenges in throughput and energy consumption. Our accelerator design addresses these challenges by exploiting the data reuse inherent in convolution, performing design space exploration, and pipelining across hardware modules.
A CNN is a machine-learning model that evolved from the traditional artificial neural network. Because CNN inference involves an enormous number of multiply-accumulate operations, many different data-reuse schemes have been devised to accelerate it, and the hardware realizing each scheme traditionally demands lengthy analysis and re-analysis. We therefore use a high-level synthesis (HLS) tool to translate a C implementation into a hardware description language, so that a usable CNN hardware accelerator can be implemented and generated quickly.
For the accelerator itself, we substitute the parameters of an actual CNN to enumerate the candidate sizes of each local buffer. Over these combinations we postulate and derive design formulas in terms of the number of memory accesses, the number of compute-unit executions, and the overall execution time; substituting the objective function into the design formulas yields the optimal local-buffer size. With that buffer size fixed, we analyze the processing time of each module, determine the best pipeline initiation interval (II) for each, and implement hardware pipelining within the modules. At the architecture level, we adopt a double-buffer design and generate the hardware with the HLS tool, optimizing the overall dataflow and finally realizing a task-level dataflow pipeline.