
Graduate Student: Huang, Chung-Chin (黃崇晉)
Thesis Title: Design of a Reconfigurable CNN Accelerator (可重構捲積神經網路加速器之設計)
Advisor: Jou, Jer-Min (周哲民)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2022
Academic Year of Graduation: 110 (ROC calendar)
Language: Chinese
Number of Pages: 63
Chinese Keywords: Deep Neural Network (深度神經網路), Hierarchical Control Unit (階層式控制單元), Distributed Control Unit (分散式控制單元), Reconfigurable (可重構)
English Keywords: Deep Neural Network, Hierarchical Control Unit, Distributed Control Unit, Reconfigurable Accelerator
Usage: Views: 111; Downloads: 0
    The increasing scale at which deep neural networks (DNNs) are deployed, together with strict latency, throughput, and energy constraints, makes the design of DNN hardware accelerators increasingly complex. DNN layers such as CONV2D, fully-connected (FC), and LSTM layers involve a large number of operation functions; among them, the convolutional neural network (CNN) is one of the most widely used, and convolutional-layer operations account for more than 90% of a CNN's overall computation. A modern CNN accelerator must orchestrate multi-dimensional parallelism across hundreds of processing elements (PEs), so its design inevitably faces congestion in the data flow, control flow, and address flow, as well as poor hardware scalability. Providing a comprehensive execution strategy for the layer-wise parallelism, data locality, and data-reuse behavior of neural network workloads, so as to optimize hardware utilization, throughput, and performance, is therefore a major challenge for today's CNN accelerator designs.
    In view of this, we propose the design of a reconfigurable convolutional neural network accelerator. A reconfiguration mechanism handles the mapping strategies of different data flows efficiently, and a hierarchical control mechanism manages the massive, multi-dimensional, parameter-interleaved computations of a neural network. It supports three control modes: (a) a coarse-grained operation control mode; (b) a medium-grained operation control mode; and (c) a fine-grained operation control mode. Combined with a distributed control mechanism, each group of processing units is driven by its own independent controller, which issues the operation tasks that the group must execute. For different deep neural network workloads, the reconfigurable architecture executes a multi-layer, multi-dimensional parallel mapping control strategy, so that the accelerator achieves optimized execution with only a small reconfiguration-control cost.
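    The claim that convolutional layers dominate a CNN's computation can be checked with a simple operation count. The following sketch (not from the thesis; it only assumes the standard published VGG-16 layer shapes for a 224x224 input, and the helper names are illustrative) counts multiply-accumulate (MAC) operations per layer type and reports the convolutional share, which comes out well above 90%.

```python
# Back-of-the-envelope check: share of MAC operations done by convolutional
# layers in a VGG-16-style network (standard published layer shapes assumed).

def conv_macs(out_h, out_w, out_c, in_c, k):
    # One output pixel needs in_c * k * k MACs; multiply by the output volume.
    return out_h * out_w * out_c * in_c * k * k

def fc_macs(in_features, out_features):
    return in_features * out_features

# (output_h, output_w, output_channels, input_channels, kernel_size)
conv_layers = [
    (224, 224,  64,   3, 3), (224, 224,  64,  64, 3),
    (112, 112, 128,  64, 3), (112, 112, 128, 128, 3),
    ( 56,  56, 256, 128, 3), ( 56,  56, 256, 256, 3), ( 56,  56, 256, 256, 3),
    ( 28,  28, 512, 256, 3), ( 28,  28, 512, 512, 3), ( 28,  28, 512, 512, 3),
    ( 14,  14, 512, 512, 3), ( 14,  14, 512, 512, 3), ( 14,  14, 512, 512, 3),
]
fc_layers = [(25088, 4096), (4096, 4096), (4096, 1000)]

conv_total = sum(conv_macs(*layer) for layer in conv_layers)
fc_total = sum(fc_macs(*layer) for layer in fc_layers)
share = conv_total / (conv_total + fc_total)
print(f"conv MACs: {conv_total:.3e}, fc MACs: {fc_total:.3e}, conv share: {share:.1%}")
```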

    The increasing scale of Deep Neural Network (DNN) deployment, together with strict latency, throughput, and energy constraints, makes the design of DNN hardware accelerators more and more complex. DNN layers such as CONV2D, Fully-Connected (FC), and LSTM layers involve many operation functions; among these networks, the Convolutional Neural Network (CNN) is the most widely used, and convolutional-layer operations account for more than 90% of a CNN's overall computation. A CNN accelerator must control the multi-dimensional parallelism of hundreds of processing elements (PEs), so the design of CNN hardware accelerators inevitably faces data-flow, control-flow, and address-flow congestion as well as poor hardware scalability. Providing a complete large-scale execution strategy for the hierarchical parallelism, data locality, and data reuse of neural network workloads, in order to optimize hardware utilization, throughput, and performance, is therefore a major challenge in today's CNN accelerator design.
    In view of this, we propose the design of a reconfigurable convolutional neural network accelerator, which handles the mapping strategies of different data flows effectively through a reconfigurable mechanism and uses a hierarchical control mechanism to manage the multi-dimensional, parameter-interleaved, massive operation control of neural networks. It supports three control modes: (a) a coarse-grained operation control mode; (b) a medium-grained operation control mode; and (c) a fine-grained operation control mode. Combined with a distributed control mechanism, each group of operation units is controlled by its own independent controller, which issues the operation tasks the group needs to perform. For different deep neural network workloads, the reconfigurable architecture executes a multi-layer, multi-dimensional parallel mapping control strategy, achieving optimized execution of the reconfigurable CNN accelerator at only a small reconfiguration-control cost.

    Abstract (Chinese)
    SUMMARY
    OUR PROPOSED DESIGN
    EXPERIMENTS
    CONCLUSION
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1  Introduction
      1.1 Research Background
      1.2 Research Motivation and Objectives
      1.3 Thesis Organization
    Chapter 2  Background and Related Work
      2.1 Deep Neural Networks (DNN)
      2.2 Reconfigurable Computing Architectures
      2.3 Overview of Controller Design
    Chapter 3  Analysis of Neural Network Data Flows and Data-Sharing Schemes
      3.1 Loop-Optimization Control Design Strategies for Neural Networks
      3.2 Design-Space Exploration of General-Purpose Convolutional Neural Networks
      3.3 Analysis of Data Flows and Sharing Schemes for Hierarchical Deep Neural Networks
    Chapter 4  Design of the Reconfigurable Neural Network Hardware Architecture
      4.1 Overview and Challenges of the Reconfigurable CNN Hardware Architecture
      4.2 Design of the Reconfigurable CNN Computation Architecture
      4.3 Design of the General-Purpose Hierarchical and Distributed Control Architecture
    Chapter 5  Experimental Results and Discussion
      5.1 Development Platform
      5.2 Building the LeNet-5 and VGG16 Network Architectures with Python and MATLAB
      5.3 Analyzing the Verilog-Based Reconfigurable Neural Network Hardware Design with ModelSim
    Chapter 6  Conclusion and Future Work
    References

    Full text not available for download.
    On campus: open access from 2027-08-09
    Off campus: open access from 2027-08-09
    The electronic thesis has not yet been authorized for public release; please consult the library catalog for the printed copy.