| Graduate Student: | 黃崇晉 Huang, Chung-Chin |
|---|---|
| Thesis Title: | 可重構捲積神經網路加速器之設計 Design of a Reconfigurable CNN Accelerator |
| Advisor: | 周哲民 Jou, Jer-Min |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of Publication: | 2022 |
| Academic Year of Graduation: | 110 |
| Language: | Chinese |
| Pages: | 63 |
| Keywords (Chinese): | 深度神經網路、階層式控制單元、分散式控制單元、可重構 |
| Keywords (English): | Deep Neural Network, Hierarchical Control Unit, Distributed Control Unit, Reconfigurable Accelerator |
The increasing scale of deep neural network (DNN) deployments, combined with strict constraints on latency, throughput, and energy, makes DNN hardware accelerators ever more complex to design. Such accelerators must support the many operations found in DNN layers such as CONV2D, fully-connected (FC), and LSTM layers; among these networks, the convolutional neural network (CNN) is one of the most widely used, and its convolution layers account for more than 90% of the network's total computation. A modern CNN accelerator realizes this computation by controlling the multi-dimensional parallelism of hundreds of processing elements (PEs), so its design inevitably confronts congestion in the data flow, control flow, and address flow, together with poor hardware scalability. The major challenge for today's CNN accelerator design is therefore to devise a complete execution strategy for the hierarchical parallelism, data locality, and data reuse found in neural network workloads, so as to optimize hardware utilization, throughput, and performance.
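To make the parallelism and congestion problem concrete, the following is a minimal Python sketch, not taken from the thesis, of the seven-deep CONV2D loop nest (stride 1, no padding); the dimension names N, M, C, H, W, R, S are the conventional ones from the accelerator literature rather than this thesis's notation. Every one of these loop dimensions is a candidate for parallel mapping onto PE groups, which is why the choice of mapping, data reuse, and control strategy dominates accelerator efficiency.

```python
import numpy as np

def conv2d_loop_nest(ifmap, weights):
    """Naive CONV2D as the seven-deep loop nest an accelerator
    must parallelize across its PE array."""
    N, C, H, W = ifmap.shape      # batch, input channels, rows, cols
    M, _, R, S = weights.shape    # output channels, kernel rows/cols
    out = np.zeros((N, M, H - R + 1, W - S + 1))
    for n in range(N):                        # batch
        for m in range(M):                    # output channels
            for y in range(H - R + 1):        # output rows
                for x in range(W - S + 1):    # output cols
                    for c in range(C):        # input channels
                        for r in range(R):    # kernel rows
                            for s in range(S):  # kernel cols
                                out[n, m, y, x] += (
                                    ifmap[n, c, y + r, x + s]
                                    * weights[m, c, r, s]
                                )
    return out
```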
In view of this, we propose the design of a reconfigurable convolutional neural network accelerator. A reconfigurable mechanism handles the mapping strategies of different data flows, while a hierarchical control mechanism manages the massive, multi-dimensional, parameter-interleaved computation of the network and supports three control modes: (a) a coarse-grained operation control mode; (b) a medium-grained operation control mode; and (c) a fine-grained operation control mode. Combined with a distributed control mechanism, each group of operation units is driven by its own independent controller, which issues the operation tasks that group must execute. For different deep neural network workloads, the reconfigurable architecture carries out a multi-layer, multi-dimensional parallel mapping control strategy and achieves optimized execution of the accelerator at only a small reconfiguration cost.
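The abstract names these mechanisms but does not give their hardware interfaces, so the following Python sketch is purely illustrative; all class, field, and method names are hypothetical assumptions. It shows the shape of the scheme described above: a hierarchical top-level controller configured with one of the three granularity modes that hands tile-level tasks to independent per-group controllers, so that no single controller has to sequence the full multi-dimensional loop nest.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

class ControlMode(Enum):
    COARSE = auto()   # (a) one command drives a whole layer tile
    MEDIUM = auto()   # (b) commands issued per PE group
    FINE = auto()     # (c) per-PE micro-operation control

@dataclass
class Task:
    layer: str        # e.g. "CONV2D", "FC", "LSTM"
    tile: tuple       # loop bounds assigned to one PE group

@dataclass
class GroupController:
    """Distributed control: each PE group has its own controller
    and executes its queued tasks independently."""
    group_id: int
    queue: list = field(default_factory=list)

    def dispatch(self, task: Task) -> None:
        self.queue.append(task)

class TopController:
    """Hierarchical control: the top level only selects the
    granularity mode and distributes tiles to group controllers."""
    def __init__(self, n_groups: int, mode: ControlMode):
        self.mode = mode
        self.groups = [GroupController(g) for g in range(n_groups)]

    def map_layer(self, tasks: list) -> None:
        # Round-robin placement stands in for the thesis's mapping strategy.
        for i, t in enumerate(tasks):
            self.groups[i % len(self.groups)].dispatch(t)

if __name__ == "__main__":
    top = TopController(n_groups=4, mode=ControlMode.MEDIUM)
    top.map_layer([Task("CONV2D", (0, 64, 0, 56)) for _ in range(8)])
```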
Campus access: full text available from 2027-08-09.