| Graduate Student: | Yu, Chih-Wei (余誌偉) |
|---|---|
| Thesis Title: | The Design and Implementation of a Heterogeneous Multi-core Embedded Architecture for Deep Neural Network Inference |
| Advisor: | Hou, Ting-Wei (侯廷偉) |
| Degree: | Master |
| Department: | Department of Engineering Science, College of Engineering |
| Year of Publication: | 2019 |
| Academic Year: | 108 |
| Language: | English |
| Pages: | 66 |
| Chinese Keywords: | 神經網路加速, 異質多核心, 嵌入式系統 |
| English Keywords: | Neural acceleration, Heterogeneous multi-core, Embedded system |
To improve the performance of deep neural network (DNN) inference on embedded devices, the proposed design adds an NCU (Neural Computing Unit) soft co-processor to compensate for the limitations of a general-purpose embedded processor; the resulting heterogeneous multi-core architecture has been implemented on a development board.
Unlike previously proposed DNN hardware acceleration schemes, the NCU operates on an instruction basis, which makes it more flexible in handling DNN models of different structures, and it can run the main inference procedure independently without control by the Hard Processor System (HPS). The NCU runtime library generates the corresponding instructions, and the NCU model converter converts Keras pre-trained model files into NCU model files. The HPS runs embedded Linux (Angstrom), and a compatible driver has been implemented to handle hardware-dependent operations.
Performance is evaluated by the execution time of model inference: twelve DNN models are pre-trained on a server, and inference is run on the selected embedded platforms. The platforms used for comparison include the Raspberry Pi 3 Model B+ and the NVIDIA Jetson TX2 Developer Kit; compared with the TX2, the implemented hardware achieves a speedup of 1.5 to 8.7 times.
The main target of the proposed design is to increase the performance of DNN inference on embedded devices by adding a soft co-processor, Neural Computing Unit (NCU), to a general-purpose embedded processor. The heterogeneous multi-core platform has been implemented on a development kit.
The design in this article differs from prior efforts in its instruction-based co-processor. The NCU directly executes the issued instructions, which makes inference on differently structured models more flexible. Furthermore, the design allows the inference procedure to be performed mainly by the NCU without control by the Hard Processor System (HPS). The instructions are generated by the NCU runtime library, and the NCU converter translates Keras pre-trained model files into NCU model files. Embedded Linux (Angstrom) runs on the HPS, and the NCU driver has been implemented to handle all hardware-dependent operations.
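The thesis does not reproduce the converter's source here. As a rough illustration of the kind of transformation such a converter performs — flattening a layer-by-layer model description into records an instruction-based co-processor can iterate over without host intervention — the following sketch uses a simplified in-memory layer list. All names (`LayerSpec`, `convert_to_ncu_model`, the opcode table) are hypothetical and are not taken from the actual NCU toolchain; the real converter reads saved Keras model files.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a Keras layer description; the real converter
# would extract these fields from a saved Keras model file instead.
@dataclass
class LayerSpec:
    kind: str       # e.g. "dense", "relu", "softmax"
    in_size: int
    out_size: int

# Hypothetical opcode table mapping layer kinds to NCU operation codes.
OPCODES = {"dense": 0x01, "relu": 0x02, "softmax": 0x03}

def convert_to_ncu_model(layers):
    """Flatten a layer list into one flat record per layer, in the order
    an instruction-driven co-processor would execute them."""
    model = []
    for i, layer in enumerate(layers):
        if layer.kind not in OPCODES:
            raise ValueError(f"unsupported layer kind: {layer.kind}")
        model.append({
            "index": i,
            "opcode": OPCODES[layer.kind],
            "in_size": layer.in_size,
            "out_size": layer.out_size,
        })
    return model

# A small MLP-like model description used only for illustration.
mlp = [
    LayerSpec("dense", 784, 128),
    LayerSpec("relu", 128, 128),
    LayerSpec("dense", 128, 10),
    LayerSpec("softmax", 10, 10),
]
ncu_model = convert_to_ncu_model(mlp)
print(len(ncu_model), hex(ncu_model[0]["opcode"]))  # 4 records; first opcode 0x1
```

The point of the flat record format is that the co-processor only needs a program counter over the records, which is what lets the inference loop run without HPS involvement.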
To evaluate the performance of the proposed platform, twelve DNN models are pre-trained with Keras on a server and deployed onto the selected embedded platforms to perform inference. The benchmark metric is the execution time of the inference. For comparison, the Raspberry Pi 3 Model B+ and the NVIDIA Jetson TX2 Developer Kit are used in the evaluation. The implemented hardware performs DNN model inference efficiently, with a speedup of 1.5 to 8.7 times compared with the TX2.
Campus access: publicly available from 2024-01-01.