| Graduate Student: | 紀旻志 Ji, Min-Zhi |
|---|---|
| Thesis Title: | 優化 YOLOv3 推論引擎並實現於終端裝置 Optimization of YOLOv3 Inference Engine for Edge Device |
| Advisor: | 陳中和 Chen, Chung-Ho |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of Publication: | 2019 |
| Graduation Academic Year: | 107 |
| Language: | Chinese |
| Pages: | 58 |
| Keywords (Chinese): | 終端裝置、神經網路框架、記憶體配置管理 |
| Keywords (English): | Embedded system, neural network framework, heap memory allocation |
Deep neural networks have flourished in recent years, producing many strong network models, and to deploy them the industry compresses and quantizes models and designs hardware accelerators so that they can run on edge devices. This thesis observes that common neural network frameworks demand large amounts of dynamic memory during inference: for example, running AlexNet image-classification inference on the Caffe framework peaks at 931 MB of dynamic memory allocation. An allocation of this size is unsuitable for embedded and mobile devices. We therefore modify the inference flow of the YOLOv3 Inference Engine developed in our laboratory to reduce dynamic memory allocation, producing MDFI (Micro Darknet for Inference).
Why do common neural network frameworks allocate so much dynamic memory? Because they build or restore the complete network model and allocate all the memory it needs at initialization, before inference begins. This thesis instead moves allocation into each layer's computation stage: each layer allocates only the space it needs and loads its own parameters, then releases its working memory as soon as its computation finishes, achieving layer-wise memory management. Modern models, however, are no longer simple deepened stacks; they add residual connections to improve training, and residual connections create a layer-dependency problem for layer-wise memory management. We therefore build a per-layer dependency counter during an analysis stage to record how many later layers depend on each output and to decide when that output's memory may be released. In total, layer-wise memory management reduces the maximum dynamic memory allocation of the YOLOv3 model by 92.0% compared with the original Darknet framework. On a Raspberry Pi 3 edge device, inference on one 416 × 416 image takes 14.53 s with Darknet versus 13.93 s with MDFI, and AlexNet image classification accelerates from 12.35 s to 5.341 s.
The original MDFI supported only the YOLOv3 object-detection model. To broaden MDFI's range of applications, this thesis adds image classification and additional neural network layer types, increasing the number of supported layer types from 6 to 11.
Finally, this thesis adds an OpenCL heterogeneous-computing flow to MDFI, offloading the matrix multiplication in the convolutional layers to an OpenCL device using a naive OpenCL SGEMM dispatch. Convolutional-layer computation that takes 7.4 s on a CPU (Intel i7-4770 @ 3.4 GHz) takes only 1.4 s on a GPU (NVIDIA GTX 1080 Ti) with the OpenCL flow.
For neural networks on low-end edge devices, several approaches exist, such as model compression, model quantization, and hardware accelerator design. However, the number of parameters in current NN (neural network) models keeps increasing, and current NN frameworks typically initialize the entire NN model at startup, so the memory requirement is very large. To reduce the memory requirement, we propose layer-wise memory management based on Darknet. NN models may have complex network structures with residual connections or routing connections for better training results, so we also propose a layer-dependency counter mechanism. We name the modified framework MDFI (Micro Darknet for Inference). According to our experimental results, the average memory consumption of MDFI is reduced by 76% compared to Darknet, and the average processing time of MDFI is reduced by 8%.