| Graduate Student: | 紀旻志 Ji, Min-Zhi |
|---|---|
| Thesis Title: | 優化 YOLOv3 推論引擎並實現於終端裝置 Optimization of YOLOv3 Inference Engine for Edge Device |
| Advisor: | 陳中和 Chen, Chung-Ho |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of Publication: | 2019 |
| Graduation Academic Year: | 107 |
| Language: | Chinese |
| Pages: | 58 |
| Keywords (Chinese): | 終端裝置、神經網路框架、記憶體配置管理 |
| Keywords (English): | Embedded system, neural network framework, heap memory allocation |
Deep neural networks have flourished in recent years, producing many strong network models, and to deploy them the industry compresses and quantizes models and designs hardware accelerators so that they can run on edge devices. This thesis observes that common neural network frameworks demand large amounts of dynamic memory during inference: for example, running AlexNet image-classification inference on the Caffe framework peaks at 931 MB of dynamic memory allocation. An allocation of this size is unsuitable for embedded and mobile devices. We therefore modify the inference flow of the YOLOv3 Inference Engine developed in our laboratory to reduce dynamic memory allocation, producing MDFI (Micro Darknet for Inference).
Why do common neural network frameworks allocate so much dynamic memory? Because they build or restore the complete network model and allocate all the memory it needs at initialization, before inference begins. This thesis instead moves allocation into each layer's computation stage: each layer allocates only the space it needs and loads its own parameters, then releases its working memory as soon as its computation finishes, achieving layer-wise memory management. Modern models, however, are no longer simple deepened stacks; they add residual connections to improve training, and residual connections create a layer-dependency problem for layer-wise memory management. We therefore build a per-layer dependency counter during an analysis stage to record how many later layers depend on each output and to decide when that output's memory may be released. In total, layer-wise memory management reduces the maximum dynamic memory allocation of the YOLOv3 model by 92.0% compared with the original Darknet framework. On a Raspberry Pi 3 edge device, inference on one 416 × 416 image takes 14.53 s with Darknet versus 13.93 s with MDFI, and AlexNet image classification accelerates from 12.35 s to 5.341 s.
The original MDFI supported only the YOLOv3 object-detection model. To broaden MDFI's range of applications, this thesis adds image classification and additional neural network layer types, increasing the number of supported layer types from 6 to 11.
Finally, this thesis adds an OpenCL heterogeneous-computing flow to MDFI, offloading the matrix multiplication in the convolutional layers to an OpenCL device using a naive OpenCL SGEMM dispatch. Convolutional-layer computation that takes 7.4 s on a CPU (Intel i7-4770 @ 3.4 GHz) takes only 1.4 s on a GPU (NVIDIA GTX 1080 Ti) with the OpenCL flow.
For neural networks on low-end edge devices, several approaches exist, such as model compression, model quantization, and hardware accelerator design. However, the number of parameters in current NN (neural network) models keeps increasing, and current NN frameworks typically initialize the entire NN model at startup, so the memory requirement is very large. To reduce the memory requirement, we propose layer-wise memory management based on Darknet. NN models may have complex network structures with residual connections or routing connections for better training results, so we also propose a layer-dependency counter mechanism. We name the modified framework MDFI (Micro Darknet for Inference). According to our experimental results, the average memory consumption of MDFI is reduced by 76% compared to Darknet, and the average processing time of MDFI is reduced by 8%.