| Graduate Student: | 蕭丞志 Hsiao, Cheng-Chih |
|---|---|
| Thesis Title: | 基於卷積神經網路推論引擎建立支援參數量化方法之硬體加速器 (Quantization Implementation for Neural Network Accelerator based on CNN Inference Engine) |
| Advisor: | 陳中和 Chen, Chung-Ho |
| Degree: | Master |
| Department: | College of Electrical Engineering & Computer Science - Institute of Computer & Communication Engineering |
| Year of Publication: | 2021 |
| Graduation Academic Year: | 109 |
| Language: | Chinese |
| Number of Pages: | 63 |
| Chinese Keywords (translated): | artificial intelligence, artificial neural networks, convolution, neural network quantization, convolutional hardware accelerator |
| English Keywords: | Artificial Intelligence, AI accelerator, Convolution, CNNs quantization, Edge Device, Machine Learning |
Artificial neural networks and machine learning have developed rapidly in recent years, but extending their applications means confronting the hardware burden of running neural networks: large numbers of model weight parameters must be stored on the application endpoint, and the complex convolution operations rely on power-hungry GPUs for acceleration. While large-scale deployments can offload computation to the cloud or to workstations, typical edge applications need other ways to accelerate on-device computation, which is why many dedicated neural network hardware accelerators have been proposed.
Our laboratory previously developed a one-dimensional convolution hardware accelerator using 32-bit floating-point arithmetic. To reduce the parameter volume and increase the speed of running convolutional neural networks, this thesis builds on that accelerator and on the laboratory's lightweight inference framework, Micro Darknet For Inference (MDFI), implementing a quantization method jointly in software and hardware. We implement quantization in the convolutional layers of MDFI so that most layers can pass parameters between one another as 8-bit integers, and requantization between convolutional layers requires only a shift operation. We also bring the same quantization concept into the hardware accelerator: its internal data flow uses 8-bit input features and weights with 16-bit partial sums, and a new requantization unit inside the accelerator's compute unit handles requantization between convolutional layers. This thesis further revises the accelerator's internal data-transfer flow and adds a pooling unit inside the PE to accelerate pooling, so that the quantization method reduces the parameter volume while also improving the accelerator's efficiency.
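The shift-only requantization between layers can be sketched in NumPy. This is a minimal illustration under the assumption of per-tensor, symmetric, power-of-two scales, which is what makes inter-layer rescaling a single bit shift; the function names and details below are ours, not MDFI's or the RTL's.

```python
import numpy as np

def quantize_tensor(x, n_bits=8):
    # Per-tensor symmetric quantization with a power-of-two scale 2**shift,
    # so that rescaling between layers reduces to a bit shift.
    max_abs = np.max(np.abs(x))
    qmax = 2 ** (n_bits - 1) - 1
    shift = int(np.floor(np.log2(qmax / max_abs)))
    q = np.clip(np.round(x * 2.0 ** shift), -qmax - 1, qmax).astype(np.int8)
    return q, shift

def requantize(acc, in_shift, w_shift, out_shift):
    # A convolution partial sum accumulated at higher precision carries
    # scale 2**(in_shift + w_shift); converting to the next layer's scale
    # 2**out_shift is a single arithmetic right shift, then saturation
    # back into the int8 range.
    shifted = acc >> (in_shift + w_shift - out_shift)
    return np.clip(shifted, -128, 127).astype(np.int8)
```

For example, with input and weight shifts of 7 and an output shift of 7, a partial sum of 12800 requantizes to the int8 value 100.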
This thesis uses the Int8-16-8 MDFI for data preprocessing, loading the convolution parameters, input features, and weights into a DRAM model through an RTL testbench; after the hardware accelerator finishes computing, golden data generated by MDFI is used to verify correctness. With the proposed quantization scheme, off-chip DRAM accesses drop by 78% and on-chip SRAM accesses by 56% compared with the original accelerator, and running convolution and pooling alone, Yolov3-tiny reaches 8.71 FPS (versus 6.98 FPS for the original CU).
Convolutional neural networks (CNNs) and machine learning have developed rapidly in the past few years. At the same time, users face a heavy hardware burden when running neural networks: model performance is limited by the complexity of the computation and by the hardware accelerator design. Deploying CNNs to edge devices is therefore a major challenge.
In this study, we propose an efficient 1-D PE-architecture inference unit, based on an existing floating-point accelerator, that accelerates convolutional neural networks using 8-bit integer data. This accelerator provides a new way to eliminate a large share of on-chip SRAM accesses. The processing element also supports the pooling operation, which runs in parallel once a convolutional layer finishes. Compared with the original accelerator, the new inference unit reduces off-chip DRAM accesses by over 78% and on-chip SRAM accesses by 56% on the Yolov3-tiny and Vgg16 models.
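The pooling operation that the processing element runs after a convolutional layer can be modeled with a small NumPy sketch. We assume the common 2×2, stride-2 max-pool case (as used in Yolov3-tiny); this is our behavioral model, not the RTL.

```python
import numpy as np

def maxpool_2x2(fmap):
    # 2x2, stride-2 max pooling over an int8 feature map of even
    # height and width, applied after a convolutional layer completes.
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
```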
We also propose a quantized neural network framework that meets the requirements of our inference unit. The quantized framework reduces parameters by 75% while keeping the mAP accuracy loss under 0.3% for Yolov3-tiny on the VOC dataset.
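The 75% parameter reduction follows directly from the bit widths: quantizing 32-bit floating-point weights to 8-bit integers stores a quarter of the bits per weight.

```python
# Storage ratio when 32-bit float weights are quantized to 8-bit integers.
fp32_bits, int8_bits = 32, 8
reduction = 1 - int8_bits / fp32_bits
print(f"parameter storage reduced by {reduction:.0%}")  # prints "parameter storage reduced by 75%"
```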