| Graduate Student: | 胡連鈞 Hu, Lien-Chun |
|---|---|
| Thesis Title: | 結合管線化半乘器與迴圈並行之捲積神經網路硬體加速器設計 (Pipelining and Unrolling Design of a CNN Accelerator with Half Multipliers) |
| Advisor: | 周哲民 Jou, Jer-Min |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of Publication: | 2019 |
| Graduation Academic Year: | 107 |
| Language: | Chinese |
| Number of Pages: | 65 |
| Chinese Keywords: | convolutional neural network, half multiplier, loop unrolling, hardware accelerator, machine learning |
| English Keywords: | CNN, half multiplier, loop unrolling, hardware accelerator, machine learning |
This thesis implements a convolutional neural network (CNN) hardware accelerator. Convolutional neural networks are widely used in modern artificial intelligence systems, but they face challenges in throughput and energy consumption. Our accelerator design is based on an exploration of the parallelism in convolution operations, design space exploration, and memory reuse, and we propose a balance theorem for the double-buffer design that balances data transfer time against computation time to obtain an optimized design.

We use VGG-16 as the experimental architecture and ImageNet LSVRC-2014 as our dataset. We analyze the pre-trained weights and biases, extract the parameter values, quantize them to fixed 16-bit numbers, and load them into the improved hardware accelerator for the convolution operations. We apply loop unrolling, pipelined multipliers, half multipliers in place of full multipliers, and a double-buffer design to speed up execution and reduce hardware area.

We implement the design on a DE2i-150 FPGA board and realize our hardware accelerator within an SOPC architecture. The results show a throughput of 19.08 GOPS and a latency of 1622.16 ms per image; the high latency is due to the limitations of the FPGA board.
Machine learning is a branch of artificial intelligence. It allows machines to learn, moving from "training" to "prediction". Machine learning applications are very broad, such as keyword search, weather forecasting, face recognition, fingerprint recognition, license plate recognition, certificate analysis, voice processing, and so on. Machine learning uses algorithms to classify the collected data and to train models. Through the operation of a mathematical model, the machine can self-correct and update its parameters from the error, and after many rounds of learning it arrives at inferences closer to the correct answer. When new data arrive, the trained model can then predict the correct classification. Machine learning theory was proposed in the 1980s, but due to the hardware limitations of that time, computers could not perform the required volume of calculations, resulting in low accuracy. In recent years, thanks to improved algorithms and hardware performance, the predictive power of machine learning has greatly improved, and it is now widely used in many scenarios.
In this thesis, we implement a Convolutional Neural Network (CNN) hardware accelerator. Convolutional neural networks are widely used in modern artificial intelligence systems, but they are also known for their throughput and energy consumption challenges. We explore parallelism in convolution operations, perform design space exploration, and analyze memory reuse; based on this analysis of the CNN architecture, we optimize the hardware. At the same time, we propose a double-buffer balance theorem to balance transfer time and computation time.

We use VGG-16 as the CNN architecture and ImageNet LSVRC-2014 as our dataset. We analyze the pre-trained weights and biases; after fetching the parameters, we write them into the improved hardware accelerator as fixed 16-bit numbers. We use loop parallelism, pipelined multipliers, half multipliers in place of full multipliers, a double-buffer design, and other methods to speed up the computation while also reducing hardware area.

We implement the design on the DE2i-150 FPGA board and integrate our hardware accelerator into an SOPC architecture. For the VGG-16 CNN model, the experimental results achieve 19.08 GOPS of throughput and 1622.16 ms of latency per image. The latency is higher than that of other designs because of the limitations of the FPGA board. Since our FPGA board is weaker than those used by other designs, we use throughput divided by the number of processing engines (throughput / # of PEs) as our comparison metric. Our ideal CNN design's throughput / # of PEs reaches 0.26, which is higher than that of other designs.
Our CNN hardware accelerator achieves a very high throughput / # of PEs because we systematically accelerate the design at the implementation level.
1. We analyze and propose a method for balancing transmission time and computation time:
To reduce the impact of data transmission on performance and make full use of load and store memory accesses, we use a double-buffer design and propose a double-buffer balance theorem that finds the balance between the bandwidth (B_bandwidth) and the unrolling factor (Pm) of the CNN hardware. This allows large amounts of data to be transferred over the data bus without interruption. Using this balance theorem, we can hide the access time required to fetch data from external memory and ensure the consistency of data operations.
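As a rough illustration of this balance (the thesis's exact formulation is not reproduced here, so the quantities below other than B_bandwidth and Pm are assumptions), the condition can be sketched as the tile transfer time not exceeding the tile computation time:

```latex
% Hedged sketch of a double-buffer balance condition (illustrative symbols only).
% D_tile : bytes moved per tile (assumption), B_bandwidth : bus bandwidth,
% O_tile : MAC operations per tile (assumption), Pm : unrolling factor, f : clock frequency.
\[
  T_{\text{transfer}} = \frac{D_{\text{tile}}}{B_{\text{bandwidth}}},
  \qquad
  T_{\text{compute}} = \frac{O_{\text{tile}}}{P_m \cdot f}
\]
\[
  T_{\text{transfer}} \le T_{\text{compute}}
  \;\Longleftrightarrow\;
  B_{\text{bandwidth}} \ge \frac{D_{\text{tile}} \cdot P_m \cdot f}{O_{\text{tile}}}
\]
% When the two times are equal, neither the bus nor the PE array idles,
% which is the balance point between B_bandwidth and Pm described above.
```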
2. Systematic analysis of the six nested CNN loops to reduce system delay:
The CNN operation consists of six nested loops. We propose four types of CNN loop parallelism within these six loops and explore the parallelism of each type. Under limited (resource-constrained) hardware, we maximize the loop unrolling. We examine how different loop-unrolling choices affect the amount of data communication and ultimately adopt an OST (output-stationary) scheme; we also discuss the data-movement hierarchy from the inner loops to the outer loops to reduce the number of data loads, shortening the overall execution time and improving efficiency, as sketched below.
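A minimal Python-style sketch of the idea (the loop bounds and names such as `M`, `N`, `R`, `C`, `K` are illustrative assumptions, not the thesis's exact notation): the six convolution loops are kept, the loop over output feature maps is unrolled by a factor `Pm`, and each partial sum stays in its accumulator until the inner loops finish.

```python
import numpy as np

# Hedged sketch: six-loop CNN convolution with the output-feature-map loop
# unrolled by a factor Pm (output-stationary: each partial sum stays in its
# accumulator until all input channels and kernel positions are consumed).
M, N, R, C, K, Pm = 8, 4, 6, 6, 3, 4            # out maps, in maps, out rows/cols, kernel, unroll factor

ifmap  = np.random.rand(N, R + K - 1, C + K - 1)
weight = np.random.rand(M, N, K, K)
ofmap  = np.zeros((M, R, C))

for mo in range(0, M, Pm):              # output feature maps, step Pm (unrolled in hardware)
    for r in range(R):                  # output rows
        for c in range(C):              # output columns
            acc = np.zeros(Pm)          # Pm partial sums stay "stationary" here
            for n in range(N):          # input feature maps
                for kr in range(K):     # kernel rows
                    for kc in range(K): # kernel columns
                        px = ifmap[n, r + kr, c + kc]
                        for p in range(Pm):          # these Pm MACs map to parallel PEs
                            acc[p] += weight[mo + p, n, kr, kc] * px
            ofmap[mo:mo + Pm, r, c] = acc            # write back once per output pixel
```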
3. We build internal memory inside the CNN hardware:
We place RAM for storing tiles inside the CNN hardware, so that transfers between this RAM and the CNN accelerator's registers do not need to go through the external bus; the external bus can then serve as a dedicated channel between the SDRAM and the on-chip RAM. We also design a DMA controller in the CNN hardware, which makes the design easy to integrate with other SOPC architectures. A data-movement sketch follows below.
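The data movement can be pictured with the following hedged Python sketch (the tile size, buffer names, and the `dma_load`/`dma_store` helpers are hypothetical stand-ins for the on-chip DMA controller and RAM, not the actual SOPC interface):

```python
# Hedged sketch of tile-based data movement: the DMA controller copies one
# tile from external SDRAM into on-chip RAM, the PE array computes only from
# that on-chip copy, and the finished tile is written back over the same bus.
def dma_load(sdram, offset, size):
    """Hypothetical DMA read: returns a tile copied from external memory."""
    return sdram[offset:offset + size]

def dma_store(sdram, offset, tile):
    """Hypothetical DMA write: copies a result tile back to external memory."""
    sdram[offset:offset + len(tile)] = tile

def compute_tile(tile):
    """Placeholder for the convolution performed by the PE array on one tile."""
    return [2 * x for x in tile]

sdram = list(range(32))                     # stand-in for external SDRAM contents
TILE = 8
for base in range(0, len(sdram), TILE):
    on_chip = dma_load(sdram, base, TILE)   # fill on-chip RAM; no further bus traffic
    result = compute_tile(on_chip)          # PEs read/write only the on-chip copy
    dma_store(sdram, base, result)          # write the finished tile back
```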
4. Fine-grained pipelined multipliers:
We perform fine-grained cutting inside the arithmetic unit and pipeline the half multiplier. This increases the parallel granularity and shortens the critical path.
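A hedged bit-level sketch of the resulting area saving (the rounding and error-correction details of the actual half multiplier are not reproduced here; this only contrasts keeping the full 2n-bit product with keeping the upper n bits):

```python
# Hedged sketch: a "half" (fixed-width) multiplier keeps only the upper n bits
# of an n x n multiplication, so roughly half of the partial-product hardware
# can be dropped at the cost of a small truncation error.
N_BITS = 16

def full_multiply(a, b):
    """Full multiplier: n-bit x n-bit -> 2n-bit product."""
    return a * b

def half_multiply(a, b, n=N_BITS):
    """Fixed-width sketch: keep only the upper n bits of the product.
    (A real low-error design would add a correction term estimated from the
    discarded lower partial products; that term is omitted here.)"""
    return (a * b) >> n

a, b = 40000, 123                       # two unsigned 16-bit operands
full = full_multiply(a, b)              # exact 32-bit result
approx = half_multiply(a, b) << N_BITS  # scale back for comparison
print(full, approx, full - approx)      # the difference is the truncation error
```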
In terms of area, our design uses a half multiplier instead of a full multiplier at the component level to reduce area, and the resulting error does not affect the CNN's image classification. This accelerator accelerates mainly at the implementation level; we do not use other levels, such as the algorithm level or data compression, to accelerate the CNN operation. In the future, the acceleration could be improved by combining different levels of acceleration. On the other hand, the hardware parameters of this design must be supplied by software and cannot be updated, so larger errors may appear on different data sets. Therefore, a self-training CNN hardware accelerator could be designed in the future.
On-campus access: available from 2024-07-31