
Student: Hu, Lien-Chun (胡連鈞)
Thesis Title: Pipelining and Unrolling Design of a CNN Accelerator with Half Multipliers (結合管線化半乘器與迴圈並行之捲積神經網路硬體加速器設計)
Advisor: Jou, Jer-Min (周哲民)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of Publication: 2019
Academic Year: 107 (2018-2019)
Language: Chinese
Pages: 65
Chinese Keywords: convolutional neural network, half multiplier, loop unrolling, hardware accelerator, machine learning
English Keywords: CNN, half multiplier, loop unrolling, hardware accelerator, machine learning
  • This thesis implements a convolutional neural network (CNN) hardware accelerator. CNNs are widely used in modern artificial-intelligence systems, but they face challenges in throughput and energy consumption. Our accelerator design builds on an exploration of the parallelism in the convolution operation, design space exploration, and memory reuse; we also propose a balance theorem for the double-buffer design that balances transfer time against computation time to obtain an optimized design.
    We use VGG-16 as the experimental architecture and ImageNet LSVRC-2014 as our dataset. We analyze the pre-trained weights and biases, extract the parameter values, quantize them to a fixed 16-bit format, and load them into the improved hardware accelerator for the convolution computation. We speed up execution and save hardware area through loop unrolling, multiplier pipelining, replacing full multipliers with half multipliers, and a double-buffer design.
    We implement the design on a DE2i-150 FPGA board and realize our hardware accelerator architecture within an SOPC framework. Experimental results show a throughput of 19.08 GOPS and a latency of 1622.16 ms per image; the latency is high because of the limitations of the FPGA board.

    Machine learning is a category of artificial intelligence in which machines progress from "training" to "prediction". Its applications are very broad: keyword search, weather forecasting, face recognition, fingerprint recognition, license-plate recognition, certificate analysis, voice processing, and so on. Machine learning uses algorithms to classify collected data and train a model. Through the operation of a mathematical model, the machine corrects its parameters from the error, and after many rounds of learning it converges on answers closer to the correct inference. When new data later arrives, the trained model can predict the correct classification. Machine-learning theory was proposed in the 1980s, but the hardware of the time could not perform the required volume of computation, so accuracy was low. In recent years, thanks to algorithmic advances and improved hardware performance, the predictive power of machine learning has improved greatly, and it is now widely used in many scenarios.
    In this thesis, we implement a convolutional neural network (CNN) hardware accelerator. Convolutional neural networks are widely used in modern artificial-intelligence systems, but they are also known for their throughput and energy-consumption challenges. We explore parallelism in convolution operations, perform design space exploration, and exploit memory reuse; we analyze the CNN architecture and optimize the hardware accordingly. At the same time, we propose a double-buffer balance theorem to balance transfer time against operation time.
    We use VGG-16 as the CNN architecture and ImageNet LSVRC-2014 as our dataset. We analyze the pre-trained weights and biases; after fetching the parameters, we write them into the improved hardware accelerator as fixed 16-bit numbers. We use loop parallelism, pipelined multipliers, half multipliers in place of full multipliers, a double-buffer design, and other methods to speed up the computation while also reducing hardware area.
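    As a point of reference, converting pre-trained floating-point parameters to a fixed 16-bit format can be sketched as below; this summary does not state the integer/fraction split, so the 8 fractional bits and the round-to-nearest behavior here are assumptions, not the design's actual quantizer.

        import numpy as np

        def to_fixed16(x, frac_bits=8):
            # Quantize floats to signed 16-bit fixed point with `frac_bits`
            # fractional bits (round-to-nearest, saturating); frac_bits=8 is assumed.
            scaled = np.round(x * (1 << frac_bits))
            return np.clip(scaled, -32768, 32767).astype(np.int16)

        def from_fixed16(q, frac_bits=8):
            # Real value represented by a 16-bit fixed-point word.
            return q.astype(np.float32) / (1 << frac_bits)

        w = np.random.randn(3, 3).astype(np.float32)          # stand-in for VGG-16 weights
        print(np.abs(w - from_fixed16(to_fixed16(w))).max())  # error at most 2**-9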
    We implement the design on the DE2i-150 FPGA board and realize our hardware accelerator within the SOPC architecture. For the VGG-16 CNN model, the experimental results reach a throughput of 19.08 GOPS with a latency of 1622.16 ms per image; the latency is higher than in other designs because of the limitations of the FPGA board. Since our FPGA is less capable than the boards used in other work, we use throughput divided by the number of processing engines (throughput / # of PEs) as our comparison metric, and our CNN design reaches 0.26 GOPS per PE, higher than the other designs.
    Our CNN hardware accelerator achieves this high throughput / # of PEs because we systematically accelerate the implementation hierarchy.
    1. We analyze and propose a method for balancing transfer time and computation time:
    To reduce the impact of data transfer on performance and make full use of load and write memory accesses, we use a double-buffer design and propose a double-buffer balance theorem that finds the balance between the bandwidth (B_bandwidth) and the unrolling factor (Pm) of the CNN hardware. This lets large amounts of data stream over the data bus without interruption. Using this balance theorem, we can hide the access time for data in external memory and ensure the consistency of the data operations; a rough sketch of the balance condition follows.
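    As a rough illustration only (the exact formulation and its derivation appear in Chapter 3), double buffering fully hides transfers when streaming one tile over the bus takes no longer than computing on the previously loaded tile; the tile size, MAC count, clock rate, and bandwidth below are hypothetical numbers, not measurements from our board.

        def transfer_time(tile_bytes, bandwidth):
            # Time to move one tile between external memory and an on-chip buffer.
            return tile_bytes / bandwidth

        def compute_time(tile_macs, pm, clock_hz):
            # Time for Pm parallel MAC units (one MAC per cycle each) to consume a tile.
            return tile_macs / (pm * clock_hz)

        def balanced(tile_bytes, tile_macs, bandwidth, pm, clock_hz):
            # Transfers are hidden when compute time covers transfer time.
            return transfer_time(tile_bytes, bandwidth) <= compute_time(tile_macs, pm, clock_hz)

        # e.g. a 32-bit bus at 100 MHz gives roughly 400 MB/s of bandwidth
        print(balanced(tile_bytes=32 * 1024, tile_macs=1_000_000,
                       bandwidth=400e6, pm=64, clock_hz=100e6))  # True: transfer is hidden

    Raising Pm shortens the compute time, so for a fixed bus bandwidth there is a largest Pm for which the inequality still holds; that crossover is the balance point the theorem locates.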
    2. Systematic analysis of the six-level CNN loop nest to reduce system delay:
    A CNN layer consists of six nested loops. We identify four types of CNN loop parallelism within the six-loop nest and explore parallelization across all four types. Under limited (resource-constrained) hardware, we maximize the loop unrolling. We compare the different loop-unrolling choices by their impact on the amount of data communication and finally adopt the OST (output-stationary) scheme; we also examine the data-movement hierarchy from the innermost to the outermost loop to reduce the number of data loads, cutting overall execution time and improving efficiency. The loop nest is sketched below.
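    For reference, the six loops of one convolutional layer are shown below in their naive form; the loop order and the unrolled dimension only illustrate the output-stationary idea and are not necessarily the thesis's exact schedule.

        import numpy as np

        def conv_layer(inp, weights, bias, stride=1):
            # Naive six-loop convolution over output maps (m), output rows/cols (r, c),
            # input maps (n), and kernel rows/cols (i, j).
            N, H, W = inp.shape          # input feature maps, height, width
            M, _, K, _ = weights.shape   # output feature maps, K x K kernels
            R = (H - K) // stride + 1
            C = (W - K) // stride + 1
            out = np.zeros((M, R, C))
            for m in range(M):            # loop 1: output feature maps (unrolled by Pm)
                for r in range(R):        # loop 2: output rows
                    for c in range(C):    # loop 3: output columns
                        acc = bias[m]     # output stationary: the partial sum stays local
                        for n in range(N):          # loop 4: input feature maps
                            for i in range(K):      # loop 5: kernel rows
                                for j in range(K):  # loop 6: kernel columns
                                    acc += weights[m, n, i, j] * inp[n, r * stride + i, c * stride + j]
                        out[m, r, c] = acc
            return out

        out = conv_layer(np.ones((2, 5, 5)), np.ones((4, 2, 3, 3)), np.zeros(4))
        print(out.shape)  # (4, 3, 3)

    In the hardware, the output-feature-map loop is unrolled by Pm: Pm accumulators each hold one output pixel locally until it is complete, which is what makes the schedule output stationary.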
    3. We place internal memory inside the CNN hardware:
    We design on-chip RAM inside the CNN hardware to store the tiles, so that traffic between this RAM and the CNN accelerator's registers does not have to cross the external bus; the external bus then serves as a dedicated channel between the SDRAM and the RAM. We also design the DMA controller inside the CNN hardware, which makes the design easy to integrate with other SOPC architectures. A behavioral sketch of the resulting transfer pattern follows.
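    Behaviorally, the DMA and the ping-pong buffers cooperate as in the sketch below: while the PE array computes on one buffer, the DMA prefetches the next tile into the other. In real hardware the two run concurrently; this sequential Python sketch only shows the role swap, and all names in it are illustrative.

        def run_layer(tiles, dma_load, compute):
            # Ping-pong buffering: buffer roles alternate every iteration so that
            # tile transfer overlaps with computation on the previous tile.
            buf = [None, None]
            buf[0] = dma_load(tiles[0])                        # prime the first buffer
            for t in range(len(tiles)):
                if t + 1 < len(tiles):
                    buf[(t + 1) % 2] = dma_load(tiles[t + 1])  # prefetch the next tile
                compute(buf[t % 2])                            # consume the current tile

        run_layer(["tile0", "tile1", "tile2"], dma_load=lambda t: t, compute=print)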
    4. Fine-grained cutting of pipelined multipliers:
    We perform fine-grained cutting inside the arithmetic unit and pipeline the half multiplier. This increases the parallel granularity and shortens the critical path; a behavioral model of the half multiplier follows.
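    A fixed-width (half) multiplier returns a 16-bit result for two 16-bit operands by discarding the lower half of the partial products, saving roughly half the multiplier area at the cost of a small error. The model below only truncates the final product; it does not reproduce the circuit's partial-product cutting, pipeline registers, or error compensation, so it approximates the behavior rather than the design.

        def full_mult(a, b):
            # Full 16 x 16 multiplier: exact 32-bit product.
            return a * b

        def half_mult(a, b, width=16):
            # Half-multiplier model: keep only the upper `width` bits of the
            # product, i.e. the effect of dropping the low partial products.
            return (a * b) >> width

        a, b = 0x4000, 0x0212                    # two 16-bit operands
        print(full_mult(a, b), half_mult(a, b))  # exact product vs truncated result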
    In terms of area, our design uses half multipliers instead of full multipliers at the component level to reduce area, and the resulting error does not affect the CNN's image classification. This accelerator works mainly at the implementation level; we do not use other levels, such as the algorithm level or data compression, to accelerate the CNN computation, so future work could combine acceleration at several levels. On the other hand, the hardware parameters of this design must be supplied by software and cannot be updated by the hardware itself, which could cause larger errors on different datasets; a self-training CNN hardware accelerator could therefore be designed in the future.

    Abstract (Chinese)
    Summary (English)
    Acknowledgements
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Research Background
      1.2 Motivation and Objectives
      1.3 Thesis Organization
    Chapter 2 CNN Background and Literature Review
      2.1 Review of Machine Learning and Neural Network Architectures
      2.2 CNN Architecture Theory
        2.2.1 Convolutional Layer (CONV)
        2.2.2 Pooling Layer (PL)
        2.2.3 Activation Function
        2.2.4 Fully Connected Layer (FCN)
        2.2.5 Challenges of CNNs
      2.3 Review of FPGA-Based Accelerators
        2.3.1 Algorithm Level
        2.3.2 Structure Level
        2.3.3 Implementation Level
    Chapter 3 Hardware Accelerator Design Theory
      3.1 Exploring CNN Parallelism
      3.2 Balance of the CNN Double-Buffer Design
      3.3 CNN Loop Optimization Schemes
    Chapter 4 Hardware Design
      4.1 Hardware Architecture
      4.2 CNN_IP Hardware Accelerator Design
        4.2.1 Double-Buffer (Ping-Pong Buffer) Design
        4.2.2 DMA Controller Design
        4.2.3 Data-Stationary Design of the Local Controller
      4.3 Arithmetic Unit Design
        4.3.1 CONV Unit
        4.3.2 Pooling Unit
    Chapter 5 Experimental Environment and Data Analysis
      5.1 Development Platform
      5.2 Experimental Environment and Method
      5.3 Experimental Results
      5.4 Comparison of Memory Communication Counts across Loop-Unrolling Schemes
      5.5 Half-Multiplier Error and Area Comparison
      5.6 IC Chip Layout
    Chapter 6 Conclusions and Future Work
    Chapter 7 References

