| Graduate Student: | 黃姵瑄 Huang, Pei-Xuan |
|---|---|
| Thesis Title: | A pattern-aware computation DNN processor with pruning algorithm co-design (具有修剪演算法協同設計的模式感知計算DNN處理器) |
| Advisor: | 蔡家齊 Tsai, Chia-Chi |
| Degree: | Master |
| Department: | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2024 |
| Graduation Academic Year: | 113 |
| Language: | English |
| Number of Pages: | 99 |
| Chinese Keywords: | 剪枝、模式剪枝、資料壓縮、稀疏加速器、硬體架構分析 |
| English Keywords: | Pruning, pattern pruning, data compression, sparse accelerator, hardware architecture analysis |
As the demand for artificial intelligence continues to grow and practical applications and products such as image recognition, object detection, and autonomous driving spread across many fields, the need for AI has risen significantly in both cloud and edge computing environments.
However, today's best-performing model architectures typically require large amounts of computation and memory, which makes them difficult to deploy effectively in resource-constrained environments. Reducing a model's computational load and memory requirements while improving hardware compute efficiency has therefore become increasingly important: it not only widens the range of AI applications but also promotes their practical use in different scenarios.
We therefore propose a systematic solution that uses software-level pattern-based pruning to analyze the importance of the weights in a model. For the critical weight positions, we define a pattern set and, during pruning, ensure that the non-zero positions of every convolutional kernel conform to one of these patterns, constraining the distribution of non-zero values. This not only effectively reduces the model's parameter count and computational load, but also preserves accuracy while improving hardware execution efficiency.
To reduce transfer requirements, we compress the model's weights and inputs in different ways. At the hardware level, we propose a new pattern-based accelerator architecture that computes without decompression and exploits the properties brought by pattern pruning. Two internal modules, the Scheduler and the Pattern_Analyzer, work together to avoid the processing-element idling caused by the partial-sum write-back contention common in sparse accelerators, thereby raising the utilization of the compute units. We further integrate the accelerator with a bus and DRAM and perform cycle-accurate simulation.
Overall, the proposed software-hardware co-design effectively accelerates model inference; future work can further revise the memory management inside the hardware architecture to optimize its performance more substantially.
As the demand for artificial intelligence continues to grow, the adoption of practical applications and products such as image recognition, object detection, and autonomous driving has significantly increased across various fields, both in cloud and edge computing environments. However, high-accuracy model architectures typically require substantial computational resources and memory, making effective deployment challenging in resource-constrained environments. Consequently, reducing computational load and memory requirements while enhancing hardware efficiency has become increasingly important. This not only broadens the scope of AI applications but also promotes their practical implementation across a wide range of scenarios.
To address these challenges, we propose a systematic solution that employs pattern-based pruning at the software level to analyze the significance of the weights within the model. We define a pattern set for critical weight positions and, during the pruning process, ensure that the non-zero positions of every convolutional kernel conform to one of these patterns, thereby constraining the distribution of non-zero values. This approach effectively reduces the model's parameter count and computational demands while maintaining accuracy and enhancing hardware performance.
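As a rough illustration of this constraint, the following minimal sketch (Python/NumPy) prunes each 3x3 kernel onto the pattern from a candidate set that preserves the most weight magnitude. The random candidate-pattern generation and the magnitude-based selection rule are assumptions for illustration, not the exact pattern-set construction or importance analysis used in this thesis.

```python
# Minimal sketch of pattern-based pruning for 3x3 convolution kernels.
# The candidate pattern set and the L1-magnitude selection criterion below
# are illustrative assumptions, not the thesis' actual procedure.
import numpy as np

def build_pattern_set(num_patterns=8, kept_per_kernel=4, seed=0):
    """Generate distinct 3x3 binary masks, each keeping `kept_per_kernel` positions."""
    rng = np.random.default_rng(seed)
    patterns = []
    while len(patterns) < num_patterns:
        mask = np.zeros(9, dtype=np.float32)
        mask[rng.choice(9, kept_per_kernel, replace=False)] = 1.0
        if not any(np.array_equal(mask, p) for p in patterns):
            patterns.append(mask)
    return np.stack(patterns).reshape(num_patterns, 3, 3)

def pattern_prune(weights, patterns):
    """For each 3x3 kernel, keep the pattern that preserves the most L1 magnitude."""
    out_ch, in_ch, _, _ = weights.shape
    pruned = np.empty_like(weights)
    pattern_ids = np.empty((out_ch, in_ch), dtype=np.int32)
    for o in range(out_ch):
        for i in range(in_ch):
            kernel = weights[o, i]
            scores = [(np.abs(kernel) * p).sum() for p in patterns]
            best = int(np.argmax(scores))
            pattern_ids[o, i] = best
            pruned[o, i] = kernel * patterns[best]   # zero out positions outside the pattern
    return pruned, pattern_ids

if __name__ == "__main__":
    w = np.random.randn(16, 8, 3, 3).astype(np.float32)
    patterns = build_pattern_set()
    pruned_w, ids = pattern_prune(w, patterns)
    print("sparsity:", 1.0 - np.count_nonzero(pruned_w) / pruned_w.size)
```

Because every kernel is forced onto one of a small number of layouts, the resulting sparsity is fine-grained yet regular, which is what the hardware side exploits.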
To further alleviate data-transfer demands, we apply different compression schemes to the model's weights and inputs. At the hardware level, we propose a novel pattern-based accelerator architecture that performs computation without decompression and effectively leverages the properties introduced by pattern pruning. Moreover, two internal modules, the Scheduler and the Pattern_Analyzer, work together to avoid the processing-unit idling caused by partial-sum write-back contention that is common in sparse accelerators, significantly improving the utilization of the computational units. Additionally, we integrate the accelerator with a bus and DRAM to conduct cycle-accurate simulations.
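To make the "compute without decompression" idea concrete, here is an illustrative sketch of the weight side only: each kernel is stored as a pattern index plus its packed non-zero values, and a direct convolution loop consumes that representation as-is. It takes the pruned weights, pattern IDs, and pattern set produced by the sketch above. The encoding format and the loop structure are assumptions for illustration, not the compression scheme or dataflow actually used by the proposed accelerator.

```python
# Illustrative pattern-indexed weight encoding and a convolution that consumes
# it directly, i.e. without reconstructing dense 3x3 kernels. The format
# (pattern id + packed non-zero values) is an assumption based on the abstract.
import numpy as np

def encode(pruned_weights, pattern_ids, patterns):
    """Pack each pruned kernel into its non-zero values, ordered by its pattern layout."""
    coords = [np.argwhere(p > 0) for p in patterns]          # per-pattern (row, col) positions
    packed = []
    for o in range(pruned_weights.shape[0]):
        row = []
        for i in range(pruned_weights.shape[1]):
            rc = coords[pattern_ids[o, i]]
            row.append(pruned_weights[o, i][rc[:, 0], rc[:, 1]])
        packed.append(row)
    return packed, coords

def conv2d_pattern(x, packed, pattern_ids, coords):
    """Direct 3x3 convolution (stride 1, no padding) over the packed representation."""
    in_ch, H, W = x.shape
    out_ch = len(packed)
    y = np.zeros((out_ch, H - 2, W - 2), dtype=np.float32)
    for o in range(out_ch):
        for i in range(in_ch):
            rc = coords[pattern_ids[o, i]]
            vals = packed[o][i]
            for (r, c), v in zip(rc, vals):                   # iterate non-zero taps only
                y[o] += v * x[i, r:r + H - 2, c:c + W - 2]
    return y
```

Because every kernel draws its layout from the same small pattern set, the per-kernel overhead reduces to a single pattern index, and the compute path never needs to materialize dense kernels.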
Overall, our integrated software-hardware approach effectively accelerates model inference. Future research may focus on refining memory management within the hardware architecture to achieve even greater performance optimization.
On-campus access: available from 2029-11-04.