
Author: Ke, Yue-Hung (柯岳宏)
Title: A High-Efficiency Sparse Matrix Multiplication Accelerator with Hierarchical Sparsity Optimization (階層式稀疏度優化之高效稀疏矩陣乘法加速器)
Advisors: Jou, Jer-Min (周哲民); Kuo, Chih-Hung (郭致宏)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2024
Graduation Academic Year: 112 (ROC calendar)
Language: Chinese
Number of Pages: 86
Keywords (Chinese): sparse matrix multiplication, Gustavson's algorithm, data compression, accelerator
Keywords (English): sparse matrix multiplication, Gustavson's algorithm, hierarchical sparsity, sparse formats, specialized accelerator
    Sparse matrix multiplication (SPMM) is a core kernel in today's deep learning models and plays an especially critical role in Transformer models. Because of the characteristics of sparse matrices, executing SPMM on general-purpose architectures is inefficient, which underscores the importance of dedicated hardware accelerator design. However, previous SPMM accelerators have mainly adopted inner-product or outer-product dataflows, and these approaches hit performance bottlenecks because their reuse of inputs or outputs is limited. Earlier accelerators have also not fully explored Gustavson's algorithm, whose row-by-row dataflow avoids the reuse problems of traditional dataflow architectures and can therefore improve computational efficiency. Gustavson's algorithm, however, involves irregular memory access patterns, which earlier accelerator architectures have not adequately supported or optimized.
    This thesis proposes a high-efficiency sparse matrix multiplication accelerator with hierarchical sparsity optimization. It exploits the dataflow characteristics of Gustavson's algorithm to address the challenges of existing techniques and combines optimizations targeting the different levels of sparsity within a sparse matrix, effectively improving reuse and computational efficiency at each level. In addition, we optimize the storage structure of the sparse matrices, significantly reducing memory requirements and thereby lowering storage cost.

    Sparse Matrix Multiplication (SPMM) is a critical component in modern deep learning models, especially within Transformer architectures. The inherent properties of sparse matrices lead to inefficient execution of SPMM on general-purpose architectures, highlighting the necessity for specialized hardware accelerators. Previous accelerators for sparse matrix multiplication predominantly employed inner-product or outer-product dataflows. These methods often encountered performance bottlenecks due to limited reuse of input or output data. Notably, earlier accelerators have not thoroughly explored Gustavson's algorithm, which can potentially enhance computational efficiency through its row-by-row dataflow that effectively mitigates the reuse issues inherent in traditional dataflow architectures. However, Gustavson's algorithm involves irregular memory access patterns, which previous accelerator designs have not sufficiently supported or optimized.
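    To make the row-by-row dataflow concrete, the sketch below implements Gustavson's algorithm on SciPy CSR matrices: each output row C[i,:] is accumulated as a combination of the rows B[k,:] selected by the nonzeros of A[i,:]. This is only an illustration of the algorithm, not the accelerator's implementation; the function name and the dictionary-based sparse accumulator are choices made for the example.

```python
# Minimal sketch of Gustavson's row-wise SpGEMM (C = A * B) on CSR inputs.
# Illustrative only; names and data structures are not from the thesis.
import numpy as np
from scipy.sparse import csr_matrix, random as sparse_random

def gustavson_spgemm(A: csr_matrix, B: csr_matrix) -> csr_matrix:
    """Compute C = A @ B one output row at a time: C[i,:] = sum_k A[i,k] * B[k,:]."""
    n_rows, _ = A.shape
    _, n_cols = B.shape
    indptr, indices, data = [0], [], []
    for i in range(n_rows):
        acc = {}  # sparse accumulator for row i of C
        for idx in range(A.indptr[i], A.indptr[i + 1]):
            k, a_ik = A.indices[idx], A.data[idx]
            # Scale row k of B and merge it into the accumulator;
            # only nonzero partial products are ever generated.
            for jdx in range(B.indptr[k], B.indptr[k + 1]):
                j = B.indices[jdx]
                acc[j] = acc.get(j, 0.0) + a_ik * B.data[jdx]
        for j in sorted(acc):       # emit row i in column order
            indices.append(j)
            data.append(acc[j])
        indptr.append(len(indices))
    return csr_matrix((data, indices, indptr), shape=(n_rows, n_cols))

# Quick check against SciPy's own sparse product.
A = sparse_random(64, 64, density=0.05, format="csr", random_state=0)
B = sparse_random(64, 64, density=0.05, format="csr", random_state=1)
assert np.allclose(gustavson_spgemm(A, B).toarray(), (A @ B).toarray())
```

    Because each output row is finished before the next one starts, partial results never have to be written out and re-read, which is the reuse advantage referred to above; the cost is the irregular, row-indexed accesses into B that the accelerator must support.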
    This paper proposes a high-efficiency sparse matrix multiplication accelerator with hierarchical sparsity optimization, leveraging the dataflow characteristics of the Gustavson algorithm to address existing technical challenges. The design incorporates optimizations tailored to different levels of sparsity within the sparse matrices, significantly improving data reuse and computational efficiency across various hierarchy levels. Additionally, we optimize the storage structure for sparse matrices, substantially reducing memory requirements and thereby achieving lower storage costs.
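    The storage-structure claim can be illustrated with a two-level encoding in the spirit of the word-level and bit-level optimizations developed later in the thesis. The sketch below is purely hypothetical: the block size, helper name, and exact layout are assumptions made for illustration, not the format actually used by the accelerator. A coarse bitmap records which fixed-size words of a row contain any nonzero, so entirely empty words cost a single bit, while per-word bitmaps locate the surviving nonzeros and only those values are stored.

```python
# Hypothetical two-level bitmap encoding, shown only to illustrate how
# hierarchical sparsity metadata can shrink storage; the thesis's actual
# word-level / bit-level format may differ.
import numpy as np

BLOCK = 8  # elements per "word" block; an illustrative choice, not the thesis's parameter

def encode_row(row: np.ndarray):
    """Return (block_bitmap, element_bitmaps, values) for one matrix row."""
    n_blocks = (len(row) + BLOCK - 1) // BLOCK
    block_bitmap = 0      # word level: 1 bit per block, set if the block has any nonzero
    element_bitmaps = []  # bit level: for each non-empty block, 1 bit per element
    values = []           # only the nonzero values themselves are kept
    for b in range(n_blocks):
        chunk = row[b * BLOCK:(b + 1) * BLOCK]
        mask = 0
        for i, v in enumerate(chunk):
            if v != 0:
                mask |= 1 << i
                values.append(int(v))
        if mask:
            block_bitmap |= 1 << b
            element_bitmaps.append(mask)
    return block_bitmap, element_bitmaps, values

# A 16-element row: the first word is all zero and is skipped entirely;
# the second word keeps one 8-bit mask plus its two nonzero values.
row = np.array([0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 5])
print(encode_row(row))  # -> (2, [129], [3, 5])
```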

    Table of Contents:
    Abstract (Chinese)
    SUMMARY
    METHODS AND ARCHITECTURE
    RESULTS AND DISCUSSION
    CONCLUSION
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1  Introduction
      1.1 Preface
      1.2 Research Motivation and Objectives
      1.3 Thesis Organization
    Chapter 2  Background and Related Work
      2.1 Overview of the Transformer Model
      2.2 Sparse Matrices Arising from the Attention Mechanism
      2.3 Compression Formats for Sparse Matrices
      2.4 Current Research and Development of Sparse Matrix Multiplication Accelerators
    Chapter 3  Design Considerations for the Sparse Matrix Multiplication Accelerator
      3.1 Discussion of Matrix Multiplication Optimization
      3.2 Dataflow Analysis of Matrix Multiplication
      3.3 Tiling and Loop Optimization for Matrix Multiplication
      3.4 Pipelined Design of the Matrix Multiplication Dataflow
    Chapter 4  Hierarchical Sparsity Optimization Design
      4.1 Design Rationale of Word-Level Optimization
      4.2 Hardware Implementation of Word-Level Optimization
      4.3 Design Rationale of Bit-Level Optimization
      4.4 Hardware Implementation of Bit-Level Optimization
    Chapter 5  High-Efficiency Sparse Matrix Multiplication Accelerator Design
      5.1 Overview of the Overall Accelerator Architecture
      5.2 Design of the Control Unit
      5.3 Design of the Processing Units
      5.4 Design of the On-Chip Buffers
    Chapter 6  Experimental Results and Discussion
      6.1 Development Environment and Hardware Specifications
      6.2 Experimental Methodology
      6.3 Impact of Hierarchical Sparsity Optimization on the Accelerator
      6.4 Impact of Different Sparsity Levels on Execution Efficiency
      6.5 Impact of Different Matrix Sizes on Execution Efficiency
      6.6 Pipelining Analysis and Hardware Utilization
      6.7 Comparison with Other Accelerators
    Chapter 7  Conclusion and Future Work
      7.1 Conclusion
      7.2 Future Work
    References


    Full text publicly available: on campus from 2025-08-22; off campus from 2025-08-22.