
Graduate student: Wang, Chun-Han (王駿瀚)
Thesis title: Custom Compiler Instruction Generation and Scheduling Optimization for Novella-NPU with N:M Sparse Matrix Operations Support
Advisor: Chen, Chung-Ho (陳中和)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication year: 2025
Graduation academic year: 113 (2024–2025)
Language: Chinese
Pages: 122
Keywords: NPU instruction generation, instruction-level parallelism, instruction scheduling, semi-structured sparse matrix computation
With the rapid advancement of artificial intelligence, edge computing devices play an increasingly critical role in model inference. This study focuses on an NPU (Neural Processing Unit) designed for deep learning inference, building on our laboratory's Novella-NPU hardware architecture and combining compiler optimization with instruction scheduling to improve computational efficiency and resource utilization. The proposed compiler generates instructions for tensor operations, enabling end-to-end execution from model to Novella-NPU, and uses double buffering to balance data movement against computation, significantly improving performance. In addition, we design a new computation mode and instruction schedule for N:M semi-structured sparse matrix operations to further raise execution efficiency. Experimental results show that, with instruction dependency analysis and double-buffering optimization, the FPS of MobileNetV2 increases from 84.99 to 105.05 and that of DeiT-tiny from 6.82 to 7.30. The N:M sparse matrix support achieves up to a 1.8× speedup on the DeiT model family while reducing memory usage. This work validates the effectiveness of the Novella-NPU and its compiler design and offers a practical solution for efficient deep learning inference on edge devices; future work may further explore combining hardware parallelism with model optimization.
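To make the double-buffering scheme concrete, the following minimal Python sketch shows a ping-pong schedule in which the fetch of tile i+1 overlaps the computation of tile i. The load_tile and compute_tile helpers are hypothetical stand-ins for the NPU's DMA and CTU work, not the Novella-NPU programming model itself.

    # Ping-pong (double-buffer) schedule sketch: while tile i is being
    # computed on the main thread, the DMA worker fetches tile i+1.
    # load_tile / compute_tile are hypothetical placeholders,
    # not Novella-NPU calls.
    from concurrent.futures import ThreadPoolExecutor

    def load_tile(i):                 # simulated DMA: fetch tile i into SRAM
        return [i] * 4                # dummy tile payload

    def compute_tile(tile):           # simulated CTU work on one tile
        return sum(tile)

    def run(num_tiles):
        results = []
        with ThreadPoolExecutor(max_workers=1) as dma:
            pending = dma.submit(load_tile, 0)              # prefetch tile 0
            for i in range(num_tiles):
                buf = pending.result()                      # wait until this buffer is filled
                if i + 1 < num_tiles:                       # issue the next load first,
                    pending = dma.submit(load_tile, i + 1)  # so it overlaps the compute below
                results.append(compute_tile(buf))           # compute the current tile
        return results

    print(run(4))                     # -> [0, 4, 8, 12]

Blocking on pending.result() before a buffer is consumed plays the same role as the dependency checks described in the thesis: computation may not begin until the load it depends on has completed.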
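N:M semi-structured sparsity keeps at most N nonzero weights in every group of M consecutive weights, so only the kept values and their small in-group positions need to be stored. The following 2:4 sketch illustrates the idea; the magnitude-based selection and the flat values/indices layout here are our own illustrative assumptions, not the thesis's Sparse Weight Compaction format.

    # 2:4 semi-structured compaction sketch: in every group of 4 weights,
    # keep the 2 largest-magnitude values plus their 2-bit in-group
    # positions. The storage layout is illustrative, not the thesis's
    # compaction format.
    def compact_2_4(weights):                 # len(weights) must be a multiple of 4
        values, indices = [], []
        for g in range(0, len(weights), 4):
            group = weights[g:g + 4]
            keep = sorted(range(4), key=lambda j: -abs(group[j]))[:2]
            for j in sorted(keep):            # canonical in-group order
                values.append(group[j])       # kept nonzero value
                indices.append(j)             # 2-bit position within the group
        return values, indices

    def expand_2_4(values, indices):          # rebuild the dense row for checking
        dense = []
        for g in range(0, len(values), 2):
            group = [0.0] * 4
            group[indices[g]] = values[g]
            group[indices[g + 1]] = values[g + 1]
            dense.extend(group)
        return dense

    w = [0.0, 1.5, -2.0, 0.0, 0.3, 0.0, 0.0, -4.0]
    vals, idx = compact_2_4(w)
    print(vals, idx)                          # 4 values + 4 indices instead of 8 values
    print(expand_2_4(vals, idx) == w)         # True: lossless for an already-2:4 row

For 8-bit weights, each group of four then costs two value bytes plus four index bits instead of four value bytes, which is the kind of memory saving reported above; the sparse compute mode can likewise skip the pruned positions, consistent with the up-to-1.8× speedup on the DeiT models.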

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1  Introduction
    1.1 Research Motivation
    1.2 Contributions
    1.3 Thesis Organization
Chapter 2  Background
    2.1 CNN Network
        2.1.1 Convolution 2D
        2.1.2 Depthwise Convolution
        2.1.3 Pooling
        2.1.4 Residual Network
    2.2 Vision Transformer
        2.2.1 Encoder
        2.2.2 Patch Embedding
        2.2.3 Cls Token
    2.3 Novella Anderson NPU
        2.3.1 Convolution Transformer Unit
        2.3.2 Input SRAM and Output SRAM
        2.3.3 MLU Memory Operations and Address Calculation
        2.3.4 Programming Model
        2.3.5 Central Controller
    2.4 TVM
        2.4.1 TVM Compilation Flow
        2.4.2 Relay IR
    2.5 Novella Compilation Flow
        2.5.1 Algorithm Simulator
        2.5.2 Fusion
        2.5.3 Tiling
        2.5.4 FM Alloc and Const Gen
    2.6 Sparse Matrix Formats
        2.6.1 Model Pruning
        2.6.2 Unstructured Pruning
        2.6.3 Structured Pruning
        2.6.4 Semi-structured Pruning
Chapter 3  Novella-NPU Instruction Generation
    3.1 Macro OP Lowering
        3.1.1 Introduction to Macro OPs
        3.1.2 Lowering
        3.1.3 SRAM Allocation
        3.1.4 Emitter
    3.2 Transpose Codegen
        3.2.1 Store Macro OP with Transpose
        3.2.2 Multi-byte Data Access
    3.3 CTU Macro OP Optimization
        3.3.1 Sub-Tile Double Buffer
        3.3.2 Tile Double Buffer
        3.3.3 16-8 Convolution
    3.4 Dependency Check
        3.4.1 Dependency Tag
    3.5 Double Buffer Optimization
Chapter 4  Design and Methods for Supporting N:M Semi-Structured Sparse Matrix Computation on Novella-NPU
    4.1 N:M Sparse Matrix Multiplication
        4.1.1 CTU with N:M Sparse Support
        4.1.2 Double Throughput with Input SRAM
    4.2 Sparse Weight Compaction
    4.3 Micro OP of Sparse Mode Computation on CTU
Chapter 5  Experimental Environment and Performance Analysis
    5.1 Experimental Environment
        5.1.1 ESL Simulation Parameters
        5.1.2 Model Parameters
    5.2 Instruction Generation Optimization Analysis
        5.2.1 Instruction Parallelism Analysis
        5.2.2 Double Buffer Performance Analysis
        5.2.3 Instruction Count Analysis
    5.3 N:M Semi-Structured Sparse Computation Performance Analysis
        5.3.1 Weight Compression Analysis
        5.3.2 Impact of N:M Semi-Structured Sparse Computation on Execution Time
Chapter 6  Conclusion and Future Work
    6.1 Conclusion
    6.2 Future Work
References
