| Graduate Student | 羅威龍 Lo, Wei-Lung |
|---|---|
| Thesis Title | Novella-NPU 之編譯器與硬體共同優化及支援 MSFP 之運算核心設計 (Compiler-Hardware Co-optimization and MSFP-Supported Compute Core Design for Novella-NPU) |
| Advisor | 陳中和 Chen, Chung-Ho |
| Degree | Master |
| Department / Program | Academy of Innovative Semiconductor and Sustainable Manufacturing - Program on Integrated Circuit Design |
| Publication Year | 2025 |
| Graduation Academic Year | 113 (ROC calendar) |
| Language | Chinese |
| Pages | 166 |
| Keywords (Chinese) | Layer Fusion, NPU front-end architecture design, MSFP-format compute core |
| Keywords (English) | Layer Fusion, NPU Front-end Design, MSFP-supported Compute Core |
Using the Novella-NPU as its platform, this thesis proposes a software-hardware co-optimization flow that spans from the model compiler down to the hardware microarchitecture. First, static DRAM feature-map allocation and compile-time garbage collection are added to the TVM compiler backend, lowering the peak external-memory demand during inference. Next, automatically derived feature-map tiling maximizes the compute tile size and data reuse even under tight SRAM constraints. Layer Fusion then chains convolution, planar, and element-wise operations according to the hardware datapath constraints, cutting external memory accesses by 46.7% on average; MobileNet V2 inference rises from 48.79 to 101.3 FPS (roughly a 2x speedup) and DeiT-tiny inference FPS improves by 17.8%. To coordinate the multiple compute units, the central controller adopts a four-stage pipeline with a μOP queue and dependency table that supports out-of-order issue across function units; with queue sizes of 8/8/16/16 it achieves both the highest parallelism and the shortest execution time, running MobileNet V2 about 20% faster and DeiT-tiny 65% faster than in-order issue. Roofline analysis shows that the fused workloads move markedly toward the upper right, shifting them from memory-bound to compute-bound.
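As a rough, self-contained illustration of the tiling step (not the thesis' actual algorithm), the Python sketch below walks output-tile heights from largest to smallest and keeps the first one whose input tile, weights, and output tile fit a given SRAM budget; the function names, the 2-byte-per-element (BF16) assumption, and the example layer shape are all hypothetical.

```python
def conv_tile_footprint(tile_h, out_w, in_c, out_c, k, stride, bytes_per_elem=2):
    """Approximate SRAM bytes needed to compute one output tile of height tile_h.

    The input tile must cover (tile_h - 1) * stride + k rows of the input
    feature map; 2 bytes/element assumes BF16 storage.
    """
    in_tile_h = (tile_h - 1) * stride + k
    in_w = (out_w - 1) * stride + k
    input_bytes = in_tile_h * in_w * in_c * bytes_per_elem
    weight_bytes = k * k * in_c * out_c * bytes_per_elem
    output_bytes = tile_h * out_w * out_c * bytes_per_elem
    return input_bytes + weight_bytes + output_bytes


def pick_tile_height(out_h, out_w, in_c, out_c, k, stride, sram_bytes):
    """Largest output-tile height whose working set fits in the SRAM budget."""
    for tile_h in range(out_h, 0, -1):
        if conv_tile_footprint(tile_h, out_w, in_c, out_c, k, stride) <= sram_bytes:
            return tile_h
    raise ValueError("even a single output row does not fit in SRAM")


if __name__ == "__main__":
    # Hypothetical 3x3, stride-1 layer on a 56x56x64 feature map producing
    # 56x56x64, with a 512 KiB on-chip SRAM budget.
    print(pick_tile_height(out_h=56, out_w=56, in_c=64, out_c=64,
                           k=3, stride=1, sram_bytes=512 * 1024))
```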
To lower the barrier to model deployment, this work further proposes a CTU core that supports Microsoft Floating Point (MSFP): an Alignment Unit is added in front of the MAC array and exponent/mantissa double buffers are added inside the CTU, so that BF16 inputs can be computed against MSFP12/16 weights. Combined with the compiler's weight-reordering and dequantization flow, users can deploy floating-point models directly, without a cumbersome integer-quantization procedure. Simulation results show that under MSFP16, MobileNet V2 retains a Top-1 accuracy of 71.5%, essentially matching FP32, whereas MSFP12 shows a larger accuracy drop because the model is small. The hardware behavior has been bit-match verified in the Instruction Set Simulator and Algorithm Simulator, confirming datapath correctness and quantifying the accuracy loss, and laying the groundwork for future mixed-format computation.
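Since the abstract does not spell out the encoding, the following sketch shows a generic block-floating-point scheme in the spirit of MSFP: each 16-element block of weights shares one exponent while every element keeps a sign and a short mantissa (assumed here to be 7 mantissa bits for the MSFP16-like case and 3 for the MSFP12-like case). The function names and rounding details are illustrative only.

```python
import numpy as np


def msfp_encode(block, mant_bits):
    """Encode one block of weights into a shared exponent plus per-element
    sign/mantissa integers (a block-floating-point view of MSFP)."""
    max_mag = float(np.max(np.abs(block)))
    shared_exp = int(np.floor(np.log2(max_mag))) if max_mag > 0 else 0
    scale = 2.0 ** (shared_exp + 1 - mant_bits)        # value of one mantissa LSB
    mant = np.clip(np.round(np.abs(block) / scale), 0, 2 ** mant_bits - 1)
    sign = np.signbit(block).astype(np.int8)           # one sign bit per element
    return shared_exp, sign, mant.astype(np.int32)


def msfp_decode(shared_exp, sign, mant, mant_bits):
    """Reconstruct (dequantize) the block from its shared-exponent encoding."""
    scale = 2.0 ** (shared_exp + 1 - mant_bits)
    return np.where(sign == 1, -mant, mant).astype(np.float32) * scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(16).astype(np.float32)     # one 16-element weight block
    for bits, name in [(7, "MSFP16-like"), (3, "MSFP12-like")]:
        e, s, m = msfp_encode(w, bits)
        err = np.max(np.abs(w - msfp_decode(e, s, m, bits)))
        print(f"{name}: shared exponent {e}, max abs error {err:.5f}")
```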
This thesis presents a holistic hardware-software co-optimization flow for the Novella-NPU, spanning from model compilation to microarchitecture refinement. First, the TVM compiler backend incorporates static DRAM feature-map allocation and compile-time garbage collection, significantly reducing the peak external-memory demand during inference. Automatic feature-map tiling is then employed to maximize the computational tile size and data reuse, even with constrained SRAM resources. Layer Fusion strategically links convolution, planar, and element-wise operations according to hardware datapath constraints, reducing external memory access by an average of 46.7%. This optimization doubles the inference FPS of MobileNet V2 (from 48.79 to 101.3 FPS) and improves DeiT-tiny performance by 17.8%. To support multiple processing units, the central controller is designed with a four-stage pipeline and a μOP queue plus dependency table, enabling out-of-order issue across function units. With queue sizes set to 8/8/16/16, this configuration achieves maximum parallelism and the shortest execution time, improving MobileNet V2 processing speed by approximately 20% over in-order issue and DeiT-tiny by 6.5%. Roofline model analysis shows a significant shift upward and to the right, successfully moving the workloads from memory-bound to compute-bound.
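To make the out-of-order issue mechanism concrete, here is a toy scoreboard model: a dependency table records the last micro-op that writes each buffer, and a micro-op issues to its function-unit queue once its producers have issued and the queue has a free slot, letting independent work overtake a stalled unit. The μOP fields, unit names, queue depths, and the retire policy are all invented for illustration and do not reflect the real Novella-NPU encoding.

```python
from collections import deque, namedtuple

# A micro-op targets one function unit and names the buffers it reads and writes.
UOp = namedtuple("UOp", ["uid", "unit", "reads", "writes"])


class DependencyTable:
    """Remembers, for every buffer, the last micro-op that writes it."""

    def __init__(self):
        self.last_writer = {}

    def deps_of(self, uop):
        return {self.last_writer[b] for b in uop.reads if b in self.last_writer}

    def record(self, uop):
        for b in uop.writes:
            self.last_writer[b] = uop.uid


def issue_out_of_order(program, queue_sizes):
    """Issue micro-ops to per-unit queues as soon as their producers have issued.

    queue_sizes maps a unit name to its queue depth (cf. the 8/8/16/16 setting).
    Returns the issue order, which may differ from program order.
    """
    table = DependencyTable()
    deps = {}
    for u in program:                        # build dependencies in program order
        deps[u.uid] = table.deps_of(u)
        table.record(u)

    queues = {unit: deque(maxlen=n) for unit, n in queue_sizes.items()}
    issued, order, pending = set(), [], list(program)
    while pending:
        progress = False
        for u in list(pending):
            q = queues[u.unit]
            if deps[u.uid] <= issued and len(q) < q.maxlen:
                q.append(u.uid)
                issued.add(u.uid)
                order.append(u.uid)
                pending.remove(u)
                progress = True
        if not progress:                     # all candidate queues full: retire one entry each
            for q in queues.values():
                if q:
                    q.popleft()
    return order


if __name__ == "__main__":
    prog = [
        UOp(0, "DMA",  [],         ["ifmap0"]),
        UOp(1, "CONV", ["ifmap0"], ["fm0"]),
        UOp(2, "CONV", ["fm0"],    ["fm1"]),     # fills the (tiny) CONV queue
        UOp(3, "CONV", ["fm1"],    ["fm2"]),     # stalls behind the full CONV queue
        UOp(4, "DMA",  [],         ["ifmap1"]),  # independent: issues ahead of uop 3
        UOp(5, "EW",   ["fm0"],    ["fm3"]),     # independent unit: also issues ahead
    ]
    print(issue_out_of_order(prog, {"DMA": 2, "CONV": 2, "EW": 2}))
    # -> [0, 1, 2, 4, 5, 3]: uops 4 and 5 overtake the stalled convolution uop 3
```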
To lower the barrier to model deployment, this study introduces CTU-core support for Microsoft Floating Point (MSFP), integrating an Alignment Unit before the MAC array and an exponent/mantissa double buffer within the CTU. By processing BF16 inputs against MSFP12/16 weights, together with compiler-assisted weight reordering and dequantization, users can deploy floating-point models without complex integer quantization. Simulation results indicate that under MSFP16, MobileNet V2 maintains a Top-1 accuracy of 71.5%, nearly identical to FP32, while MSFP12 shows a greater accuracy drop because the model is small. Bit-match verification against the Instruction Set Simulator and Algorithm Simulator confirms datapath correctness and quantifies the accuracy impact, laying a solid foundation for future mixed-format computation.
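As a rough picture of what such a flow checks at the accuracy level, the sketch below emulates BF16 activations by truncating float32 mantissas, round-trips the weights through a condensed shared-exponent quantizer like the one above, and reports the output error of the resulting matrix product relative to FP32. The block size, mantissa widths, tensor shapes, and helper names are assumptions for illustration, not the thesis' simulator interfaces.

```python
import numpy as np


def to_bf16(x):
    """Emulate BF16 by zeroing the low 16 mantissa bits of each float32 value."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)


def msfp_roundtrip(w, mant_bits, block=16):
    """Quantize and immediately dequantize weights with one shared exponent per
    block (a condensed version of the encode/decode sketch above)."""
    flat = w.reshape(-1, block).astype(np.float32)
    max_mag = np.max(np.abs(flat), axis=1, keepdims=True)
    exp = np.floor(np.log2(np.maximum(max_mag, np.finfo(np.float32).tiny)))
    scale = 2.0 ** (exp + 1 - mant_bits)
    q = np.clip(np.round(flat / scale), -(2 ** mant_bits - 1), 2 ** mant_bits - 1)
    return (q * scale).astype(np.float32).reshape(w.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    act = rng.standard_normal((32, 256)).astype(np.float32)   # activations (BF16-emulated)
    w = rng.standard_normal((256, 128)).astype(np.float32)    # FP32 weights to be quantized
    ref = act @ w                                              # FP32 reference output
    for bits, name in [(7, "BF16 x MSFP16-like"), (3, "BF16 x MSFP12-like")]:
        out = to_bf16(act) @ msfp_roundtrip(w, bits)
        rel_err = np.linalg.norm(out - ref) / np.linalg.norm(ref)
        print(f"{name}: relative output error = {rel_err:.4%}")
```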