
Graduate Student: Lo, Wei-Lung (羅威龍)
Thesis Title: Compiler-Hardware Co-optimization and MSFP-Supported Compute Core Design for Novella-NPU
Advisor: Chen, Chung-Ho (陳中和)
Degree: Master
Department: Academy of Innovative Semiconductor and Sustainable Manufacturing, Program on Integrated Circuit Design
Year of Publication: 2025
Graduation Academic Year: 113 (ROC calendar)
Language: Chinese
Number of Pages: 166
Chinese Keywords: Layer Fusion, NPU Front-end Architecture Design, MSFP-Format Compute Core
English Keywords: Layer Fusion, NPU Front-end Design, MSFP-Supported Compute Core
  • Using Novella-NPU as its platform, this thesis proposes a hardware-software co-optimization flow spanning the model compiler to the hardware microarchitecture. First, static DRAM feature-map allocation and compile-time garbage collection are added to the TVM compiler backend, lowering the peak external memory demand during inference. Next, automatically derived feature-map tiling maximizes the compute tile size and improves data reuse even under constrained SRAM. Layer fusion then chains convolution, planar, and element-wise operations according to the hardware data-path constraints, cutting external memory accesses by 46.7% on average; MobileNet V2 inference rises from 48.79 to 101.3 FPS (doubled performance) and DeiT-tiny inference FPS improves by 17.8%. To coordinate the multiple compute units, the central controller adopts a four-stage pipeline with a μOP queue plus dependency table, supporting out-of-order issue across functional units; with queue sizes of 8/8/16/16 it attains both the highest parallelism and the shortest execution time, accelerating MobileNet V2 by about 20% and DeiT-tiny by 65% relative to in-order issue. Roofline-model analysis shows that the fused workloads move markedly up and to the right, shifting them from memory-bound toward compute-bound.
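    The out-of-order issue mechanism can be pictured with a small sketch. The Python fragment below only illustrates the queue-plus-dependency-table idea under assumed names (UOp, ready, issue_cycle) and an assumed read/write hazard model; it is not the thesis's actual controller design.

```python
# Minimal sketch of out-of-order uOP issue with per-unit queues and a
# dependency table. All names and the hazard model are illustrative
# assumptions, not the thesis's actual controller microarchitecture.
from collections import deque
from dataclasses import dataclass

@dataclass
class UOp:
    unit: str                                   # MLU / CTU / PPU / RAU
    reads: frozenset = frozenset()              # buffers this uOP consumes
    writes: frozenset = frozenset()             # buffers this uOP produces

# Queue sizes from the best configuration reported above: 8/8/16/16.
queues = {"MLU": deque(maxlen=8), "CTU": deque(maxlen=8),
          "PPU": deque(maxlen=16), "RAU": deque(maxlen=16)}
pending_writes = set()                          # buffers with in-flight producers

def ready(uop):
    # RAW/WAW hazard check against the dependency table: no source or
    # destination buffer may have an outstanding writer.
    return not (set(uop.reads) | set(uop.writes)) & pending_writes

def issue_cycle():
    """Issue the oldest READY uOP in each queue; younger ready uOPs may
    bypass older blocked ones -- the out-of-order part."""
    issued = []
    for unit, q in queues.items():
        for uop in list(q):
            if ready(uop):
                q.remove(uop)
                pending_writes.update(uop.writes)
                issued.append(uop)
                break                           # one issue per unit per cycle
    return issued

def complete(uop):
    # On write-back, clear the buffers this uOP produced.
    pending_writes.difference_update(uop.writes)
```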
    To lower the barrier to model deployment, this study further proposes a CTU core that supports Microsoft Floating Point (MSFP), adding an Alignment Unit in front of the MAC array and exponent/mantissa double buffers inside the CTU. The core takes BF16 inputs and computes against MSFP12/16 weights; combined with the compiler's weight-reordering and dequantization flow, users can deploy floating-point models directly without laborious integer quantization. Simulation shows that under MSFP16, MobileNet V2 retains a Top-1 accuracy of 71.5%, nearly identical to FP32, whereas MSFP12 suffers a larger accuracy drop because the model is small. The hardware behavior has been bit-match verified on the Instruction Set Simulator and Algorithm Simulator, confirming data-path correctness and quantifying the accuracy loss, and laying the groundwork for future mixed-format computation.
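    For intuition about the MSFP format, the following sketch encodes one weight block in a block-floating-point style: a shared exponent derived from the largest magnitude plus a narrow signed mantissa per element. The block size, exponent choice, and mantissa widths here are assumptions for illustration, not the exact MSFP12/16 bit layouts.

```python
# Minimal block-floating-point encoder in the spirit of MSFP, for
# intuition only. Assumptions not taken from the thesis: block size 16,
# one shared exponent per block, and a parameterizable signed mantissa
# width (MSFP12 vs. MSFP16 differ in per-element mantissa bits).
import numpy as np

def to_bfp(block, mantissa_bits):
    """Encode a 1-D float block as (shared exponent, integer mantissas)."""
    max_abs = float(np.max(np.abs(block)))
    if max_abs == 0.0:
        return 0, np.zeros(len(block), dtype=np.int32)
    shared_exp = int(np.floor(np.log2(max_abs)))
    # Grid step chosen so the largest magnitude stays inside the signed
    # mantissa range [-(2^(m-1)), 2^(m-1) - 1].
    step = 2.0 ** (shared_exp - mantissa_bits + 2)
    q = np.clip(np.round(block / step),
                -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1)
    return shared_exp, q.astype(np.int32)

def from_bfp(shared_exp, mantissas, mantissa_bits):
    """Decode back to float, e.g. for accuracy comparison against FP32."""
    return mantissas * 2.0 ** (shared_exp - mantissa_bits + 2)

w = np.random.default_rng(0).standard_normal(16)   # one 16-element block
e, m = to_bfp(w, mantissa_bits=7)                  # wider mantissa -> smaller error
print(np.abs(w - from_bfp(e, m, 7)).max())
```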

    This thesis presents a holistic hardware-software co-optimization flow for Novella-NPU, spanning model compilation to microarchitecture refinement. First, the TVM compiler backend incorporates static DRAM feature-map allocation and compile-time garbage collection, significantly reducing peak external memory demand during inference. Then, automatic feature-map tiling maximizes compute tile size and data reuse even under constrained SRAM resources. Layer fusion links convolution, planar, and element-wise operations according to hardware data-path constraints, reducing external memory accesses by an average of 46.7%; this doubles MobileNet V2 inference FPS (from 48.79 to 101.3) and improves DeiT-tiny performance by 17.8%. To coordinate multiple processing units, the central controller is designed with a four-stage pipeline and a μOP queue plus dependency table, enabling out-of-order issue across functional units. With queue sizes of 8/8/16/16, this configuration achieves the highest parallelism and shortest execution time, speeding up MobileNet V2 by approximately 20% and DeiT-tiny by 65% relative to in-order issue. Roofline-model analysis shows a significant shift up and to the right, moving the workloads from memory-bound to compute-bound.
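    As an illustration of tiling under an SRAM budget, the sketch below grows a square output tile until the input tile (with kernel halo), the weights, and the output tile no longer fit. The buffer model, byte widths, and the function name largest_tile are illustrative assumptions, not Novella-NPU's actual constraint set.

```python
# Hedged sketch of feature-map tiling under an SRAM budget: grow the
# output-tile edge until input tile (with kernel halo) + weights +
# output tile exceed the budget. Sizes are illustrative assumptions.
def largest_tile(H, W, Cin, Cout, k, stride, sram_bytes, elem_bytes=2):
    best = None
    for t in range(1, min(H, W) + 1):           # square output tile edge
        in_edge = (t - 1) * stride + k          # input edge incl. halo
        need = (in_edge * in_edge * Cin         # input tile
                + k * k * Cin * Cout            # weights
                + t * t * Cout) * elem_bytes    # output tile
        if need <= sram_bytes:
            best = t                            # keep the largest that fits
        else:
            break                               # need grows monotonically
    return best

# Example: 3x3 conv, stride 1, 112x112x32 -> 16 channels, 512 KiB SRAM.
print(largest_tile(112, 112, 32, 16, 3, 1, 512 * 1024))
```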
    To lower the barrier to model deployment, this study introduces CTU-core support for Microsoft Floating Point (MSFP), integrating an Alignment Unit before the MAC array and exponent/mantissa double buffers within the CTU. By processing BF16 inputs with MSFP12/16 weights, together with compiler-assisted weight reordering and dequantization, users can deploy floating-point models directly without complex integer quantization. Simulation results indicate that under MSFP16, MobileNet V2 maintains a Top-1 accuracy of 71.5%, nearly identical to FP32, while MSFP12 shows a larger accuracy drop due to the model's small size. Bit-match verification on the Instruction Set Simulator and Algorithm Simulator confirms data-path correctness and quantifies the accuracy loss, laying a foundation for future mixed-format computation.
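    A brief sketch of why the shared weight exponent simplifies the data path: it factors out of the entire accumulation, so the MACs can run on integer mantissas and the exponent is applied once at the end. The bf16 and msfp_dot helpers below are hypothetical, and the Alignment Unit's actual per-element alignment is not modeled.

```python
# Hedged sketch of a BF16-activation x MSFP-weight dot product: the
# weight block's shared power-of-two scale factors out of the sum, so
# the MAC array can accumulate mantissa products and scale once at the
# end. Function names and the simplified alignment are assumptions.
import numpy as np

def bf16(x):
    """Truncation emulation of BF16: keep the top 16 bits of FP32."""
    v = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (v & np.uint32(0xFFFF0000)).view(np.float32)

def msfp_dot(acts, w_mant, w_step):
    """Dot product of BF16 activations with one MSFP weight block whose
    integer mantissas lie on a power-of-two grid of step w_step."""
    a = bf16(acts).astype(np.float64)
    acc = a @ w_mant.astype(np.float64)   # integer-mantissa MACs
    return acc * w_step                   # shared exponent applied once

# Usage: quantize one weight block, then compare against FP32.
rng = np.random.default_rng(1)
acts, w = rng.standard_normal(16), rng.standard_normal(16)
step = 2.0 ** (np.floor(np.log2(np.abs(w).max())) - 6)  # ~8-bit mantissas
mant = np.round(w / step)
print(msfp_dot(acts, mant, step), float(acts @ w))      # nearly equal
```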

    Abstract i
    Extended Abstract in English ii
    Acknowledgements xxxvi
    Table of Contents xxxvii
    List of Tables xl
    List of Figures xli
    Chapter 1. Introduction 1
    1.1. Motivation 1
    1.2. Challenges 2
    1.3. Contributions 2
    1.4. Thesis Organization 3
    Chapter 2. Background and Related Work 4
    2.1. Convolutional Neural Network 4
    2.1.1. Convolution Layer 4
    2.1.2. Fully Connected Layer 6
    2.1.3. Depth-wise Separable Convolution 6
    2.2. Vision Transformer 8
    2.2.1. Preprocess 9
    2.2.2. Encoder Stack 10
    2.3. Novella-NPU Microarchitecture 11
    2.3.1. Compute Units 11
    2.3.2. Operator Set 14
    2.4. Novella-NPU Instruction Set Architecture 16
    2.5. Model Compiler 17
    2.5.1. TVM Compilation Flow 18
    2.5.2. Data Structures of the TVM Intermediate Representation 18
    2.5.3. TVM Bring Your Own Codegen (BYOC) 20
    2.5.4. Relay IR Rewriter 20
    2.5.5. The Novella-NPU Model Compiler 21
    2.6. Block Floating-Point 25
    2.6.1. Conversion between Floating Point and BFP 27
    2.6.2. BFP Dot-Product Computation 27
    Chapter 3. Compiler Backend Optimization 29
    3.1. Inverted Residual Block 29
    3.2. Layer Fusion 30
    3.2.1. Scope Definition on the Relay Graph 30
    3.2.2. Hardware Constraints and Supported Scope 31
    3.2.3. Compiler Implementation of Layer Fusion 32
    3.3. Feature Map Tiling 34
    3.3.1. Design Challenges and Principles 34
    3.3.2. Tiling Strategy and Layout 36
    3.3.3. Tile Size Calculation Flow 37
    3.3.4. Overlap Calculation and Padding Handling 38
    3.3.5. Worked Example 39
    3.4. Static DRAM Feature Map Allocation 40
    3.4.1. Implementation of Static DRAM Feature Map Allocation 40
    3.4.2. Traversal and Dynamic Reclamation Strategy 42
    3.4.3. Memory-Operation-Free Allocation 43
    Chapter 4. Design and Implementation of the Central Controller 45
    4.1. Relationship among Macro Operations, Instructions, and Micro Operations 45
    4.2. Micro Operations 46
    4.2.1. MLU μOP 46
    4.2.2. CTU μOP 47
    4.2.3. PPU μOP 48
    4.2.4. RAU μOP 50
    4.3. Central Controller Architecture 50
    4.3.1. Architectural Characteristics and Design Considerations 51
    4.3.2. Behavior of the Central Controller's Four Stages 53
    Chapter 5. Experimental Results and Analysis 70
    5.1. Experimental Environment and Model Parameters 70
    5.1.1. Experimental Environment 70
    5.1.2. Benchmark Model Parameters 70
    5.2. Experimental Results 74
    5.2.1. MobileNet V2 76
    5.2.2. DeiT-tiny 83
    Chapter 6. MSFP-Supported Compute Core Design 98
    6.1. CTU Architecture Design 99
    6.1.1. Process Element 99
    6.1.2. MAC Array 101
    6.1.3. External Memory Data Layout and CTU Memory Addressing 110
    6.1.4. CTU Architecture 111
    6.1.5. Compiler Architecture for Integer Models 111
    6.1.6. Overall Architecture 114
    6.1.7. Accuracy Simulation Results 116
    Chapter 7. Conclusion and Future Work 118
    7.1. Conclusion 118
    7.1.1. Feature Map Allocation 118
    7.1.2. Layer Fusion 118
    7.1.3. Feature Map Tiling 118
    7.1.4. Out-of-Order Issue 119
    7.2. Future Work 119
    7.2.1. Alternative Tiling Strategies 119
    7.2.2. Convolution-to-Convolution Fusion and Special Function Fusion 119
    7.2.3. Supporting More MSFP Operations 119
    References 120


    Full Text Availability: On campus: open access immediately / Off campus: open access immediately