
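Graduate student: Tsui, Che-Wei (崔哲瑋)
Thesis title: Reconfigurable Accelerator for Variable Size Transformer
Advisor: Kuo, Chih-Hung (郭致宏)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2026
Graduation academic year: 114 (ROC calendar)
Language: Chinese
Number of pages: 66
Keywords (translated from Chinese): Transformer network, nested instruction set architecture, systolic array, reconfigurable accelerator
Keywords (English): Systolic array, Transformer model, Nested-loop instruction set architecture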
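  • Attention models often require so many instructions that they cannot all be held in on-chip memory, so the accelerator spends additional time loading them from external memory. To remove this bottleneck, this work exploits the repetitive computation pattern of attention models and proposes a customized instruction set architecture under which the instruction count no longer grows in proportion to model size. Experimental results show that, compared with the unoptimized design, the instruction storage required for the ViT-Huge model is reduced by a factor of more than about 300, with a performance improvement of about 1.5 times.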
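    The storage argument above can be illustrated with a minimal sketch. It assumes a toy program representation that is not taken from the thesis: the opcode names, the `Op` and `Loop` classes, and the cost model (one stored instruction per operation, one per loop header) are all hypothetical. The layer and head counts are the published ViT-Base (12 layers, 12 heads) and ViT-Huge (32 layers, 16 heads) configurations. The sketch only shows why a loop-style encoding stops growing with model size; it does not reproduce the thesis's measured reduction.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Op:
    name: str  # hypothetical opcode, for illustration only


@dataclass(frozen=True)
class Loop:
    count: int                             # number of iterations
    body: Tuple[Union["Op", "Loop"], ...]  # instructions repeated each iteration


def flat_count(node):
    """Instructions stored when every loop iteration is emitted separately."""
    if isinstance(node, Op):
        return 1
    return node.count * sum(flat_count(n) for n in node.body)


def nested_count(node):
    """Instructions stored when a loop is one header plus its body stored once."""
    if isinstance(node, Op):
        return 1
    return 1 + sum(nested_count(n) for n in node.body)


def encoder_program(layers, heads):
    """Toy Transformer encoder: a loop over layers containing a loop over heads."""
    per_head = Loop(heads, (Op("QK_MATMUL"), Op("SOFTMAX"), Op("AV_MATMUL")))
    per_layer = (
        Op("LAYERNORM"), Op("QKV_PROJ"), per_head, Op("OUT_PROJ"),
        Op("LAYERNORM"), Op("MLP_FC1"), Op("GELU"), Op("MLP_FC2"),
    )
    return Loop(layers, per_layer)


vit_base = encoder_program(layers=12, heads=12)
vit_huge = encoder_program(layers=32, heads=16)
print(flat_count(vit_base), nested_count(vit_base))  # 516 12
print(flat_count(vit_huge), nested_count(vit_huge))  # 1760 12
```

    In this toy model the flat encoding grows with the number of layers and heads, while the nested encoding stays at 12 stored instructions for both model sizes, because only the loop counts change.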

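    In addition, this thesis proposes a dedicated memory address generation unit that efficiently manages memory addresses in attention models, so that the limited on-chip memory is used more efficiently. The unit stores only a small amount of model information, yet it autonomously adjusts the memory allocation during computation to avoid data conflicts, which increases the parallelism of computation on a single unified SRAM. Experimental results show that, when the instruction generation strategy is combined with the memory addressing optimization, the design achieves up to about 3.4 times higher performance than existing accelerator architectures.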

    In this work, we present an instruction-driven accelerator tailored for Transformer-based models. Traditional reconfigurable accelerators rely on customized instructions to define each computational function or operation, which causes the instruction count to scale with model size. For large-scale models such as Vision Transformers (ViTs), an unoptimized instruction generation strategy produces an excessive number of instructions, which require a large amount of on-chip memory and frequent instruction loading from external memory. To address this challenge, we exploit the repetitive structure inherent in attention-based architectures and introduce a reusable instruction generation strategy inspired by nested for-loops. This approach significantly reduces the total number of instructions. Experimental results demonstrate that our method reduces the instruction count for ViT-Huge by a factor of 308. Additionally, unlike prior designs that allocate separate on-chip SRAMs for different computations, our accelerator adopts a unified on-chip SRAM for input data to avoid low memory utilization. To handle memory allocation under this design and prevent data conflicts between computations, we introduce a dedicated Address Generation Unit (AGU) for efficient memory management in large-matrix tiling and nonlinear operations. This module stores only a small amount of known model information and autonomously schedules on-chip memory allocation to enhance storage utilization and minimize redundant off-chip data accesses. Our AGU module achieves up to a 1.7× improvement in frames per second (FPS) on the ViT-Huge model.
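    As a rough illustration of the kind of bookkeeping an address generation unit performs for tiled matrices in a unified memory, the following sketch may help. It is not the thesis's AGU design: the function names `allocate` and `tile_addresses`, the word-addressed row-major layout, and the simple back-to-back region placement are all assumptions made for this example.

```python
def allocate(regions, capacity):
    """Place named regions back to back in one memory; return their base addresses.

    regions:  list of (name, size_in_words) pairs, in placement order
    capacity: total words available in the (hypothetical) unified SRAM
    """
    bases, cursor = {}, 0
    for name, size in regions:
        if cursor + size > capacity:
            raise MemoryError(f"region {name!r} does not fit in the SRAM")
        bases[name] = cursor  # regions never overlap, so operands cannot collide
        cursor += size
    return bases


def tile_addresses(base, rows, cols, tile_r, tile_c):
    """Yield (tile_row, tile_col, addresses) for a row-major matrix stored at base."""
    for r0 in range(0, rows, tile_r):
        for c0 in range(0, cols, tile_c):
            addrs = [
                base + r * cols + c
                for r in range(r0, min(r0 + tile_r, rows))  # clip edge tiles
                for c in range(c0, min(c0 + tile_c, cols))
            ]
            yield r0 // tile_r, c0 // tile_c, addrs


# Two operands share one 32-word memory; B is a 4x6 matrix read in 2x3 tiles.
bases = allocate([("A", 8), ("B", 24)], capacity=32)
for tr, tc, addrs in tile_addresses(bases["B"], rows=4, cols=6, tile_r=2, tile_c=3):
    print(tr, tc, addrs)
```

    Only the matrix shape, the tile shape, and one base address per operand are needed to derive every tile's addresses, which is in the spirit of a unit that keeps a small amount of model information instead of a stored address list.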

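    Chinese Abstract
    Extended English Abstract
    Acknowledgments
    Chapter 1 Introduction
      1-1 Preface
      1-2 Research Motivation
      1-3 Research Contributions
      1-4 Thesis Organization
    Chapter 2 Background and Related Research
      2-1 Transformer Model
        2-1-1 Vision Transformer
        2-1-2 Swin Transformer
      2-2 AI Accelerator Hardware Design
        2-2-1 Systolic Array
        2-2-2 Customized Instruction Set Architecture Design (Customized ISA)
    Chapter 3 Literature Review of Vision Transformer Hardware Accelerators
      3-1 Systolic-Array-Based Hardware Accelerator Architectures
        3-1-1 In-Datacenter Performance Analysis of a Tensor Processing Unit
        3-1-2 Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer
      3-2 Accelerator Architectures for Vision-Task Network Models
        3-2-1 Accelerators Designed for Vision Transformers
        3-2-2 Accelerators Designed for Swin Transformers
        3-2-3 ME-ViT
      3-3 Analysis and Comparison of Related Work
    Chapter 4 Highly Reconfigurable Hardware AI Accelerator Designed for Transformer Models
      4-1 Functions of the Accelerator Modules
      4-2 High-Efficiency Instruction Generation Strategy
        4-2-1 Drawbacks of Unoptimized Instruction Generation
        4-2-2 Loop Structure of the Vision Transformer Model
        4-2-3 Optimization Flow Based on the Nested-Loop Structure of Transformers
      4-3 On-Chip Memory Design and Planning
      4-4 Memory Address Generation Unit
        4-4-1 AGU Architecture
        4-4-2 Optimization of MLP Computation and Its Advantages and Disadvantages
      4-5 Accelerator Instruction Set Architecture
    Chapter 5 Experimental Environment and Result Analysis
      5-1 Experimental Environment
      5-2 FPGA Resource Usage and Performance
      5-3 Performance Comparison Across Vision Transformer Model Sizes
      5-4 Instruction Optimization Results of the Nested Instruction Set Architecture
      5-5 Comparison with the ISA Designs of Reference Papers
      5-6 Evaluation and Comparison of AGU Performance Gains
        5-6-1 MLP Block Computation Before and After Optimization
        5-6-2 Comparison of Unified SRAM and Separate SRAM Architectures
      5-7 Comparison with Related Works
    Chapter 6 Conclusion and Future Work
      6-1 Conclusion
      6-2 Future Work
    References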

