National Cheng Kung University Master's and Doctoral Thesis System


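Graduate student:	Tsui, Che-Wei (崔哲瑋)
Thesis title:	Reconfigurable Accelerator for Variable Size Transformer (用於可變尺寸變換器之可配置化加速器)
Advisor:	Kuo, Chih-Hung (郭致宏)
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication:	2026
Academic year of graduation:	114
Language:	Chinese
Number of pages:	66
Keywords (Chinese):	Transformer network, nested instruction set architecture, systolic array, reconfigurable accelerator
Keywords (English):	Systolic array, Transformer model, Nested-loop instruction set architecture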

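Attention models often require so many instructions that the instructions cannot all be held in on-chip memory, so the accelerator must spend additional time fetching them from external memory. To address this bottleneck, this work exploits the repetitive computation pattern of attention models and proposes a customized instruction set architecture under which the instruction count no longer grows in proportion to model size. Experimental results show that, compared with the unoptimized design, the instruction storage required for the ViT-Huge model is reduced by a factor of more than about 300, with a performance improvement of about 1.5×.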
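The effect of a loop-based instruction encoding can be illustrated with a small software model. This is only a sketch, not the thesis's ISA: the opcode names, the `Op` and `Loop` classes, and the one-word cost assumed for a loop descriptor are all hypothetical, and the layer and head counts are the commonly cited ViT-Base (12 layers, 12 heads) and ViT-Huge (32 layers, 16 heads) configurations. It compares how many instructions must be stored when loops are kept as descriptors against the length of the fully unrolled stream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str  # hypothetical opcode name, for illustration only


@dataclass(frozen=True)
class Loop:
    count: int   # number of repetitions
    body: tuple  # sequence of Op or Loop items repeated `count` times


def encoder_block(heads: int) -> tuple:
    """One Transformer encoder block written as a nested-loop program."""
    per_head = (Op("QKV_MATMUL"), Op("QK_T"), Op("SOFTMAX"), Op("ATTN_V"))
    return (
        Op("LAYERNORM"),
        Loop(heads, per_head),  # the same four ops are reused for every head
        Op("PROJ"),
        Op("LAYERNORM"),
        Op("MLP_FC1"), Op("GELU"), Op("MLP_FC2"),
    )


def nested_program(layers: int, heads: int) -> tuple:
    """Whole model: one outer loop over identical encoder blocks."""
    return (Loop(layers, encoder_block(heads)),)


def stored_size(program) -> int:
    """Instructions to store: each loop costs one descriptor plus its body once."""
    total = 0
    for item in program:
        total += 1 + stored_size(item.body) if isinstance(item, Loop) else 1
    return total


def executed_size(program) -> int:
    """Length of the equivalent fully unrolled instruction stream."""
    total = 0
    for item in program:
        if isinstance(item, Loop):
            total += item.count * executed_size(item.body)
        else:
            total += 1
    return total


if __name__ == "__main__":
    for name, layers, heads in [("ViT-Base", 12, 12), ("ViT-Huge", 32, 16)]:
        prog = nested_program(layers, heads)
        print(name, "stored:", stored_size(prog), "unrolled:", executed_size(prog))
```

In this toy model the stored size is the same for both configurations, while the unrolled stream grows with the number of layers and heads. The ratios it produces are illustrative only and are not the figures reported in the thesis.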

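In addition, this thesis proposes a dedicated memory address generation unit that efficiently handles memory address management in attention models, so that the limited on-chip memory is used more efficiently. The unit stores only a small amount of model information and autonomously adjusts memory allocation during computation to avoid data conflicts, increasing the parallelism of computation on a single unified SRAM. Experimental results show that, when the instruction generation strategy is combined with the memory addressing optimization, the design achieves a performance gain of up to about 3.4× over existing accelerator architectures.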
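The idea of deriving every address from a few stored parameters can be sketched in software. The following is a minimal model under stated assumptions, not the thesis's AGU hardware: `MatrixDesc`, `allocate`, and `tile_addresses` are hypothetical names, matrices are assumed to be stored row-major in word-addressed memory, and allocation is a simple back-to-back placement. It shows how non-overlapping regions in one shared SRAM keep the tiles of different matrices from colliding.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class MatrixDesc:
    base: int  # start address in the unified SRAM, in words
    rows: int
    cols: int


def allocate(shapes: List[Tuple[int, int]], capacity: int) -> List[MatrixDesc]:
    """Place matrices back to back in one SRAM; fail if they do not fit."""
    descs, cursor = [], 0
    for rows, cols in shapes:
        size = rows * cols
        if cursor + size > capacity:
            raise MemoryError("matrices exceed the unified SRAM capacity")
        descs.append(MatrixDesc(cursor, rows, cols))
        cursor += size
    return descs


def tile_addresses(
    m: MatrixDesc, tile_rows: int, tile_cols: int
) -> Iterator[Tuple[int, int, List[int]]]:
    """Yield (tile_row, tile_col, addresses) for each tile of a row-major matrix."""
    for r0 in range(0, m.rows, tile_rows):
        for c0 in range(0, m.cols, tile_cols):
            addrs = [
                m.base + r * m.cols + c
                for r in range(r0, min(r0 + tile_rows, m.rows))  # clip edge tiles
                for c in range(c0, min(c0 + tile_cols, m.cols))
            ]
            yield r0 // tile_rows, c0 // tile_cols, addrs


if __name__ == "__main__":
    a, b = allocate([(4, 4), (2, 4)], capacity=32)
    for tile in tile_addresses(b, 2, 2):
        print(tile)
```

Only the base address, the matrix shape, and the tile shape are stored; every word address is computed on demand, which is the property the abstract attributes to the address generation unit.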

In this work, we present an instruction-driven accelerator tailored for Transformer-based models. Traditional reconfigurable accelerators rely on customized instructions to define each computational function or operation, causing the instruction count to scale with model size. For large-scale models such as Vision Transformers (ViTs), an unoptimized instruction generation strategy produces an excessive number of instructions, which require a large amount of on-chip memory and frequent instruction loading from external memory. To address this challenge, we exploit the repetitive structure inherent in attention-based architectures and introduce a reusable instruction generation strategy inspired by nested for-loops. This approach significantly reduces the total number of instructions: experimental results show that our method reduces the instruction count for ViT-Huge by a factor of 308. Additionally, unlike prior designs that allocate separate on-chip SRAMs for different computations, our accelerator adopts a unified on-chip SRAM for input data to avoid low memory utilization. To handle memory allocation under this design and prevent data conflicts between computations, we introduce a dedicated Address Generation Unit (AGU) for efficient memory management in large matrix tiling and nonlinear operations. This module stores only a small amount of known model information and autonomously schedules on-chip memory allocation to improve storage utilization and minimize redundant off-chip data accesses. Our AGU module achieves up to a 1.7× improvement in frames per second (FPS) on the ViT-Huge model.

Abstract (Chinese)

Extended Abstract (English)

Acknowledgments

Chapter 1 Introduction
1-1 Preface
1-2 Research Motivation
1-3 Research Contributions
1-4 Thesis Organization

Chapter 2 Background and Related Research
2-1 Transformer Model
2-1-1 Vision Transformer
2-1-2 Swin Transformer
2-2 AI Accelerator Hardware Design
2-2-1 Systolic Array
2-2-2 Customized Instruction Set Architecture Design (Customized ISA)

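Chapter 3 Literature Review of Vision Transformer Hardware Accelerators
3-1 Systolic-Array-Based Hardware Accelerator Architectures
3-1-1 In-Datacenter Performance Analysis of a Tensor Processing Unit
3-1-2 Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer
3-2 Accelerator Architectures for Vision Task Network Models
3-2-1 Accelerators Designed for Vision Transformers
3-2-2 Accelerators Designed for Swin Transformers
3-2-3 ME-ViT
3-3 Analysis and Comparison of Related Works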

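Chapter 4 Highly Reconfigurable Hardware AI Accelerator Designed for Transformer Models
4-1 Functions of the Accelerator Modules
4-2 Efficient Instruction Generation Strategy
4-2-1 Drawbacks of Unoptimized Instruction Generation
4-2-2 Loop-Based Structure of the Vision Transformer Model
4-2-3 Optimization Flow Based on the Nested-Loop Structure of Transformers
4-3 On-Chip Memory Design and Planning
4-4 Memory Address Generation Unit
4-4-1 AGU Architecture
4-4-2 Optimization of MLP Computation and Its Advantages and Disadvantages
4-5 Accelerator Instruction Set Architecture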

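Chapter 5 Experimental Environment and Result Analysis
5-1 Experimental Environment
5-2 FPGA Resource Usage and Performance
5-3 Performance Comparison Across Vision Transformer Model Sizes
5-4 Instruction Optimization Results of the Nested Instruction Set Architecture
5-5 Comparison with the ISA Designs of Reference Papers
5-6 Evaluation and Comparison of AGU Performance Improvement
5-6-1 Comparison Before and After MLP Block Computation Optimization
5-6-2 Comparison of Unified SRAM and Separate SRAM Architectures
5-7 Comparison with Related Works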

Chapter 6 Conclusion and Future Work
6-1 Conclusion
6-2 Future Work

References

[1] K. Marino, P. Zhang, and V. K. Prasanna, “ME-ViT: A single-load memory-efficient FPGA accelerator for vision transformers,” in 2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics (HiPC), pp. 213–223, IEEE, 2023.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention Is All You Need,” Advances in neural information processing systems, vol. 30, 2017.
[3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale,” arXiv preprint arXiv:2010.11929, 2020.
[4] S. Nag, G. Datta, S. Kundu, N. Chandrachoodan, and P. A. Beerel, “ViTA: A Vision Transformer Inference Accelerator for Edge Applications,” in 2023 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5, IEEE, 2023.
[5] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, “Cambricon: An instruction set architecture for neural networks,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 393–405, 2016.
[6] Z. Liu, P. Yin, and Z. Ren, “An efficient FPGA-based accelerator for Swin transformer,” arXiv preprint arXiv:2308.13922, 2023.
[7] Y.-C. Wu, C.-H. Kuo, and C.-W. Tsui, “Save: Systolic array-based accelerator for vision transformer with efficient tiling strategy,” in 2025 International VLSI Symposium on Technology, Systems and Applications (VLSI TSA), pp. 1–4, IEEE, 2025.
[8] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022, 2021.
[9] S. Lu, M. Wang, S. Liang, J. Lin, and Z. Wang, “Hardware accelerator for multi-head attention and position-wise feed-forward in the transformer,” in 2020 IEEE 33rd International System-on-Chip Conference (SOCC), pp. 84–89, IEEE, 2020.
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, 2019.
[11] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th annual international symposium on computer architecture, pp. 1–12, 2017.
[12] S.-Y. Kung, “VLSI array processors,” IEEE ASSP Magazine, vol. 2, no. 3, pp. 4–22, 1985.
