| Graduate Student: | 周宸葳 Chou, Chen-Wei |
|---|---|
| Thesis Title: | 以FPGA實現一使用MNIST資料集之心脈陣列VAE解碼器 An FPGA Implementation of a Systolic Array VAE Decoder Using the MNIST Dataset |
| Advisors: | 陳進興 Chen, Chin-Hsing; 張名先 Chang, Ming-Xian |
| Degree: | Master (碩士) |
| Department: | 電機資訊學院 電腦與通信工程研究所 Institute of Computer & Communication Engineering |
| Publication Year: | 2025 |
| Graduation Academic Year: | 114 |
| Language: | English |
| Pages: | 54 |
| Keywords (Chinese): | 現場可規劃邏輯電路(FPGA)、變分自動編碼器(VAE)、亞像素卷積(Sub-Pixel Convolution)、心脈陣列(Systolic Array)、RS232 |
| Keywords (English): | FPGA, VAE, Sub-Pixel Convolution, Systolic Array, RS232 |
With the rapid development of artificial intelligence, deep learning models have become increasingly complex, placing ever higher demands on hardware computing capability, and high-performance computing chips have grown correspondingly important. To address this need, this thesis implements the decoder of a Variational Autoencoder (VAE) on a Field-Programmable Gate Array (FPGA) using a systolic array architecture.
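The systolic-array acceleration mentioned above can be illustrated with a small cycle-level model in Python. This is a sketch of a generic output-stationary dataflow, not the thesis's actual RTL design; the array size and scheduling here are illustrative assumptions.

```python
def systolic_matmul(A, B):
    """Toy cycle-level model of an output-stationary systolic array.

    PE (i, j) accumulates C[i][j]. Operands of A stream in from the left
    and operands of B from the top, each skewed by one cycle per
    row/column, so the pair (A[i][s], B[s][j]) reaches PE (i, j) at
    cycle t = s + i + j.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    # Run until the last skewed operand pair has reached PE (n-1, m-1).
    for t in range(k + n + m - 2):
        for i in range(n):
            for j in range(m):
                s = t - i - j  # operand index arriving at PE (i, j) this cycle
                if 0 <= s < k:
                    C[i][j] += A[i][s] * B[s][j]
    return C
```

Each PE performs one multiply-accumulate per cycle with only local (nearest-neighbor) data movement, which is the property the hardware exploits to raise throughput at low power.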
For the network architecture, this work replaces the transposed convolution commonly used in VAE models with sub-pixel convolution for image upscaling and reconstruction. The designed VAE decoder consists of one fully connected layer, two sub-pixel convolution layers, and a final convolution layer. Matrix operations within these layers are accelerated by the systolic array, improving computational throughput and reducing power consumption to achieve efficient hardware acceleration.
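The depth-to-space rearrangement at the core of a sub-pixel convolution layer can be sketched in plain Python. The channel ordering below follows the common convention of Shi et al. (the one PyTorch's `PixelShuffle` uses); the thesis's exact layer sizes are not reproduced here.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) tensor (nested lists) into (C, r*H, r*W).

    Depth-to-space mapping: out[c][r*h + a][r*w + b] = x[c*r*r + a*r + b][h][w].
    """
    cr2, H, W = len(x), len(x[0]), len(x[0][0])
    C = cr2 // (r * r)
    out = [[[0] * (r * W) for _ in range(r * H)] for _ in range(C)]
    for c in range(C):
        for a in range(r):          # sub-pixel row offset
            for b in range(r):      # sub-pixel column offset
                ch = c * r * r + a * r + b
                for h in range(H):
                    for w in range(W):
                        out[c][r * h + a][r * w + b] = x[ch][h][w]
    return out
```

A sub-pixel convolution layer is then an ordinary convolution producing C·r² output channels followed by this rearrangement, which avoids the zero-insertion overhead that makes transposed convolution costly in hardware.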
For data transmission, the trained weights are transferred from the PC to the FPGA through an RS232 module. The FPGA then performs the decoding operations in hardware, and the reconstructed image is sent back to the PC via RS232 for display on a MATLAB GUI. This design realizes a complete system workflow and verifies the feasibility of FPGA–PC interaction.
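Byte-level serialization of the trained weights for the RS232 transfer might look like the following sketch. The signed 16-bit Q7.8 fixed-point format and little-endian byte order are illustrative assumptions, not the thesis's documented protocol.

```python
import struct

def pack_weights_q(weights, frac_bits=8):
    """Quantize floats to signed 16-bit fixed point (Q7.8 by default,
    an assumed format) and serialize little-endian for byte-wise UART TX."""
    data = bytearray()
    for w in weights:
        q = int(round(w * (1 << frac_bits)))
        q = max(-32768, min(32767, q))  # saturate to the int16 range
        data += struct.pack('<h', q)
    return bytes(data)

def unpack_weights_q(data, frac_bits=8):
    """Inverse of pack_weights_q: received bytes back to float weights."""
    return [q / (1 << frac_bits) for (q,) in struct.iter_unpack('<h', data)]
```

On the PC side, such a byte stream could be written to the serial port (e.g. with pyserial) at the UART's configured baud rate, and the FPGA would reassemble each pair of bytes into one fixed-point weight.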
In the software training stage, this thesis compares VAE models built with sub-pixel convolution against those with transposed convolution. The two differ by only 0.02 dB in the average PSNR of the generated images, while the sub-pixel model reduces floating-point operations (FLOPs) by 71.5%. For hardware inference, the design completes one inference in about 47 ms at a 50 MHz clock, corresponding to about 21 FPS. This figure covers computation only, excluding the UART transmission of the output image to the PC, indicating that the computation itself is fast enough for real-time image generation. Sub-pixel convolution layers therefore offer clear advantages for hardware implementation and point to a feasible direction for the hardware acceleration of deep generative models.
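The average-PSNR comparison uses the standard definition, sketched below for images given as flat lists of pixel values (peak value assumed normalized to 1.0):

```python
import math

def psnr(ref, test, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length images."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, test)) / len(ref)
    if mse == 0:
        return float('inf')  # identical images
    return 10 * math.log10(peak ** 2 / mse)
```

A 0.02 dB gap in average PSNR is perceptually negligible, which is why the 71.5% reduction in FLOPs dominates the hardware trade-off.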