
Graduate student: Chiang, Guan-Ting (江冠霆)
Thesis title: Integrating an FPGA-Based Heterogeneous Multicore ISP Architecture with NPU Accelerators
Advisor: Hou, Ting-Wei (侯廷偉)
Degree: Master
Department: College of Engineering - Department of Engineering Science
Year of publication: 2025
Academic year of graduation: 113 (ROC calendar; 2024-2025)
Language: Chinese
Number of pages: 55
Chinese keywords: In-Storage Processing, Neural Network Accelerator, Hardware Integration, System Software Development
Foreign-language keywords: In-Storage Processing, NPU, Hardware Integration, System Software Development
    This work investigates how an in-storage processing (ISP) architecture improves neural network inference performance. Traditional ISP architectures were designed to relieve the bandwidth bottleneck between servers and secondary storage, and they handle conventional workloads well; modern AI servers, however, clearly challenge this design. Our solution is to introduce hardware accelerators for neural network inference (NPUs), specifically the Xilinx DPU and a DPU-HLS4-ML design. In the performance evaluation, two case studies measure the effect of adding the NPU: classification of quantum states, using a dataset of tens of thousands of samples, and image recognition on ImageNet, the most widely used visual-recognition benchmark. The results show that the NPU-equipped system far outperforms the version without it. Because the two cases cover both DNN and CNN models, they also demonstrate that deep learning models in general benefit from the acceleration. Across both cases, introducing the NPU greatly improves performance while occupying only about half of the FPGA's resources.

    This work explores the use of In-Storage Processing (ISP) architectures to enhance neural network inference. While traditional ISP addresses bandwidth limitations between servers and secondary storage, it struggles to meet the demands of modern AI workloads. To address this, we integrate Neural Processing Units (NPUs), specifically the Xilinx DPU and a DPU-HLS4ML design, into the ISP framework. Performance is evaluated on two case studies: quantum state classification with tens of thousands of samples, and image recognition with ImageNet. Both deep neural networks (DNNs) and convolutional neural networks (CNNs) are tested. Results show that NPU integration outperforms the baseline without NPUs, while reducing FPGA resource usage by nearly half. These findings demonstrate that incorporating NPUs significantly accelerates diverse deep learning models in ISP systems, offering a cost-efficient and resource-efficient solution for AI servers.
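The Xilinx DPU executes models quantized to 8-bit fixed point (the table of contents lists the Vitis-AI Quantizer and Compiler as the tools used). As an illustrative sketch only, not the thesis's actual code, the power-of-two symmetric quantization scheme that DPU-style fixed-point pipelines assume can be expressed as follows; the helper names `quantize_pow2` and `dequantize` are hypothetical:

```python
import numpy as np

def quantize_pow2(x, bits=8):
    # Pick a power-of-two scale 2^fp so the tensor's dynamic range fits the
    # signed integer range (assumption: symmetric, per-tensor quantization).
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    if max_abs == 0.0:
        fp = 0  # all-zero tensor: any scale works
    else:
        fp = int(np.floor(np.log2(qmax / max_abs)))  # number of fraction bits
    scale = 2.0 ** fp
    q = np.clip(np.round(x * scale), -(qmax + 1), qmax).astype(np.int8)
    return q, fp

def dequantize(q, fp):
    # Recover an approximate float tensor from the fixed-point representation.
    return q.astype(np.float32) / (2.0 ** fp)
```

For example, a tensor with maximum magnitude 1.0 gets 6 fraction bits (scale 64), so 0.5 maps to the integer 32 and round-trips exactly.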

    Abstract
    Extended Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    1. Introduction
    2. Literature Review
      2-1 In-storage Processing
      2-2 Neural Processing Unit
      2-3 Comparison with Related Work
    3. Methodology
      3-1 Overview of the Methodology
      3-2 FPGA Integration and Development Flow
      3-3 DPU-HLS4-ML Hardware Architecture
      3-4 DPU-HLS4-ML Software Architecture
        3.4.1. Linux Kernel
        3.4.2. User Level
      3-5 DPU-HLS4-ML Specification
        3.5.1. DPU-HLS4-ML
      3-6 Xilinx DPU Hardware Structure
      3-7 Xilinx DPU Software Structure
        3.7.1. Vitis-AI Quantizer and Compiler
        3.7.2. Vitis-AI Runtime (VAIR)
        3.7.3. DPU Implementation Details for Case Study 1
    4. Performance Analysis
      4-1 Case Study 1
      4-2 Case Study 2
      4-3 Evaluation Tools and Methods
      4-4 Case Study 1 Experiments
      4-5 Case Study 2 Experiments
      4-6 FPGA Resource Utilization Analysis
      4-7 Implementation Issues and Discussion
    5. Conclusion
      5-1 Results and Discussion
      5-2 Future Work
    References
    Appendix
      7-1 HLS4ML Hardware Integration
      7-2 Xilinx DPU Deep Learning Model Deployment
      7-3 Difficulties and Solutions
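The evaluation chapter above ("Evaluation Tools and Methods") compares inference performance with and without the NPU. As a hedged illustration of the kind of wall-clock latency measurement such comparisons rely on (not the thesis's actual benchmarking code; `time_inference` and `run_fn` are hypothetical names), a minimal timing harness might look like:

```python
import time

def time_inference(run_fn, n_warmup=3, n_runs=10):
    # Average the latency of an inference callable. Assumption: run_fn is any
    # zero-argument function, e.g. a wrapper around one accelerator invocation.
    for _ in range(n_warmup):
        run_fn()  # warm-up runs excluded from the measurement
    t0 = time.perf_counter()
    for _ in range(n_runs):
        run_fn()
    return (time.perf_counter() - t0) / n_runs
```

Warm-up iterations are discarded so that one-time costs (cache population, lazy initialization) do not skew the average.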

