
Author: Chen, Jin-Lin (陳錦麟)
Title: Design of a Storage-Oriented Heterogeneous Edge Computing Platform Extended with LLVM OpenMP Offloading
Advisor: Hou, Ting-Wei (侯廷偉)
Degree: Master
Department: Department of Engineering Science, College of Engineering
Year of publication: 2025
Graduation academic year: 113 (ROC calendar, i.e. 2024-2025)
Language: Chinese
Pages: 66
Keywords: Near-Data Processing (NDP), In-Storage Processing (ISP), Computational Storage Device (CSD), OpenMP, OpenMPI, LLVM
Near-Data Processing (NDP) is widely regarded as a promising way to address the "memory wall" and the data-movement bottleneck in data-intensive applications. Prior work has mostly focused on expanding the memory attached to compute units or on embedding fixed accelerator functions inside storage devices, and lacks a development framework that is both general-purpose and flexible. To address this challenge, this study proposes an OpenMP-based development interface that lets developers freely distribute work between host and device, thereby validating the feasibility of a general-purpose, storage-oriented heterogeneous computing device.
To support device-side offloading, the system extends LLVM's OpenMP infrastructure: a custom RTL plugin, libomptarget.rtl.sylph, is added under libomptarget to implement device discovery, memory allocation, data transfer, and kernel-launch interfaces for the target device, with a corresponding server-side program running on the device. On the toolchain side, a custom ToolChain/Linker packages the target-side objects and their dependencies into an offloading image and embeds its descriptor in the host binary; at run time the RTL parses this descriptor and loads the image. The design preserves the data-mapping and parallelism semantics of the OpenMP specification while adopting OpenMPI as the transport layer, so existing OpenMP programs can be offloaded to the target device with minimal modification.
This study successfully built an OpenMP offloading mechanism for the device and validated system performance through multiple experiments. The results show that the architecture effectively reduces the volume of data transferred between host and device, realizing the advantage of near-data processing. They also reveal the system's limitations: slow NAND Flash access becomes a performance bottleneck, and insufficient device-side compute power leads to long total execution times. Further tests show that device-side performance improves markedly when an NPU executes suitable applications, demonstrating the importance of heterogeneous accelerator units to the overall architecture.

    This thesis investigates the design of a storage-oriented heterogeneous computing platform built upon LLVM OpenMP offloading, targeting the emerging demands of Near-Data Processing (NDP). Traditional computing architectures suffer from severe inefficiencies caused by the “memory wall” and excessive host–device data transfers. Although accelerators such as GPUs or TPUs provide dedicated memory, their effectiveness diminishes once datasets surpass memory capacity. Similarly, existing Computational Storage Devices (CSDs) embed limited accelerators but often lack programmability. To address these issues, this work extends the LLVM OpenMP framework with a custom runtime and toolchain, enabling embedded storage-oriented devices to be treated as legitimate offload targets. A prototype platform based on the Nuvoton MA35D1-A1 SoC and the Milk-V Duo 256M NPU was implemented, and its feasibility was validated through multiple benchmarks. Results indicate that the proposed design reduces data transfer overhead, demonstrates the potential of heterogeneous accelerators, and provides a practical foundation for future NDP systems.

Abstract (Chinese) i
Extended Abstract ii
Acknowledgements vii
Table of Contents viii
List of Tables x
List of Figures xi
Chapter 1 Introduction 1
  1.1 Motivation 1
  1.2 Objectives 3
  1.3 Contributions 4
Chapter 2 Literature Review 6
  2.1 Memory Wall 6
  2.2 Computational Storage Device (CSD) 7
  2.3 LLVM 9
  2.4 LLVM OpenMP 12
  2.5 OpenMPI 17
Chapter 3 System Design and Implementation 19
  3.1 Hardware Architecture 19
  3.2 Software Architecture 22
  3.3 SylphToolChain 23
  3.4 Runtime Dynamic Library 25
  3.5 Application Execution Method 33
  3.6 Environment Setup 33
    3.6.1 OpenMPI Configuration 33
    3.6.2 OpenMP Configuration 35
    3.6.3 OpenSSH Configuration 35
    3.6.4 Loopback Driver Configuration 36
    3.6.5 USB Gadget NCM and Mass Storage Driver Configuration 36
Chapter 4 Experimental Results and Discussion 38
  4.1 Experimental Environment 38
  4.2 Rodinia Benchmark 40
  4.3 HeCBench Attention 43
  4.4 YOLOv8: Comparing Inference across Multiple Accelerators 46
    4.4.1 Milk-V Duo 256M 46
    4.4.2 Data Preprocessing 47
    4.4.3 Analysis of Experimental Data 47
  4.5 Experimental Conclusions 49
Chapter 5 Conclusions and Future Work 50
  5.1 Conclusions 50
  5.2 Future Research Directions 50
References 52

Available on campus: 2030-08-22
Available off campus: 2030-08-22
The electronic thesis has not yet been authorized for public release; for the print copy, consult the library catalog.