| Author: | Chen, Jin-Lin (陳錦麟) |
|---|---|
| Title: | Design of a Storage-Oriented Heterogeneous Edge Computing Platform Extended with LLVM OpenMP Offloading |
| Advisor: | Hou, Ting-Wei (侯廷偉) |
| Degree: | Master |
| Department: | College of Engineering - Department of Engineering Science |
| Year of Publication: | 2025 |
| Academic Year: | 113 |
| Language: | Chinese |
| Pages: | 66 |
| Keywords: | Near-Data Processing (NDP), In-Storage Processing (ISP), Computational Storage Device (CSD), OpenMP, OpenMPI, LLVM |
Near-Data Processing (NDP) is widely regarded as a promising answer to the "memory wall" and data-movement bottlenecks that plague data-intensive applications. Prior work has mostly focused on expanding the memory attached to compute units or embedding fixed-function accelerators inside storage devices, leaving a gap for a development framework that is both general-purpose and flexible. To address this challenge, this study proposes an OpenMP-based development interface that lets developers freely partition work between host and device, and uses it to validate the feasibility of a general-purpose, storage-oriented heterogeneous computing device.
To support device-side offloading, this system extends LLVM's OpenMP infrastructure. A custom RTL plugin, libomptarget.rtl.sylph, is added under libomptarget to implement device discovery, memory allocation, data transfer, and kernel-launch interfaces for the target device, with a corresponding server-side program provided on the device. On the toolchain side, a customized ToolChain/Linker packages the target-side objects and their dependencies into an offloading image and embeds its descriptor in the host program; at run time, the RTL parses this descriptor and loads the image. The design preserves OpenMP's data-mapping and parallelism semantics while adopting OpenMPI as the transport layer, so existing OpenMP programs can be offloaded to the target device with minimal changes.
This study successfully built an OpenMP offloading mechanism for the device and validated its performance through multiple experiments. The results show that the architecture effectively reduces host-device data transfer volume, realizing the core advantage of near-data processing. They also expose the system's limitations: slow NAND Flash access creates a performance bottleneck, and limited device-side compute capability inflates total execution time. Further tests show that device-side performance improves markedly when an NPU executes suitable applications, underscoring the importance of heterogeneous acceleration units to the overall architecture.
This thesis investigates the design of a storage-oriented heterogeneous computing platform built upon LLVM OpenMP offloading, targeting the emerging demands of Near-Data Processing (NDP). Traditional computing architectures suffer from severe inefficiencies caused by the “memory wall” and excessive host–device data transfers. Although accelerators such as GPUs or TPUs provide dedicated memory, their effectiveness diminishes once datasets surpass memory capacity. Similarly, existing Computational Storage Devices (CSDs) embed limited accelerators but often lack programmability. To address these issues, this work extends the LLVM OpenMP framework with a custom runtime and toolchain, enabling embedded storage-oriented devices to be treated as legitimate offload targets. A prototype platform based on the Nuvoton MA35D1-A1 SoC and the Milk-V Duo 256M NPU was implemented, and its feasibility was validated through multiple benchmarks. Results indicate that the proposed design reduces data transfer overhead, demonstrates the potential of heterogeneous accelerators, and provides a practical foundation for future NDP systems.
On-campus access: to be made public on 2030-08-22.