| Graduate Student: | 張仁謙 Chang, Jen-Chien |
|---|---|
| Thesis Title: | 搭載裝置端DDR之GPU架構下大型語言模型推論的頻寬感知選擇性卸載 Bandwidth-Aware Selective Offloading for LLM Inference on GPU Architectures with Device-Attached DDR |
| Advisor: | 謝明得 Shieh, Ming-Der |
| Co-Advisor: | 林偉棻 Lin, Wei-Fen |
| Degree: | 碩士 Master |
| Department: | 敏求智慧運算學院 - 智慧運算碩士學位學程 MS Degree in Intelligent Computing |
| Publication Year: | 2026 |
| Graduation Academic Year: | 114 |
| Language: | English |
| Pages: | 205 |
| Keywords (Chinese): | 大型語言模型推論、異質記憶體架構、資料卸載 |
| Keywords (English): | LLM Inference, Heterogeneous Memory Architecture, Data Offloading |
The growing scale of Large Language Models (LLMs) has intensified GPU memory pressure, as model weights and Key-Value (KV) caches compete for limited High Bandwidth Memory (HBM) capacity. This competition constrains batch sizes during the memory-bound autoregressive decoding phase, directly limiting inference throughput. Emerging AI accelerator architectures that integrate device-attached DDR memory narrow the bandwidth gap between primary and secondary memory tiers from approximately 50× (PCIe) to 7-17×, creating an opportunity for latency-hidden data streaming that prior PCIe-based offloading approaches cannot fully exploit.
This thesis proposes a bandwidth-aware selective offloading methodology for online LLM serving, treating device-attached DDR as an active component of the inference pipeline. The methodology offloads a bandwidth-constrained subset of model weights or KV cache entries to DDR, freeing HBM capacity for additional KV cache allocation and larger batch sizes. A double-buffered streaming architecture overlaps DDR-to-HBM data transfers with GPU computation via the GPU's Copy Engines, and a closed-form bandwidth budget ensures each transfer completes before the consuming kernel needs its data, avoiding kernel-launch stalls. The net HBM freed after accounting for streaming-buffer overhead (∆HBM) is identified and, across the tested offloading strategies, its ordering matches the observed throughput ordering in simulation.
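The budget and ∆HBM accounting above can be sketched as follows. This is an illustrative sketch, not the thesis's exact expressions: the function names and the simple bandwidth-times-iteration-time form of the budget are assumptions.

```python
GIB = 2**30  # bytes per GiB

def bandwidth_budget_bytes(ddr_bw_gib_s: float, iter_time_s: float) -> float:
    """Upper bound on bytes streamable from DDR per decode iteration:
    the transfer must finish within one iteration of overlapped compute."""
    return ddr_bw_gib_s * GIB * iter_time_s

def net_hbm_freed(offloaded_bytes: int, chunk_bytes: int) -> int:
    """Net HBM capacity reclaimed (∆HBM): bytes moved to DDR minus the
    two staging buffers a double-buffered stream keeps resident in HBM."""
    return offloaded_bytes - 2 * chunk_bytes
```

For example, a 20 ms decode iteration at DDR 100 GiB/s budgets roughly 2 GiB of streamed data per iteration; offloading 10 GiB in 1 GiB chunks then nets 8 GiB of freed HBM after the two staging buffers.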
Three offloading strategies are analyzed: whole-weight (WW), partial-weight (PW), and KV cache offloading. An event-driven GPU performance simulator, validated against NVIDIA A100 40GB PCIe hardware at 7.16% mean absolute error on HBM-only operators (composed transformer-block errors of -11.24% to +2.56%; the DDR path is validated only indirectly, pending hardware availability for end-to-end validation), enables systematic design space exploration of the hypothetical HBM+DDR architecture. All results are simulation-based, evaluated on two models (Llama 2-13B and Qwen2.5-14B, spanning MHA and GQA attention architectures) on a single GPU (A100 40GB), and limited to the decoding phase. The methodology applies to dense attention architectures (MHA, GQA) under the assumption that per-iteration kernel execution time is predictable at design time. Two probes on a Mixture-of-Experts model characterize the boundary for input-dependent routing: when the expected expert-activation density is well below the all-active assumption, the realized iteration is too short to hide the static DDR transfer, and the resulting kernel-launch stalls offset the capacity benefit; at sufficiently large batch sizes, the realized iteration accommodates the transfer and the methodology returns to net-positive throughput. An activation-aware budget extension is identified as future work.
Six experiments validate all nine analytical predictions against transaction-level simulation, which captures contention effects beyond the analytical model's granularity; C1 (zero kernel-launch stalls) is conditional on sufficient HBM bandwidth headroom. Reported sub-percent inter-strategy differences (e.g., streaming overhead) lie below the validated composed-block error and should be read as directional rather than as precise magnitudes. Under capacity-saturated operation with sufficient HBM bandwidth headroom, throughput measured per decoding iteration (one new token per sequence) improves by 10-18% at DDR 100 GiB/s and by 22-29% at DDR 200 GiB/s using strategies deployable on existing GPU software stacks (whole-weight, KV cache). Per-operator PW+KV offloading at the full bandwidth budget reaches a higher projected ceiling of up to 36% at DDR 200 GiB/s; this ceiling depends on either VMM remapping, whose per-swap host-device synchronization overhead is unmeasured on real hardware in this thesis, or custom multi-pointer GEMM kernels not available in current production libraries. Under the tested configurations, the PW and KV cache strategies achieve higher ∆HBM than WW at lower DDR bandwidths owing to their finer buffer granularity, while all strategies converge once the bandwidth budget is large enough to equalize ∆HBM. Streaming overhead is strategy-independent and indistinguishable from zero within the simulator's resolution under sufficient HBM bandwidth headroom; write-back overhead from newly generated KV cache entries is similarly negligible at the tested batch sizes.
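The double-buffered pipeline whose streaming overhead is measured above follows a standard prefetch pattern: while the kernel consumes one buffer, the copy engine fills the other, and the roles swap each iteration. A toy Python model of that pattern (with a synchronous stand-in for the asynchronous copy-engine transfer; all names are hypothetical):

```python
def stream_chunks(chunks, compute):
    """Toy double-buffered pipeline: prefetch chunk i+1 into the idle
    buffer while `compute` consumes chunk i from the active buffer."""
    results = []
    buf = {0: None, 1: None}        # two staging buffers resident in HBM
    if chunks:
        buf[0] = chunks[0]          # prime the first buffer (initial copy)
    for i in range(len(chunks)):
        if i + 1 < len(chunks):
            buf[(i + 1) % 2] = chunks[i + 1]  # async DDR->HBM copy in the real system
        results.append(compute(buf[i % 2]))   # kernel consumes the active buffer
    return results
```

Because the prefetch of chunk i+1 overlaps the compute on chunk i, the stream adds no stall time as long as each copy finishes within one iteration, which is exactly what the bandwidth budget enforces.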
A six-step practitioner decision framework, expressed in dimensionless parameters, enables system designers to determine whether and how to adopt DDR-based streaming for a given hardware-model configuration.
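A go/no-go check in the spirit of that framework might look like the following. The two dimensionless parameters and the viability conditions are illustrative assumptions, not the thesis's actual six steps:

```python
def ddr_streaming_viable(hide_ratio: float, delta_hbm_ratio: float) -> bool:
    """hide_ratio: DDR transfer time / decode iteration time; must stay
    at or below 1 so the stream hides under compute without stalls.
    delta_hbm_ratio: net HBM freed (∆HBM) / HBM capacity; must be
    positive for offloading to buy any additional batch capacity."""
    return hide_ratio <= 1.0 and delta_hbm_ratio > 0.0
```

A full framework would add steps such as checking HBM bandwidth headroom and the per-strategy buffer granularity, but the two ratios above capture the core trade-off the thesis describes: transfers must hide under compute, and the reclaimed capacity must be net positive.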