
Author: Chang, Jen-Chien (張仁謙)
Title: Bandwidth-Aware Selective Offloading for LLM Inference on GPU Architectures with Device-Attached DDR
Advisor: Shieh, Ming-Der
Co-Advisor: Lin, Wei-Fen
Degree: Master
Department: Miin Wu School of Computing - MS Degree in Intelligent Computing
Year of Publication: 2026
Graduation Academic Year: 114
Language: English
Number of Pages: 205
Keywords: LLM Inference, Heterogeneous Memory Architecture, Data Offloading
    As Large Language Models (LLMs) continue to grow in scale, GPU memory pressure intensifies: model weights and Key-Value (KV) caches compete for limited High Bandwidth Memory (HBM) capacity, constraining batch sizes during the memory-bound autoregressive decoding phase and thereby directly limiting inference throughput. Emerging AI accelerator architectures that integrate device-attached DDR memory narrow the bandwidth gap between the primary and secondary memory tiers from roughly 50× (PCIe) to 7-17×, creating an opportunity for latency-hidden data streaming that prior PCIe-based offloading approaches could not fully realize.
    This thesis proposes a bandwidth-aware selective offloading methodology for online LLM serving that treats device-attached DDR as an active component of the inference pipeline. The methodology offloads a bandwidth-constrained subset of model weights or KV cache entries to DDR, freeing HBM capacity for additional KV cache and larger batch sizes. A double-buffered streaming architecture overlaps DDR-to-HBM data transfers with GPU computation via the GPU's Copy Engines, and a closed-form bandwidth budget ensures transfers complete before the kernels that consume them launch, avoiding kernel-launch stalls. The net HBM freed after deducting the streaming-buffer overhead (∆HBM) is identified and, across the tested offloading strategies, is shown in simulation to rank consistently with throughput.
    Three offloading strategies are analyzed: whole-weight (WW), partial-weight (PW), and KV cache offloading. Systematic design space exploration of a hypothetical HBM+DDR architecture is performed with an event-driven GPU performance simulator, validated against NVIDIA A100 40GB PCIe hardware on HBM-only operators at a mean absolute error of 7.16% (composed transformer block errors of -11.24% to +2.56%; the DDR path is validated indirectly, pending validation on real hardware). All results are simulation-based, evaluated on two models (Llama 2-13B and Qwen2.5-14B, covering the MHA and GQA attention architectures) and a single GPU (A100 40GB), and limited to the decoding phase. The methodology applies to dense attention architectures (MHA, GQA) under the assumption that per-iteration kernel execution time is predictable at design time. For input-dependent routing architectures, two applicability probes on a Mixture-of-Experts model characterize the boundary: when the expected expert activation density is well below the all-active assumption, the realized per-iteration time is too short to hide the static DDR transfer, and the resulting kernel-launch stalls offset the throughput gain from capacity expansion; when the batch size is large enough for the realized iteration time to accommodate the transfer, the methodology returns to a net-positive throughput regime. An activation-aware extension of the bandwidth budget is identified as future work.
    Six experiments validate all nine analytical predictions derived from this framework against transaction-level simulation, a level of detail that captures contention effects beyond the analytical model's granularity; prediction C1 (zero kernel-launch stalls) is conditional on sufficient HBM bandwidth headroom. Reported sub-percent differences between strategies (e.g., streaming overhead) fall below the validated composed-block error range and should be read as directional rather than as precise magnitudes. Under capacity-saturated operation with sufficient HBM bandwidth headroom, measured per decoding iteration (each sequence producing one new token), the experiments show throughput improvements of 10-18% at DDR 100 GiB/s and 22-29% at DDR 200 GiB/s with strategies deployable on existing GPU software stacks (WW, KV). Per-operator PW+KV offloading at the full bandwidth budget reaches a projected ceiling of 36% at DDR 200 GiB/s; this ceiling depends on either VMM remapping (whose per-buffer-swap host-device synchronization overhead is not measured on real hardware) or custom multi-pointer GEMM kernels not yet available in production libraries. Under the tested configurations, the PW and KV cache strategies achieve higher ∆HBM than WW at lower DDR bandwidths thanks to their finer buffer granularity, while all strategies converge once the bandwidth budget is large enough to equalize ∆HBM. Streaming overhead is independent of the strategy choice and, under sufficient HBM bandwidth headroom, falls below the simulator's resolution and is indistinguishable from zero; write-back overhead from newly generated KV cache entries is likewise negligible across the tested batch sizes.
    The thesis also presents a six-step practitioner decision framework, expressed in dimensionless parameters, that lets system designers determine whether a given hardware-model configuration is suited to DDR-based streaming and how to configure it.

    The growing scale of Large Language Models (LLMs) has intensified GPU memory pressure, as model weights and Key-Value (KV) caches compete for limited High Bandwidth Memory (HBM) capacity. This competition constrains batch sizes during the memory-bound autoregressive decoding phase, directly limiting inference throughput. Emerging AI accelerator architectures that integrate device-attached DDR memory narrow the bandwidth gap between primary and secondary memory tiers from approximately 50× (PCIe) to 7-17×, creating an opportunity for latency-hidden data streaming that prior PCIe-based offloading approaches cannot fully exploit.
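    To put the bandwidth-gap figures above in perspective, a back-of-the-envelope check (using round numbers assumed here for illustration, not values taken from the thesis) reproduces the roughly 50× and 7-17× ratios from an A100-class HBM bandwidth of about 1.5 TB/s, PCIe Gen4 x16 at roughly 32 GB/s, and device-attached DDR in the 100-200 GiB/s range:

        # Illustrative bandwidth-gap arithmetic; all figures are assumed round numbers.
        HBM_GBPS = 1555.0                      # A100 40GB HBM2 peak bandwidth
        PCIE_GBPS = 32.0                       # PCIe Gen4 x16, one direction (approx.)
        DDR_GBPS = [100 * 1.074, 200 * 1.074]  # 100 and 200 GiB/s expressed in GB/s

        print(f"PCIe tier gap: {HBM_GBPS / PCIE_GBPS:.0f}x")               # ~49x
        for bw in DDR_GBPS:
            print(f"DDR tier gap at {bw:.0f} GB/s: {HBM_GBPS / bw:.1f}x")  # ~14.5x and ~7.2x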
    This thesis proposes a bandwidth-aware selective offloading methodology for online LLM serving, treating device-attached DDR as an active component of the inference pipeline. The methodology offloads a bandwidth-constrained subset of model weights or KV cache entries to DDR, freeing HBM capacity for additional KV cache allocation and larger batch sizes. A double-buffered streaming architecture overlaps DDR-to-HBM data transfers with GPU computation via Copy Engines, and a closed-form bandwidth budget ensures transfers complete without kernel-launch stalls. The net HBM freed after accounting for streaming buffer overhead (∆HBM) is identified and, across the tested offloading strategies, ranks throughput consistently in simulation.
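    The feasibility reasoning behind the streaming design can be summarized as two checks: the DDR transfer of the next buffer must complete within the compute time it overlaps, and the HBM cost of the two staging buffers must be smaller than the capacity the offload releases. The sketch below is an assumed simplification of that reasoning, with illustrative names, not the thesis's exact closed-form budget or notation:

        # Minimal sketch of the latency-hiding check and the net HBM freed (∆HBM).
        # Function and variable names are illustrative assumptions.

        def transfer_is_hidden(streamed_bytes_per_iter, ddr_bw_bytes_per_s, iter_time_s):
            # Streaming stays off the critical path if the asynchronous copy of the
            # next buffer finishes before the kernel that consumes it launches.
            return streamed_bytes_per_iter / ddr_bw_bytes_per_s <= iter_time_s

        def net_hbm_freed(offloaded_bytes, staging_buffer_bytes):
            # Double buffering keeps two staging buffers resident in HBM, so the
            # capacity actually released is the offloaded footprint minus that overhead.
            return offloaded_bytes - 2 * staging_buffer_bytes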
    Three offloading strategies are analyzed: whole-weight (WW), partial-weight (PW), and KV cache offloading. An event-driven GPU performance simulator, validated against NVIDIA A100 40GB PCIe hardware at 7.16% mean absolute error on HBM-only operators (composed transformer block errors -11.24% to +2.56%; DDR path indirectly validated, pending hardware availability for end-to-end validation), enables systematic design space exploration of the hypothetical HBM+DDR architecture. All results are simulation-based, evaluated on two models (Llama 2-13B, Qwen2.5-14B) spanning MHA and GQA attention architectures and a single GPU (A100 40GB), and limited to the decoding phase. The methodology applies to dense attention architectures (MHA, GQA), under the assumption that per-iteration kernel execution time is design-time predictable. Two probes on a Mixture-of-Experts model characterize the boundary for input-dependent routing: when expected expert activation density is well below the all-active assumption, the realized iteration is too short to hide the static DDR transfer and the resulting kernel launch stalls offset the capacity benefit; at sufficiently large batch sizes the realized iteration accommodates the transfer and the methodology returns to net-positive throughput. An activation-aware budget extension is identified as future work.
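    The Mixture-of-Experts boundary described above reduces to the same hiding condition: sparse expert activation shortens the realized iteration while the statically budgeted DDR transfer does not shrink with it. A hedged sketch of that probe condition, under the simplifying assumption that memory-bound iteration time scales roughly with activation density, is:

        # Sketch of the MoE applicability probe; the linear scaling of iteration time
        # with activation density is an assumption made for illustration only.
        def streaming_still_hidden(all_active_iter_s, activation_density,
                                   transfer_bytes, ddr_bw_bytes_per_s):
            realized_iter_s = all_active_iter_s * activation_density
            return transfer_bytes / ddr_bw_bytes_per_s <= realized_iter_s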
    Six experiments validate all nine analytical predictions against transaction-level simulation that captures contention effects beyond the analytical model's granularity, with C1 (zero kernel-launch stalls) conditional on sufficient HBM bandwidth headroom. Reported sub-percent inter-strategy differences (e.g., streaming overhead) lie below the validated composed-block error and should be read as directional rather than as precise magnitudes. Under capacity-saturated operation with sufficient HBM bandwidth headroom, throughput measured per decoding iteration (each sequence producing one new token) improves by 10-18% at DDR 100 GiB/s and by 22-29% at DDR 200 GiB/s using strategies deployable on existing GPU software stacks (whole-weight, KV cache). Per-operator PW+KV offloading at the full bandwidth budget reaches a higher projected ceiling of up to 36% at DDR 200 GiB/s; this ceiling depends on either VMM remapping, whose buffer-swap synchronization overhead is unmeasured on real hardware in this thesis, or custom multi-pointer GEMM kernels not available in current production libraries. Under the tested configurations, the PW and KV cache strategies achieve higher ∆HBM than WW at lower DDR bandwidths due to their finer buffer granularity, while all strategies converge when the bandwidth budget is large enough to equalize ∆HBM. Streaming overhead is strategy-independent and indistinguishable from zero within the simulator's resolution under sufficient HBM bandwidth headroom; write-back overhead from newly generated KV cache entries is similarly negligible at the tested batch sizes.
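    The mechanism behind these gains is the capacity-throughput chain: freed HBM admits more KV cache, more KV cache admits more concurrent sequences, and in the memory-bound decoding regime per-iteration throughput grows roughly with the number of sequences decoded together. A toy calculation along those lines (hypothetical helper names and inputs, not thesis results) would look like:

        # Simplified capacity-to-throughput chain; names and numbers are illustrative.
        def extra_sequences(delta_hbm_bytes, kv_bytes_per_sequence):
            # Each freed chunk of HBM holds the KV cache of additional sequences.
            return delta_hbm_bytes // kv_bytes_per_sequence

        def relative_gain(base_batch, delta_hbm_bytes, kv_bytes_per_sequence):
            # Memory-bound decoding: throughput per iteration scales roughly with batch size.
            return extra_sequences(delta_hbm_bytes, kv_bytes_per_sequence) / base_batch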
    A six-step practitioner decision framework, expressed in dimensionless parameters, enables system designers to determine whether and how to adopt DDR-based streaming for a given hardware-model configuration.
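    The six steps themselves are defined in the thesis; as a hedged illustration of the kind of dimensionless screening such a framework performs, a go/no-go check might combine a few ratios (the parameter names below are hypothetical, not the thesis's symbols):

        # Hypothetical shape of a dimensionless viability check, for illustration only.
        def ddr_streaming_looks_viable(capacity_limited, transfer_over_iter, buffer_over_freed):
            # capacity_limited: batch size is bounded by HBM capacity, not HBM bandwidth.
            # transfer_over_iter: per-iteration DDR transfer time / iteration compute time.
            # buffer_over_freed: staging-buffer HBM cost / HBM capacity freed by offloading.
            return capacity_limited and transfer_over_iter <= 1.0 and buffer_over_freed < 1.0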

    中文摘要 (Abstract in Chinese) iii
    Abstract vi
    Acknowledgements ix
    Contents xi
    List of Tables xx
    List of Figures xxii
    List of Symbols xxiv
    1 Introduction 1
      1.1 Emerging Heterogeneous Memory Architectures 1
      1.2 LLM Inference as a Usage Model 2
      1.3 Opportunity: High-Bandwidth Data Streaming 3
      1.4 Approach Overview 4
      1.5 Contributions 5
      1.6 Thesis Organization 6
    2 Background 8
      2.1 Transformer-based Large Language Models 8
        2.1.1 Transformer Inference Pipeline 8
        2.1.2 Prefill and Decoding Phases 9
        2.1.3 Memory Footprint Components 10
        2.1.4 KV Cache Growth and Memory Pressure 11
      2.2 Graphics Processing Unit Architecture 11
        2.2.1 Throughput-Oriented Execution Model 11
        2.2.2 Streaming Multiprocessors and Copy Engines 12
        2.2.3 Memory Hierarchy of Traditional GPU-based Systems 13
    3 Related Work 15
      3.1 Memory Management for LLM Inference 15
        3.1.1 KV Cache Management 15
        3.1.2 Continuous Batching and Chunked Prefill 16
        3.1.3 Contrast with This Thesis 16
      3.2 Tensor Offloading for LLM Inference 17
        3.2.1 Weight Offloading 17
        3.2.2 KV Cache Offloading 18
        3.2.3 Hybrid Offloading with Recomputation 18
        3.2.4 Attention Computation Offloading 19
        3.2.5 Limitations of Prior Offloading Work 20
      3.3 Emerging Heterogeneous Memory Architectures 21
        3.3.1 Integrated CPU-GPU Systems 21
        3.3.2 Dataflow Architectures with Integrated DDR 21
        3.3.3 CXL-Attached Memory 22
        3.3.4 Summary of Bandwidth Characteristics 22
      3.4 GPU Data Movement Mechanisms 23
        3.4.1 Double Buffering 23
        3.4.2 Prefetching in Related Contexts 23
      3.5 Research Gap and Thesis Positioning 24
        3.5.1 Comparison with Prior Work 24
        3.5.2 Research Gaps 24
        3.5.3 Thesis Contribution Summary 25
    4 Problem Definition 27
      4.1 Target Workload: Memory-Bound Decoding Phase 27
        4.1.1 Scope Boundaries 28
      4.2 Roofline Analysis of LLM Decoding 29
        4.2.1 Capacity-Throughput Chain 29
      4.3 Target System Architecture 32
        4.3.1 Essential Architectural Requirements 33
        4.3.2 Architectural Assumptions and Abstractions 33
      4.4 Problem Formulation 34
        4.4.1 Research Questions 34
        4.4.2 Formal Problem Statement 35
        4.4.3 Capacity Constraint 36
        4.4.4 Bandwidth Constraint 36
    5 Bandwidth-Aware Selective Offloading for Device-Attached DDR 37
      5.1 Solution Overview 37
      5.2 Streaming Design and Mechanisms 39
        5.2.1 Selective Offloading 39
        5.2.2 Streaming Buffer Sizing Analysis 39
        5.2.3 Double Buffering 46
        5.2.4 Asynchronous Data Transfer via Copy Engines 47
        5.2.5 Irregular Tensor Allocation for Partial Tensor Offloading 47
      5.3 Offloading Tensor Selection 49
        5.3.1 Timing Model for Latency Hiding 50
        5.3.2 Selection Policy 55
        5.3.3 Static Selection Scope 56
        5.3.4 Layer Selection for Whole-Tensor Offloading 56
      5.4 Streaming Candidate Analysis 58
        5.4.1 KV Cache Streaming 59
        5.4.2 Per-Operator Offloading 61
      5.5 Analytical Predictions 65
        5.5.1 Streaming Mechanism Predictions 66
        5.5.2 Buffer Overhead Predictions 66
        5.5.3 Throughput Enhancement Prediction 67
        5.5.4 Mechanism Predictions 70
        5.5.5 Throughput Ranking Prediction 71
    6 Simulation Infrastructure 72
      6.1 Motivation for Simulation-Based Evaluation 72
        6.1.1 The Need for Simulation 72
        6.1.2 Simulation Fidelity Requirements 74
      6.2 GPU Architecture Model 74
        6.2.1 Reference Architecture and Parameterization 76
        6.2.2 Compute Subsystem 77
        6.2.3 Memory Subsystem 78
        6.2.4 Modeling Abstractions and Assumptions 81
      6.3 Operator Models 82
        6.3.1 Linear Projection Operators 83
        6.3.2 Attention Operators 85
        6.3.3 Workload Parameterization 86
    7 Experiments and Results 88
      7.1 Experimental Setup 89
        7.1.1 Model Selection and Rationale 89
        7.1.2 Simulated Hardware Configuration 90
        7.1.3 Baseline and Treatment Configurations 92
        7.1.4 Context Length Selection Methodology 92
      7.2 Evaluation Methodology 93
        7.2.1 Evaluation Unit 93
        7.2.2 Throughput Metrics 94
        7.2.3 Mechanism Validation Metrics 94
        7.2.4 Throughput Decomposition Framework 95
      7.3 Experiment 1: Throughput Enhancement Demonstration 96
        7.3.1 Research Question 96
        7.3.2 Experimental Design 96
        7.3.3 Results 97
        7.3.4 Discussion 98
      7.4 Experiment 2: Layer Selection Sensitivity for Whole-Weight Offloading 98
        7.4.1 Research Question 98
        7.4.2 Experimental Design 98
        7.4.3 Results 100
        7.4.4 Discussion 100
      7.5 Experiment 3: Write-Back Overhead of KV Cache Offloading 103
        7.5.1 Research Question 103
        7.5.2 Experimental Design 103
        7.5.3 Results 104
        7.5.4 Discussion 105
      7.6 Experiment 4: Offloading Candidate Comparison 105
        7.6.1 Research Question 105
        7.6.2 Experimental Design 106
        7.6.3 Results 107
        7.6.4 Discussion 110
      7.7 Experiment 5: Per-Operator PW-KV Offloading 113
        7.7.1 Research Question 113
        7.7.2 Experimental Design 114
        7.7.3 Results 115
        7.7.4 Discussion 115
      7.8 Experiment 6: Bandwidth Budget Scaling 117
        7.8.1 Research Question 117
        7.8.2 Experimental Design 117
        7.8.3 Results 118
        7.8.4 Discussion 119
    8 Discussion 121
      8.1 Summary of Claims Validation 121
      8.2 Cross-Experiment Synthesis 122
        8.2.1 Unifying Mechanism 122
        8.2.2 Normalized Parameter Analysis 124
      8.3 Practical Implications 126
        8.3.1 When to Use GPU-Attached DDR 126
        8.3.2 Offloading Strategy Selection 127
        8.3.3 Hardware Configuration Guidance 131
        8.3.4 Integration Considerations 132
      8.4 Methodology Scope Boundary: Mixture-of-Experts Architectures 133
        8.4.1 Probe under Synthetic Activation Control 133
        8.4.2 Probe under In-Batch Activation Model 135
        8.4.3 Mechanism and Scope Implication 138
      8.5 Threats to Validity 139
        8.5.1 Internal Validity 139
        8.5.2 External Validity 141
        8.5.3 Construct Validity 142
      8.6 Limitations 143
    9 Simulator Validation 145
      9.1 Validation Objectives 145
        9.1.1 Purpose of Validation 145
        9.1.2 Fitness for Purpose: Comparative Design Space Exploration 146
      9.2 Correlation Methodology 146
        9.2.1 Reference Measurement Infrastructure 146
        9.2.2 Controlled Measurement Protocol 146
        9.2.3 Iterative Correlation Workflow 147
        9.2.4 Discrepancy Diagnosis 148
      9.3 Methodology Extensibility 149
        9.3.1 Representative versus Exhaustive Coverage 149
        9.3.2 DDR Traffic: Indirect Validation 149
      9.4 Validation Targets and Coverage 150
        9.4.1 Linear Projection Operators 150
        9.4.2 Attention Operators 150
        9.4.3 Composed Operators (Transformer Block) 151
        9.4.4 Mapping to Experimental Scope 151
      9.5 Correlation Results and Analysis 151
        9.5.1 Linear Operator Results 152
        9.5.2 Attention Operator Results 153
        9.5.3 Composed Operator Results 154
        9.5.4 Aggregate Statistics 154
      9.6 Discussion 155
        9.6.1 Error Characteristics 155
    10 Conclusion 156
      10.1 Thesis Recap 156
      10.2 Summary of Contributions 157
      10.3 Key Findings and Answers to Research Questions 159
        10.3.1 RQ1: Strategy Selection 159
        10.3.2 RQ2: Effectiveness Conditions 161
      10.4 Future Work 162
    References 165
    Appendix A: Multi-Operator Whole-Weight Offloading 172
      A.1 DDR Bandwidth Underutilization 172
      A.2 Structural Bubble 172
      A.3 Parity Effect on Buffer Assignment 173
      A.4 Prefetch Feasibility Condition 175
      A.5 Buffer Cost 176
      A.6 Configuration Selection Algorithm 176
      A.7 Relationship to Single-Operator Offloading 178

