
Graduate Student: Chen, Yen-Po (陳彥博)
Thesis Title: Optimizing Latency via Software Pipelining and Request Routing in Disaggregated LLM Serving (以軟體管線化與請求路由最佳化解耦式 LLM 服務之延遲)
Advisor: Shieh, Ming-Der (謝明得)
Co-advisor: Lin, Wei-Fen (林偉棻)
Degree: Master
Department: MS Degree in Intelligent Computing, Miin Wu School of Computing (敏求智慧運算學院)
Year of Publication: 2026
Graduation Academic Year: 114 (ROC calendar)
Language: English
Number of Pages: 113
Keywords: disaggregated LLM inference serving, software pipelining, request routing, KV-cache management, proactive KV-cache migration
As demand for large language model (LLM) inference grows rapidly, modern inference serving systems increasingly adopt disaggregated architectures that place the compute-intensive prefill stage and the memory-bandwidth-intensive decode stage on different GPUs to improve hardware utilization. Under this architecture, however, the key-value (KV) cache is no longer merely local state on a single GPU; it becomes distributed system state that must be prepared, transferred, and restored across instances. Under multi-turn conversational workloads, as conversation history keeps accumulating, the cost of handling the KV cache becomes an important factor affecting both Time To First Token (TTFT) and Time Between Tokens (TBT). To address this, this thesis proposes a latency-optimization design for disaggregated LLM inference serving that integrates a multi-stage execution pipeline, two-tier KV-cache management across high-bandwidth memory (HBM) and device-attached DDR, and a turn-aware design-space-exploration (DSE) routing policy.
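
To make the two-tier idea concrete, the following minimal sketch (Python, purely illustrative and not the thesis implementation) shows the basic pattern: hot KV blocks stay in HBM, cold blocks spill to device-attached DDR, and the blocks a request needs are restored before it executes. The class and method names, the LRU spill policy, and the block granularity are all assumptions made for this example.

    # Illustrative two-tier KV-cache sketch: HBM tier + device-attached DDR tier.
    # Not the thesis implementation; policies and names are assumed for clarity.
    from collections import OrderedDict

    class TwoTierKVCache:
        def __init__(self, hbm_capacity_blocks):
            self.hbm = OrderedDict()          # block_id -> KV data, kept in LRU order
            self.ddr = {}                     # block_id -> KV data (larger, slower tier)
            self.hbm_capacity = hbm_capacity_blocks

        def put(self, block_id, kv_block):
            """Insert a block into HBM, spilling the least-recently-used block to DDR."""
            if len(self.hbm) >= self.hbm_capacity:
                victim_id, victim = self.hbm.popitem(last=False)
                self.ddr[victim_id] = victim  # spill the cold block to the DDR tier
            self.hbm[block_id] = kv_block

        def prepare(self, block_ids):
            """Bring every block a request needs back into HBM before it executes."""
            for bid in block_ids:
                if bid in self.hbm:
                    self.hbm.move_to_end(bid)          # mark as recently used
                elif bid in self.ddr:
                    self.put(bid, self.ddr.pop(bid))   # restore from DDR ahead of prefill
                else:
                    raise KeyError(f"KV block {bid} is absent and must be recomputed")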

To evaluate the design, this work develops kvsim, a calibrated multi-GPU LLM inference serving simulator built on the ACALSim event-driven simulation framework and calibrated against operator-level latency measurements on an NVIDIA A100. Gemma-9B is the primary evaluation target, with Llama-13B providing supplementary cross-model validation, and user-perceived latency is analyzed in kvsim for conversation histories of 1K to 16K tokens. The results show that software pipelining moves KV transfers off the critical path, improving decode-stage behavior and reducing TBT; adding the turn-aware DSE routing policy further relieves prefill-side queueing bottlenecks. Under a representative heavy-load Gemma-9B workload, the design reduces TTFT by about 11% and TBT by about 47%.
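
For readers unfamiliar with event-driven simulation, the sketch below shows the kind of timestamp-ordered event loop such a simulator is built around. It is not ACALSim's or kvsim's actual API; the EventLoop class and its methods are invented for illustration.

    # Minimal event-driven simulation loop: events are processed in timestamp order,
    # and each callback may schedule further events. Illustrative only.
    import heapq

    class EventLoop:
        def __init__(self):
            self.now = 0.0
            self._queue = []   # (timestamp, sequence, callback)
            self._seq = 0

        def schedule(self, delay, callback):
            """Schedule a callback `delay` seconds of simulated time in the future."""
            heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
            self._seq += 1

        def run(self):
            """Pop events in timestamp order until no work remains."""
            while self._queue:
                self.now, _, callback = heapq.heappop(self._queue)
                callback()

    # Example: a prefill that finishes after its estimated latency, then schedules
    # the first decode step (latency numbers are arbitrary placeholders).
    loop = EventLoop()
    loop.schedule(0.120, lambda: loop.schedule(0.015, lambda: print("first token at", loop.now)))
    loop.run()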

Disaggregated LLM serving places the compute-intensive prefill stage and the memory-bandwidth-intensive decode stage on different GPU instances. While this improves hardware utilization, it also turns the key-value (KV) cache into distributed state that must be prepared, transferred, and restored across instances. Under multi-turn conversational workloads, growing conversation history makes KV-cache handling an important contributor to both Time To First Token (TTFT) and Time Between Tokens (TBT). This thesis proposes a latency-optimization design that combines a multi-stage execution pipeline, two-tier KV-cache management across HBM and device-attached DDR, and turn-aware request routing based on design-space exploration (DSE).
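
As a rough illustration of what a turn-aware routing decision can look like, the sketch below scores each candidate prefill instance by its estimated queueing delay plus the cost of migrating the session's missing KV bytes, weighted by how many turns of the session are still expected. The data classes, attribute names, bandwidth value, and cost formula are assumptions for this example, not the DSE cost model of the thesis.

    # Illustrative turn-aware routing: pick the prefill instance with the lowest
    # estimated cost for this conversation turn. All fields are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class PrefillInstance:
        name: str
        queued_prefill_tokens: int      # tokens already waiting in this instance's queue
        prefill_tokens_per_s: float     # calibrated prefill throughput
        cached_kv_bytes: int            # bytes of this session's KV already resident here

    @dataclass
    class Turn:
        history_kv_bytes: int           # KV footprint of the accumulated conversation
        expected_remaining_turns: int   # turn-aware weight for future KV reuse

    def route(turn, instances, link_bw_bytes_per_s=25e9):
        """Return the instance whose estimated queueing plus migration cost is lowest."""
        def cost(inst):
            queue_delay = inst.queued_prefill_tokens / inst.prefill_tokens_per_s
            missing = max(turn.history_kv_bytes - inst.cached_kv_bytes, 0)
            migrate_delay = missing / link_bw_bytes_per_s
            # KV locality matters more when many turns of the session are still expected.
            return queue_delay + turn.expected_remaining_turns * migrate_delay
        return min(instances, key=cost)

    # Example: a long-lived session prefers the instance that already holds its KV,
    # even though that instance currently has the longer prefill queue.
    a = PrefillInstance("prefill-0", queued_prefill_tokens=4096, prefill_tokens_per_s=20000, cached_kv_bytes=2 << 30)
    b = PrefillInstance("prefill-1", queued_prefill_tokens=1024, prefill_tokens_per_s=20000, cached_kv_bytes=0)
    print(route(Turn(history_kv_bytes=2 << 30, expected_remaining_turns=4), [a, b]).name)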

The design is evaluated with kvsim, a multi-GPU LLM serving simulator built on ACALSim and calibrated against operator-level latency measurements on an NVIDIA A100 GPU. Gemma-9B is the main evaluation target, while Llama-13B provides supplementary cross-model validation. Using this kvsim-based evaluation, the study examines user-perceived latency under multi-turn conversational workloads with 1K to 16K tokens of history. The results show that pipelining moves KV handling off the critical path and lowers TBT, while turn-aware routing further alleviates prefill-side queueing bottlenecks. Under a representative heavy-load Gemma-9B workload, TTFT decreases by about 11% and TBT by about 47%.
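
Both reported metrics can be computed directly from per-token emission timestamps. A minimal sketch, assuming timestamps in seconds and at least two output tokens, is shown below.

    def ttft(arrival_time, token_times):
        """Time To First Token: delay from request arrival to the first output token."""
        return token_times[0] - arrival_time

    def mean_tbt(token_times):
        """Time Between Tokens: average gap between consecutive output tokens."""
        gaps = [t1 - t0 for t0, t1 in zip(token_times, token_times[1:])]
        return sum(gaps) / len(gaps)

    # Example: request arrives at t=0.0 and emits tokens at these times (seconds).
    times = [0.85, 0.88, 0.91, 0.95]
    print(ttft(0.0, times))    # 0.85
    print(mean_tbt(times))     # ~0.033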

Table of Contents:
Chinese Abstract
Abstract
Acknowledgements
Contents
List of Tables
List of Figures
1 Introduction
  1.1 Motivation and Overview
  1.2 Disaggregated Serving and the Research Gap
  1.3 Limitations of Reactive KV Management
  1.4 Enabling Condition: Secondary Memory Tier
  1.5 Optimization Opportunities
  1.6 Proposed Design
  1.7 Contributions
2 Related Works
  2.1 Disaggregated Architectures for LLM Serving
  2.2 Heterogeneous GPU Memory Systems
  2.3 KVCache Management and Tiering
  2.4 Multi-Turn Conversational LLM Serving
3 Solution Proposal
  3.1 Target System
    3.1.1 Multi-GPU Architecture
    3.1.2 Memory Hierarchy
    3.1.3 GPU Interconnect
    3.1.4 Scope and Assumptions
    3.1.5 Calibration Baseline
  3.2 Pipeline Execution Model
    3.2.1 Baseline: Two-Stage Pipeline
    3.2.2 Proposed Multi-Stage Pipeline
    3.2.3 Stage Dependencies and Parallelism
    3.2.4 KV Data Flow per Stage
    3.2.5 Chunked Prefill and Pipelined Synchronization
  3.3 KV Cache Management
    3.3.1 Block-Based KV Representation
    3.3.2 Logical and Physical KV Views
    3.3.3 Source Selection
    3.3.4 Capacity Management and Eviction
    3.3.5 Session Window Management
    3.3.6 Prepare–Execute–Complete Protocol
  3.4 Scheduling Policy
    3.4.1 FIFO Serving Order
    3.4.2 Continuous Batching
    3.4.3 Per-Instance Dual-Queue Execution
    3.4.4 Instance Selection
  3.5 DSE-based Cost Model
    3.5.1 Online Decision Procedure
    3.5.2 Decision Factors and Runtime Interaction
    3.5.3 End-to-End Runtime Decision Flow
    3.5.4 Search Space and Policy Scope
    3.5.5 Runtime Cost Estimation
    3.5.6 Optimization Objective and Scope
  3.6 Summary
4 Experimental Methodology
  4.1 Simulation Methodology
  4.2 Simulator Architecture
    4.2.1 Request Generator Model
    4.2.2 Inference Server Model
    4.2.3 Multi-GPU System Model
  4.3 Simulator Calibration Methodology
    4.3.1 Calibration Setup
    4.3.2 Performance Model Parameters
  4.4 Experimental Setup
    4.4.1 Hardware Configuration
    4.4.2 Serving Configuration
    4.4.3 Evaluated Models
    4.4.4 Compared Configurations
    4.4.5 Workload Configuration
    4.4.6 Evaluation Metrics
5 Evaluation
  5.1 Simulator Calibration
    5.1.1 Prefill Calibration
    5.1.2 Decode Calibration
    5.1.3 Stability Validation
    5.1.4 Calibration Summary
  5.2 Simulator Validation
    5.2.1 Validation Against the Original kvsim Model
    5.2.2 Cross-Validation Against HPCSim Cycle-Accurate Data
    5.2.3 Implication for the Main Findings
  5.3 Performance Evaluation
    5.3.1 Decode Efficiency (TBT)
    5.3.2 Prefill-to-Decode Transition Composition
    5.3.3 Responsiveness (TTFT)
    5.3.4 KV-Cache Movement Breakdown
    5.3.5 Cross-Model Validation on Llama-13B
    5.3.6 Sensitivity to Prefill-Latency Underestimation
    5.3.7 Secondary-Tier Scope and Bandwidth Sensitivity
6 Conclusions
References
Appendix A: Tail-Latency Results
  A.1 Gemma-9B TTFT Tail Trends
  A.2 Gemma-9B TBT Tail Trends

