| Graduate Student: | Chen, Yen-Po (陳彥博) |
|---|---|
| Thesis Title: | Optimizing Latency via Software Pipelining and Request Routing in Disaggregated LLM Serving (以軟體管線化與請求路由最佳化解耦式 LLM 服務之延遲) |
| Advisor: | Shieh, Ming-Der (謝明得) |
| Co-Advisor: | Lin, Wei-Fen (林偉棻) |
| Degree: | Master |
| Department: | Miin Wu School of Computing (敏求智慧運算學院), MS Degree in Intelligent Computing |
| Year of Publication: | 2026 |
| Graduation Academic Year: | 114 |
| Language: | English |
| Number of Pages: | 113 |
| Keywords: | disaggregated LLM inference serving, software pipelining, request routing, KV-cache management, proactive KV migration |
As demand for large language model (LLM) inference grows rapidly, modern serving systems increasingly adopt disaggregated architectures that place the compute-intensive prefill stage and the memory-bandwidth-intensive decode stage on separate GPUs to improve hardware utilization. This architecture, however, turns the key-value (KV) cache from local state on a single GPU into distributed system state that must be prepared, transferred, and restored across instances. Under multi-turn conversational workloads, the cost of handling the KV cache grows with the accumulated conversation history and becomes an important contributor to both Time To First Token (TTFT) and Time Between Tokens (TBT). To address this, this thesis proposes a latency-optimization design for disaggregated LLM serving that integrates a multi-stage execution pipeline, two-tier KV-cache management across high-bandwidth memory (HBM) and device-attached DDR, and a turn-aware design-space-exploration (DSE) routing policy.
To evaluate the design, this work develops kvsim, a calibrated multi-GPU LLM serving simulator built on the ACALSim event-driven simulation framework and calibrated against operator-level latency measurements on an NVIDIA A100. Gemma-9B serves as the primary evaluation target, with Llama-13B as supplementary cross-model validation, and the kvsim-based evaluation analyzes user-perceived latency for conversation histories of 1K to 16K tokens. The results show that software pipelining moves KV transfers off the critical path, improving decode-phase behavior and reducing TBT, and that adding the turn-aware DSE routing policy further relieves the prefill-side queuing bottleneck. Under a representative heavy-load Gemma-9B workload, the design reduces TTFT by about 11% and TBT by about 47%.
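The pipelining idea can be illustrated with a small sketch. The following Python snippet is not the thesis implementation; `migrate_kv_blocks`, `decode_step`, and all timing constants are illustrative assumptions. It shows the basic pattern of issuing a KV-block transfer asynchronously so that decode iterations continue while the copy is in flight, and only waiting for the transfer at the point where the migrated blocks are actually needed.

```python
# Minimal sketch (assumed names and timings): keep KV migration off the
# decode critical path by overlapping it with decode iterations.
import time
from concurrent.futures import ThreadPoolExecutor

def migrate_kv_blocks(num_blocks: int, per_block_ms: float = 0.2) -> int:
    """Stand-in for copying KV blocks between instances or memory tiers."""
    time.sleep(num_blocks * per_block_ms / 1000)
    return num_blocks

def decode_step(step: int, per_step_ms: float = 5.0) -> None:
    """Stand-in for one decode iteration on the decode instance."""
    time.sleep(per_step_ms / 1000)

def decode_with_pipelined_migration(num_steps: int, kv_blocks: int) -> None:
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Kick off the KV migration in the background; decode keeps running.
        future = pool.submit(migrate_kv_blocks, kv_blocks)
        for step in range(num_steps):
            decode_step(step)  # TBT is unaffected while the copy proceeds
        # Block only at the point where the migrated KV is actually needed.
        moved = future.result()
        print(f"migrated {moved} KV blocks while {num_steps} decode steps ran")

if __name__ == "__main__":
    decode_with_pipelined_migration(num_steps=32, kv_blocks=64)
```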
Disaggregated LLM serving places the compute-intensive prefill stage and the memory-bandwidth-intensive decode stage on different GPU instances. While this improves hardware utilization, it also turns the key-value (KV) cache into distributed state that must be prepared, transferred, and restored across instances. Under multi-turn conversational workloads, growing conversation history makes KV-cache handling an important contributor to both Time To First Token (TTFT) and Time Between Tokens (TBT). This thesis proposes a latency-optimization design that combines a multi-stage execution pipeline, two-tier KV-cache management across HBM and device-attached DDR, and turn-aware DSE-based request routing.
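To make the two-tier idea concrete, the sketch below is an assumption-laden illustration rather than the thesis design: block counts stand in for KV blocks, and `demote_idle`/`promote_for_next_turn` are hypothetical operations. It tracks which conversations' KV blocks reside in HBM versus device-attached DDR, proactively demotes idle conversations, and promotes a conversation back to HBM before its next turn's prefill.

```python
# Minimal sketch (assumed structure): a two-tier KV-cache directory that
# keeps active conversations in HBM and proactively parks idle ones in DDR.
from dataclasses import dataclass, field

@dataclass
class TwoTierKVCache:
    hbm_capacity_blocks: int
    hbm: dict = field(default_factory=dict)  # conv_id -> #KV blocks in HBM
    ddr: dict = field(default_factory=dict)  # conv_id -> #KV blocks in DDR

    def hbm_used(self) -> int:
        return sum(self.hbm.values())

    def admit(self, conv_id: str, blocks: int) -> None:
        """Record a conversation's KV blocks in HBM after prefill/decode."""
        self.hbm[conv_id] = self.hbm.get(conv_id, 0) + blocks

    def demote_idle(self, idle_conv_ids: list) -> None:
        """Proactively migrate idle conversations' KV to DDR to free HBM."""
        for cid in idle_conv_ids:
            if cid in self.hbm:
                self.ddr[cid] = self.ddr.get(cid, 0) + self.hbm.pop(cid)

    def promote_for_next_turn(self, conv_id: str) -> bool:
        """Restore a conversation's KV to HBM ahead of its next-turn prefill."""
        blocks = self.ddr.get(conv_id, 0)
        if blocks == 0:
            return conv_id in self.hbm
        if self.hbm_used() + blocks > self.hbm_capacity_blocks:
            return False  # caller must first demote something else
        self.hbm[conv_id] = self.hbm.get(conv_id, 0) + self.ddr.pop(conv_id)
        return True

if __name__ == "__main__":
    cache = TwoTierKVCache(hbm_capacity_blocks=1024)
    cache.admit("conv-a", 256)
    cache.demote_idle(["conv-a"])                 # idle between turns -> DDR
    assert cache.promote_for_next_turn("conv-a")  # back in HBM before prefill
```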
The design is evaluated with kvsim, a multi-GPU LLM serving simulator built on the ACALSim event-driven framework and calibrated against operator-level latency measurements on an NVIDIA A100 GPU. Gemma-9B is the main evaluation target, while Llama-13B provides supplementary cross-model validation. The evaluation examines user-perceived latency under multi-turn conversational workloads with 1K–16K-token histories. The results show that pipelining moves KV handling off the critical path and lowers TBT, while turn-aware routing further alleviates prefill-side queuing bottlenecks. Under a representative heavy-load Gemma-9B workload, TTFT decreases by about 11% and TBT by about 47%.
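The turn-aware routing decision can likewise be sketched as a cost comparison. In the hypothetical example below, the cost constants and fields such as `kv_location` and `RESTORE_MS_PER_KTOK` are assumptions rather than measured values: each candidate prefill instance is scored by its estimated queueing delay plus the cost of getting the conversation's prior-turn KV into its HBM, so a longer queue can still win when the KV cache is already resident.

```python
# Minimal sketch (assumed costs and fields): pick the prefill instance that
# minimizes queueing delay plus prior-turn KV restore/transfer cost.
from dataclasses import dataclass

@dataclass
class PrefillInstance:
    name: str
    queued_ms: float   # estimated queueing delay on this instance
    kv_location: str   # "hbm", "ddr", or "none" for this conversation

# Assumed per-1K-token cost of making the prior-turn KV resident in HBM.
RESTORE_MS_PER_KTOK = {"hbm": 0.0, "ddr": 1.5, "none": 4.0}

def route_turn(instances: list, history_ktokens: float) -> PrefillInstance:
    def cost(inst: PrefillInstance) -> float:
        kv_cost = RESTORE_MS_PER_KTOK[inst.kv_location] * history_ktokens
        return inst.queued_ms + kv_cost
    return min(instances, key=cost)

if __name__ == "__main__":
    fleet = [
        PrefillInstance("prefill-0", queued_ms=12.0, kv_location="hbm"),
        PrefillInstance("prefill-1", queued_ms=2.0, kv_location="none"),
    ]
    # With an 8K-token history, reusing resident KV beats the shorter queue.
    print(route_turn(fleet, history_ktokens=8.0).name)
```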