
Author: Lee, Yi-Huan
Title: Optimized Data Deployment Strategies for Deep Neural Network Training on Kafka Platforms
Advisor: Chen, Chao-Chun
Degree: Master
Department: Institute of Manufacturing Information and Systems, College of Electrical Engineering and Computer Science
Year of Publication: 2025
Graduation Academic Year: 113 (2024–2025)
Language: Chinese
Number of Pages: 56
Chinese Keywords: Data Parallelism, Partitioning Strategy, Data Deployment, Performance Metrics Collection, Data Routing Algorithm
English Keywords: Data Parallelism, Apache Kafka, Node-aware Routing, Performance Metrics Collection, Ray
  • Deep neural network training under data parallelism requires keeping data volumes balanced across compute devices. Data parallelism is a key technique for accelerating deep learning: by distributing the training data across multiple compute devices for parallel processing, it effectively shortens training time and improves the ability to handle large datasets. In data-parallel training, partitioning the dataset into subsets and assigning them to different compute units (e.g., GPUs) is essential.
      Keeping these subsets balanced in size is critical for efficient and stable training. Imbalanced partitions lead to uneven compute load, poor resource utilization, and inconsistent training progress. Ideally, each compute unit should receive a subset as close in size to the others as possible, so that every unit finishes its computation at roughly the same time. This reduces synchronization waits, raises resource utilization, and yields a more stable and consistent training process.
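The balanced splitting described above can be sketched in a few lines; `balanced_shards` is an illustrative helper, not code from the thesis, assuming contiguous index ranges per worker:

```python
def balanced_shards(n_samples, n_workers):
    # Split n_samples into n_workers contiguous index ranges whose sizes
    # differ by at most one, so every worker finishes a training step in
    # roughly the same time and synchronization waits stay small.
    base, extra = divmod(n_samples, n_workers)
    sizes = [base + (1 if i < extra else 0) for i in range(n_workers)]
    bounds, start = [], 0
    for size in sizes:
        bounds.append((start, start + size))
        start += size
    return bounds

# e.g. 10 samples over 4 workers -> shard sizes 3, 3, 2, 2
```

Real data-parallel frameworks (e.g., Ray Train, used in this thesis's experiments) perform an equivalent near-equal split internally.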
    Improving the balance of Kafka data deployment for machine learning: this study investigates and optimizes the data-deployment balance of Apache Kafka in large-scale machine learning applications. In such settings, efficient and even data distribution is critical to maximizing compute utilization and accelerating model training. Although Kafka, as the leading distributed streaming platform, plays a central role in data pipelines, its native design lacks a built-in dynamic load-balancing mechanism. Existing solutions mostly rely on after-the-fact data migration, which is costly and, under continuously growing data volumes and data skew, struggles to provide timely and effective balancing; some nodes become overloaded while others sit idle, severely hurting overall resource utilization and training efficiency. The main contribution of this thesis is the design and implementation of a proactive data-deployment balancing mechanism.
    The mechanism has the following features:
      • Dynamic priority evaluation: the system can dynamically load multiple scoring functions to flexibly rank the priority of each node.
      • Real-time target-node selection: whenever a record needs a target node, the mechanism picks the most suitable node according to the priority order established by the scoring functions.
      • Flexible, data-aware scoring functions: to ensure data balance, this work implements scoring functions based on inter-node load differences and data distribution, tying node priority closely to data volume.
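The three features above can be sketched as a single selection routine. This is a minimal illustration, not the thesis's actual interfaces: scoring functions are modeled as plain callables mapping a node's metrics to a cost (lower is preferred), and `pick_target_node` stands in for the dynamically loaded mechanism:

```python
def pick_target_node(nodes, scorers, weights=None):
    # nodes:   {node_id: metrics dict}, e.g. {"broker-1": {"bytes_used": 10}}
    # scorers: callables mapping a metrics dict to a cost (lower = better);
    #          the thesis loads these dynamically, here they are plain functions.
    weights = weights or [1.0] * len(scorers)

    def total_cost(node_id):
        metrics = nodes[node_id]
        return sum(w * score(metrics) for w, score in zip(weights, scorers))

    # Rank nodes by combined score and pick the cheapest one as the target.
    return min(nodes, key=total_cost)
```

A load-based scorer such as `lambda m: m["bytes_used"]` then steers new records toward the least-loaded node, which is the data-aware behavior the third bullet describes.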
      Experimental results show that our mechanism effectively balances storage utilization across nodes; with machine learning datasets as the test workload, it avoids the 50% performance degradation that occurs under extreme imbalance. For maximum compatibility, our implementation combines Kafka's plugin extension points with Kafka's recently introduced telemetry feature [1], making the results compatible with multiple Kafka versions and maximizing deployment benefit. In summary, this work provides an effective resource-utilization optimization for Kafka in large-scale machine learning scenarios, reducing compute cost and significantly improving overall system stability and scalability.

    Data parallelism is a key technique in deep learning, enabling efficient training of large models by distributing workloads across multiple devices. Apache Kafka, with its high throughput and scalability, is well-suited for managing data streams in such environments. However, Kafka's default data routing lacks awareness of node-level storage usage, often resulting in imbalanced resource utilization.
    This thesis proposes a node-aware data routing algorithm and a performance metrics collection system designed for compatibility with existing Kafka deployments. The metrics system leverages Kafka’s own telemetry features to asynchronously gather and store performance data, minimizing impact on latency.
    By introducing asynchronous routing and caching mechanisms, the proposed approach maintains low write latency while achieving better storage balance across Kafka brokers. Experimental results show that this method significantly reduces imbalance, improves resource efficiency, and supports more stable deep learning workflows. The design is modular, easy to integrate, and paves the way for future performance-aware enhancements in Kafka-based training pipelines.
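A rough model of the routing idea described above, assuming a locally cached snapshot of per-broker storage usage that is refreshed only when stale, so the common write path never blocks on a metrics fetch. `NodeAwareRouter`, `fetch_usage`, and the optimistic local update are illustrative assumptions, not Kafka's actual `Partitioner` API:

```python
import time

class NodeAwareRouter:
    # Hypothetical sketch of node-aware routing with a cached metrics
    # snapshot: the hot path is a dictionary lookup, and the (potentially
    # slow) metrics fetch happens only when the cache exceeds its TTL.
    def __init__(self, fetch_usage, ttl=5.0):
        self._fetch = fetch_usage        # callable -> {broker_id: bytes_used}
        self._ttl = ttl
        self._usage = self._fetch()      # initial snapshot
        self._stamp = time.monotonic()

    def route(self, record):
        # Refresh the snapshot off the critical path only when it is stale.
        if time.monotonic() - self._stamp > self._ttl:
            self._usage = self._fetch()
            self._stamp = time.monotonic()
        # Send the record to the broker with the least cached usage.
        broker = min(self._usage, key=self._usage.get)
        # Optimistic local update so bursts between refreshes still spread out.
        self._usage[broker] += len(record)
        return broker
```

The optimistic update matters: without it, every record between two refreshes would hit the same "least used" broker, recreating the imbalance the router is meant to fix.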

    Certificate of Degree Examination I
    Abstract (Chinese) II
    Extended Abstract IV
    Acknowledgements VIII
    Table of Contents IX
    List of Tables XI
    List of Figures XII
    Chapter 1 Introduction 1
      1.1 Background 1
      1.2 Research Problem and Motivation 2
      1.3 Contributions 8
      1.4 Research Process 8
      1.5 Thesis Organization 9
    Chapter 2 Literature Review and Core Technologies 10
      2.1 Literature Review 10
      2.2 Core Technologies 11
    Chapter 3 System Architecture Design 20
      3.1 Performance Metrics Collection System 23
      3.2 Node-aware Data Routing Algorithm 28
      3.3 Memory Optimizations on the Metrics Collection Path 30
    Chapter 4 Experiments 34
      4.1 Experimental Environment 34
      4.2 Test Dataset Specifications 34
        4.2.1 Result 1: Storage usage ratios under uneven partition assignment 35
        4.2.2 Result 2: Write request latency comparison 36
        4.2.3 Result 3: Mitigating existing storage-usage imbalance 37
        4.2.4 Result 4: Storage usage versus training time 37
    Chapter 5 Future Work 40
    References 41

    1. Telemetry, https://opentelemetry.io/
    2. Users of Kafka, https://kafka.apache.org/
    3. Client libraries of Kafka, https://docs.confluent.io/kafka-client/overview.html
    4. Cloud providers of Kafka, https://www.kai-waehner.de/blog/2021/04/20/comparison-open-source-apache-kafka-vs-confluent-cloudera-red-hat-amazon-msk-cloud/
    5. Products from Kafka, https://www.kai-waehner.de/blog/2021/05/09/kafka-api-de-facto-standard-event-streaming-like-amazon-s3-object-storage/
    6. KIP-794, https://cwiki.apache.org/confluence/display/KAFKA/KIP-794%3A+Strictly+Uniform+Sticky+Partitioner
    7. Confluent Self-Balancing Clusters, https://docs.confluent.io/platform/current/clusters/sbc/index.html
    8. cruise-control, https://github.com/linkedin/cruise-control
    9. Kafka telemetry (KIP-714), https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
    10. TSMC with Kafka, https://github.com/twjug/jobs/issues/59
    11. E.SUN Bank with Kafka, https://blog.coscup.org/2022/07/esunbank.html
    12. Appier with Kafka, https://blog.sayuan.net/posts/kafka-reassignment-strategies/
    13. Cathay United Bank with Kafka, https://www.ithome.com.tw/news/138049
    14. Trend Micro with Kafka, https://www.ithome.com.tw/news/111150
    15. Prometheus, https://prometheus.io/
    16. OpenAI and Ray, https://www.youtube.com/watch?v=CqiL5QQnN64
    17. Hugging Face, https://huggingface.co/datasets/nyu-mll/glue/tree/main/cola
    18. Fine-tune a Hugging Face Transformers Model, https://docs.ray.io/en/latest/train/examples/transformers/huggingface_text_classification.html
    19. Yuansheng Lou, Lin Chen, Feng Ye, Yong Chen, and Zihao Liu. Research and implementation of an aquaculture monitoring system based on Flink, MongoDB and Kafka. In João M. F. Rodrigues, Pedro J. S. Cardoso, Jânio Monteiro, Roberto Lam, Valeria V. Krzhizhanovskaya, Michael H. Lees, Jack J. Dongarra, and Peter M. A. Sloot, editors, Computational Science – ICCS 2019, pages 648–657, Cham, 2019. Springer International Publishing.
    20. Dimitris Stripelis, José Luis Ambite, Yao-Yi Chiang, Sandrah P. Eckel, and Rima Habre. A scalable data integration and analysis architecture for sensor data of pediatric asthma. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pages 1407–1408, 2017.
    21. N. Sudhakar Yadav, B. Eswara Reddy, and K. G. Srinivasa. Cloud-based healthcare monitoring system using Storm and Kafka. Towards Extensible and Adaptable Methods in Computing, pages 99–106, 2018.
    22. R. Hofmann, R. Klar, B. Mohr, A. Quick, and M. Siegle. Distributed performance monitoring: methods, tools and applications. IEEE Transactions on Parallel and Distributed Systems, 5(6):585–598, 1994.
    23. Jeffrey Joyce, Greg Lomow, Konrad Slind, and Brian Unger. Monitoring distributed systems. ACM Trans. Comput. Syst., 5(2):121–150, March 1987.
    24. Robbert Van Renesse, Kenneth P. Birman, and Werner Vogels. Astrolabe: A robust and scalable technology for distributed system monitoring, management and data mining. ACM Trans. Comput. Syst., 21(2):164–206, May 2003.
    25. Guenter Hesse, Christoph Matthies, and Matthias Uflacker. How fast can we insert? An empirical performance evaluation of Apache Kafka. 2020 IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS), pages 641–648, 2020.
    26. Eno Thereska, Brandon Salmon, John Strunk, Matthew Wachs, Michael Abd-El-Malek, Julio Lopez, and Gregory R. Ganger. Stardust: Tracking activity in a distributed storage system. SIGMETRICS Perform. Eval. Rev., 34(1):3–14, June 2006.
    27. D. Wybranietz and D. Haban. Monitoring and performance measuring distributed systems during operation. SIGMETRICS Perform. Eval. Rev., 16(1):197–206, May 1988.
    28. Graphite, https://graphiteapp.org/
    29. OpenTSDB, https://github.com/OpenTSDB/opentsdb
    30. InfluxDB, https://www.influxdata.com/
    31. Use ByteBufferOutputStream to avoid array copy, https://issues.apache.org/jira/browse/KAFKA-16397
    32. Avoid the extra bytes copy when compressing telemetry payload, https://issues.apache.org/jira/browse/KAFKA-17847
    33. Consider using zero-copy for PushTelemetryRequest, https://issues.apache.org/jira/browse/KAFKA-17867
    34. Ray Training, https://docs.ray.io/en/latest/train/train.html
    35. COCO 2017 dataset, https://www.kaggle.com/datasets/awsaf49/coco-2017-dataset
    36. ResNet, https://docs.pytorch.org/vision/main/models/generated/torchvision.models.resnet18.html
