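Graduate student: 唐健翔 (Tang, Chien-Hsiang)
Thesis title: 支援資料串流之Hadoop資料服務 (The Hadoop Data Service Capable of Streaming)
Advisor: 蕭宏章 (Hsiao, Hung-Chang)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2019
Academic year of graduation: 107 (ROC calendar)
Language: Chinese
Number of pages: 39
Keywords (Chinese): Hadoop, HDFS, TDS, Kafka, 分散式儲存
Keywords (English): Kafka, TDS, Distributed Storage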
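  • As data volumes climb year after year, single-machine databases are increasingly unable to supply the resources that applications built on massive data require. Hadoop is a computing framework and distributed storage system; its storage system, the Hadoop Distributed File System (HDFS), is a distributed file system with high fault tolerance and high concurrency. To lower the learning cost of moving data from traditional single-machine storage to HDFS, a data transfer relay system named TDS, which moves data through a REST API, was developed previously. However, the file systems and database types that TDS supports are relatively limited. To save the development cost of connecting TDS to each new system, we aim to support many data systems indirectly by connecting TDS to a system that many different external systems already connect to.
    This thesis uses Kafka as that relay system. Kafka is often placed in front of computing engines or databases, so many third-party resources for connecting it to various databases have accumulated online, which makes it well suited as a relay for extending TDS. To ensure data integrity, the thesis identifies several scenarios that can cause data loss: Kafka server congestion that causes message loss at the sender, and an excessive speed gap between sender and receiver that builds up a large backlog of unconsumed messages inside Kafka and creates a risk of message loss. To address these issues, we implement methods such as dynamic flow rate control and consumer-side retransmission of missing files, and we measure relevant performance metrics to evaluate the impact of these methods.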

    This thesis lets TDS connect to external systems via Kafka and explores several scenarios that may result in data loss. For example, Kafka server congestion, or a Kafka partition that accumulates many unconsumed messages, may cause message loss. These problems are addressed by methods such as dynamic flow rate control and retransmission of missing files on the consumer side, and the impact of these methods is evaluated by measuring relevant performance indicators.
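
    The abstract names dynamic flow rate control as one remedy for the sender/receiver speed gap but does not describe the algorithm. As a rough illustration only, the Python sketch below shows one way a sender could adapt its rate to consumer lag, using an additive-increase/multiplicative-decrease (AIMD) rule borrowed from TCP congestion control. The class name `LagThrottle`, every threshold and step size, and the AIMD rule itself are assumptions made for this sketch; none of them is taken from the thesis, and how the lag value is read from Kafka is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class LagThrottle:
    """Hypothetical AIMD-style send-rate controller driven by consumer lag.

    All defaults are illustrative assumptions, not values from the thesis.
    """
    rate: float = 1000.0      # current send rate, messages per second
    min_rate: float = 10.0    # never throttle below this rate
    max_rate: float = 10000.0 # never speed up beyond this rate
    lag_limit: int = 5000     # unconsumed-message count treated as "too far behind"
    step: float = 100.0       # additive increase applied while lag is acceptable
    backoff: float = 0.5      # multiplicative decrease applied when lag is too high

    def adjust(self, lag: int) -> float:
        """Update the send rate for an observed lag value and return it."""
        if lag > self.lag_limit:
            # Consumer is falling behind: cut the rate sharply.
            self.rate = max(self.min_rate, self.rate * self.backoff)
        else:
            # Consumer is keeping up: probe for more throughput gradually.
            self.rate = min(self.max_rate, self.rate + self.step)
        return self.rate

    def send_interval(self) -> float:
        """Seconds the sender should wait between messages at the current rate."""
        return 1.0 / self.rate


if __name__ == "__main__":
    throttle = LagThrottle()
    for observed_lag in (100, 8000, 9000, 200):
        print(observed_lag, throttle.adjust(observed_lag))
```

    In use, a sender would call `adjust(lag)` each time it samples the consumer lag and then wait `send_interval()` seconds between messages. The sharp decrease and slow increase are meant to drain a backlog quickly while avoiding oscillation; a real connector would need to tune these values against measured lag.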

    Table of contents:
    Abstract
    Acknowledgements
    Extended Abstract
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1. Introduction
    Chapter 2. Background
      2.1 TDS
      2.2 Kafka
        2.2.1 Producer
        2.2.2 Broker
        2.2.3 Consumer
      2.3 Kafka Connect
      2.4 The speed-gap issue
      2.5 The message congestion issue
      2.6 The retry-timing issue for missing files
    Chapter 3. Related Work
      3.1 Apache Flume
      3.2 Apache Avro
      3.3 Linux file system I/O
      3.4 TCP message sending strategy
    Chapter 4. System Overview
      4.1 System goals
      4.2 System operation
      4.3 System architecture and workflow
        4.3.1 Source Connector
        4.3.2 TDS Handler
        4.3.3 Source Task
        4.3.4 Sink Connector
        4.3.5 Sink Task
      4.4 Solutions to the issues and related components
    Chapter 5. Experiments
      5.1 Experimental environment
      5.2 Lag control under flow rate control
      5.3 Number of timed-out messages under congestion control
      5.4 Dynamic adjustment of the timeout parameter
      5.5 System overhead and scalability
    Chapter 6. Conclusion
    Chapter 7. References

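    Full-text availability: on campus, open from 2024-07-01; off campus, not open.
    The electronic thesis has not yet been authorized for public release; for the print copy, consult the library catalog.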