| Graduate student: | 唐健翔 Tang, Chien-Hsiang |
|---|---|
| Thesis title: | 支援資料串流之Hadoop資料服務 The Hadoop Data Service Capable of Streaming |
| Advisor: | 蕭宏章 Hsiao, Hung-Chang |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2019 |
| Graduation academic year: | 107 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 39 |
| Chinese keywords: | Hadoop, HDFS, TDS, Kafka, 分散式儲存 (distributed storage) |
| Foreign-language keywords: | Kafka, TDS, Distributed Storage |
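As data volumes climb year after year, single-machine databases can no longer supply the resources required by applications built on massive data. Hadoop is a computing framework and distributed storage system. Its storage layer, the Hadoop Distributed File System (HDFS), is a distributed file system with high fault tolerance and high concurrency. To lower the learning cost of moving data from traditional single-machine storage into HDFS, a data transfer relay system that moves data through a REST API, namely TDS, was developed previously. However, TDS supports a relatively limited set of file systems and databases, and connecting it to each new system carries a development cost. We therefore aim to support many data systems indirectly by connecting TDS to a system that already has connections to many different external systems.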
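This thesis uses Kafka as that relay system. Kafka is often placed in front of computation engines or databases, so many third-party resources for connecting it to various databases have accumulated online, which makes it well suited as a relay system for extending TDS. To ensure data integrity, this thesis identifies several scenarios that can cause data loss. One is Kafka server congestion causing messages to be lost at the sender. Another is a large speed gap between the sender and the receiver, which lets a large backlog of unconsumed messages accumulate inside Kafka and creates a risk of message loss. To address these issues, we implement methods such as dynamic flow rate control and retransmission of missing files on the consumer side, and we evaluate their impact by measuring relevant performance metrics.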
This thesis lets TDS connect to external systems via Kafka and explores several scenarios that may result in data loss. For example, Kafka server congestion, or a Kafka partition accumulating a large number of unconsumed messages, may cause message loss. These problems are addressed by methods such as dynamic flow rate control and retransmission of missing files on the consumer side, and the impact of these methods is evaluated by measuring relevant performance indicators.
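The abstract does not say how the dynamic flow rate control is implemented. As a rough illustration only, the sketch below throttles a sender based on the backlog of unconsumed messages, using additive increase and multiplicative decrease in the spirit of TCP congestion control. The class name `LagBasedRateController`, the function `simulate`, and every threshold and step value are assumptions made for this example, not details taken from the thesis or from the Kafka API.

```python
from dataclasses import dataclass


@dataclass
class LagBasedRateController:
    """Hypothetical helper: adjusts a sender's rate from the observed backlog."""

    rate: float = 1000.0          # current send rate, messages per second
    min_rate: float = 100.0       # floor: never throttle below this rate
    max_rate: float = 10000.0     # ceiling: never send faster than this rate
    lag_threshold: int = 5000     # backlog size treated as a congestion signal
    increase_step: float = 200.0  # additive increase while the backlog is small
    decrease_factor: float = 0.5  # multiplicative decrease when the backlog is large

    def update(self, lag: int) -> float:
        """Return the new send rate given the current number of unconsumed messages."""
        if lag > self.lag_threshold:
            # Consumer is falling behind: back off quickly.
            self.rate = max(self.min_rate, self.rate * self.decrease_factor)
        else:
            # Consumer is keeping up: probe for more throughput slowly.
            self.rate = min(self.max_rate, self.rate + self.increase_step)
        return self.rate


def simulate(lags):
    """Feed a sequence of backlog observations to a fresh controller."""
    controller = LagBasedRateController()
    return [controller.update(lag) for lag in lags]


if __name__ == "__main__":
    # Backlog stays low twice, spikes twice, then recovers.
    print(simulate([0, 1000, 8000, 9000, 2000]))
```

In a real deployment the backlog value would come from comparing the partition's latest offset with the consumer group's committed offset. How the thesis measures it is not stated in the abstract.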
On campus: publicly available from 2024-07-01