| Graduate student: | 曾冠博 Tseng, Kuna-Po |
|---|---|
| Thesis title: | 基於 Hadoop 之網絡資料轉傳服務系統 The Web-based Data Service over Hadoop |
| Advisor: | 蕭宏章 Hsiao, Hung-Chang |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2017 |
| Academic year of graduation: | 105 |
| Language: | Chinese |
| Number of pages: | 36 |
| Keywords (Chinese): | Hadoop, HDFS, HBase, 分散式儲存 (distributed storage) |
| Keywords (English): | Hadoop, HDFS, HBase, Distributed Storage System |
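Hadoop is a distributed storage and computing framework whose storage system, HDFS, is designed to handle massive data. HDFS falls short, however, when storing small data items. Existing solutions can address this problem in isolation, but they require users to carry out many additional processing steps. Big data, moreover, does not consist only of large files; large numbers of small files are also common. When both kinds of data are present, no current system handles both cases without extra effort and without raising performance and operational concerns. In addition, Hadoop-related systems such as HDFS and HBase each take extra time to learn, and both must be used through Java, which is a considerable barrier to entry for many developers.
In this thesis, we propose a distributed storage system designed to hide the complex operations behind the big data platform. It exposes a simple API that is convenient to use from any programming language, and it combines HBase and HDFS to solve the HDFS small-file problem. The system also supports access to many different kinds of file servers, so users can import files from a variety of sources through the API it defines.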
Traditional storage systems such as stand-alone file servers and relational databases are designed for small data sets and cannot accommodate the data sets of the big data era. Apache Hadoop is an indispensable solution for big data that scales out by adding storage servers. Hadoop is not only a distributed storage platform but also supports computational frameworks including MapReduce and Spark. Because of its centralized metadata management scheme, however, Hadoop is ill-suited to storing and managing a large set of small data items. Moreover, the Java programming language, which appears extensively in Hadoop, is typically a high barrier and imposes a lengthy learning curve on data analysts. To this end, we propose a novel Web-based Data Service over Hadoop (WDSH) in this thesis. WDSH relies on the Hadoop Distributed File System (HDFS) and the Hadoop Database (HBase) as its underlying storage layer to attack the problem of large numbers of small files. For ease of use, WDSH exposes its application-level interfaces (APIs) in the form of HTTP URLs. WDSH has been extensively tested and has been deployed in production systems for over a year. Its performance results reveal that WDSH is highly scalable while introducing overheads comparable to those of Hadoop. As WDSH operates daily and has stored several terabytes of data, we are currently extending it as a storage layer for applications that use the popular query processor Hive and the emerging computational framework Spark.
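One way to picture the HDFS-plus-HBase combination described in the abstract is a store that routes each file to a backend according to its size. The sketch below is only an illustration under stated assumptions: the abstract does not say how WDSH classifies files, so the size threshold, the `HybridStore` class, and its method names are all hypothetical, and plain dictionaries stand in for HBase and HDFS.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical cutoff: the abstract does not state how WDSH decides
# which files count as "small". 1 MiB is assumed purely for illustration.
SMALL_FILE_THRESHOLD = 1 * 1024 * 1024


@dataclass
class HybridStore:
    """In-memory stand-in for a store that splits files across two backends."""

    threshold: int = SMALL_FILE_THRESHOLD
    # Dictionaries stand in for the real systems in this sketch.
    table_store: Dict[str, bytes] = field(default_factory=dict)  # plays the role of HBase
    file_store: Dict[str, bytes] = field(default_factory=dict)   # plays the role of HDFS

    def backend_for(self, size: int) -> str:
        """Pick a backend name from the payload size alone."""
        return "hbase" if size < self.threshold else "hdfs"

    def put(self, path: str, data: bytes) -> str:
        """Store data under path and return the name of the backend used."""
        backend = self.backend_for(len(data))
        # Drop any earlier copy so a path never lives in both backends.
        self.table_store.pop(path, None)
        self.file_store.pop(path, None)
        if backend == "hbase":
            self.table_store[path] = data
        else:
            self.file_store[path] = data
        return backend

    def get(self, path: str) -> bytes:
        """Read data back without the caller knowing which backend holds it."""
        if path in self.table_store:
            return self.table_store[path]
        return self.file_store[path]
```

The caller sees only `put` and `get`, so the choice of backend stays hidden, which matches the stated goal of hiding the platform's complexity behind a simple interface.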
On-campus access: publicly available from 2022-08-01