| Graduate student: | 李信穎 Lee, Hsin-Ying |
|---|---|
| Thesis title: | 具通透性之Hadoop資料服務 A Transparent Hadoop Data Service |
| Advisor: | 蕭宏章 Hsiao, Hung-Chang |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2018 |
| Academic year of graduation: | 106 |
| Language: | Chinese |
| Pages: | 28 |
| Chinese keywords: | Hadoop, HDFS, HBase, 分散式儲存 (distributed storage) |
| Foreign-language keywords: | Hadoop, HDFS, HBase, Distributed data store |
Hadoop is an open-source distributed processing framework and storage system for big data. Its storage layer, the Hadoop Distributed File System (HDFS), is designed to store very large files with streaming data access patterns, but it cannot efficiently handle large numbers of small files. Although many methods exist to address the small-data problem, they require users to perform considerable extra processing. HBase, a distributed database built on top of HDFS, provides efficient random access and can be used to solve the small-data problem. However, both systems take time to learn and both are written in Java, which is a high barrier for general users such as data analysts.
This thesis proposes a transparent distributed storage system. It is designed to solve the small-data problem on HDFS and supports the HDFS interface for compatibility with applications in the Hadoop ecosystem, so that users of projects such as Hive and Spark can access data through the system without modifying their code. The system also provides a simple Web API that hides the complex operations of the big-data platform, letting users easily import data into the platform from other file servers through the API.
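The small-file limitation described in the abstract can be made concrete with rough arithmetic. The sketch below is illustrative only and is not taken from the thesis. It assumes a commonly cited rule of thumb of about 150 bytes of NameNode heap per namespace object (file, directory, or block) and a 128 MiB block size. The function names are hypothetical.

```python
import math

# Assumption: rule-of-thumb heap cost per namespace object (file or block).
BYTES_PER_OBJECT = 150
# Assumption: HDFS block size of 128 MiB.
BLOCK_SIZE = 128 * 1024 * 1024


def namenode_bytes(num_files: int, file_size: int) -> int:
    """Estimate NameNode heap used by num_files files of file_size bytes each."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    # One object for the file entry plus one object per block.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT


def packed_file_count(num_files: int, file_size: int) -> int:
    """Number of block-sized container files needed to hold the same bytes."""
    return max(1, math.ceil(num_files * file_size / BLOCK_SIZE))


if __name__ == "__main__":
    n, size = 10_000_000, 10 * 1024  # ten million files of 10 KiB each
    individually = namenode_bytes(n, size)
    packed = namenode_bytes(packed_file_count(n, size), BLOCK_SIZE)
    print(f"stored individually: {individually / 1e9:.1f} GB of NameNode heap")
    print(f"packed into block-sized files: {packed / 1e6:.2f} MB of NameNode heap")
```

Under these assumptions, ten million 10 KiB files cost roughly 3 GB of NameNode heap when stored individually, but well under 1 MB when the same bytes are packed into block-sized files. This gap is the motivation for keeping small data out of the HDFS namespace, for example by storing it in a system such as HBase.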