
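Graduate student: 李信穎
Lee, Hsin-Ying
Thesis title: 具通透性之Hadoop資料服務
A Transparent Hadoop Data Service
Advisor: 蕭宏章
Hsiao, Hung-Chang
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2018
Academic year of graduation: 106 (ROC calendar)
Language: Chinese
Number of pages: 28
Keywords (Chinese): Hadoop, HDFS, HBase, 分散式儲存 (distributed storage)
Keywords (English): Hadoop, HDFS, HBase, Distributed data store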
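  • Hadoop is a computing framework and distributed storage system. Its storage layer, the Hadoop Distributed File System (HDFS), is designed to store very large files but cannot efficiently handle large numbers of small data items. Many approaches to the small-data problem exist, but they require considerable extra processing steps from users. HBase, a distributed database built on top of HDFS, provides efficient random access and can be used to address the small-data problem. However, both systems take time to learn and are written in Java, which is a high barrier for ordinary users.
    This thesis proposes a transparent distributed storage system. It is designed to solve the small-data problem on HDFS and supports the HDFS Interface for compatibility with applications in the Hadoop ecosystem, so that users of projects such as Hive and Spark can access data through the system without modifying their code. It also provides a simple Web API that hides the complex operations behind the big data platform, letting users easily import data into the platform through the API.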

    Hadoop is an open-source distributed processing framework and storage system for big data. Its storage layer is called the Hadoop Distributed File System (HDFS). HDFS is designed for storing very large files with streaming data access patterns, but it cannot effectively handle large numbers of small files. Although there are many ways to solve the small-data problem, users still need to perform a lot of extra processing. HBase is a distributed database that is often paired with Hadoop and provides efficient random access, so it can be used to solve the small-data problem. However, both systems take time to learn and are written in Java, which presents a high barrier for data analysts. This thesis proposes a transparent distributed storage system that is designed to solve the small-data problem on HDFS and supports the HDFS Interface for compatibility with Hadoop ecosystem projects such as Hive and Spark. Users can access the data directly without changing any code. The system also provides a simple Web API that hides the data platform's complex operations and lets users migrate data into the data platform from other file servers through the API.
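    To make the small-file limitation concrete, the following is a rough back-of-the-envelope sketch, not taken from the thesis. It assumes the commonly quoted rule of thumb that each namespace object (a file or a block) costs the HDFS NameNode about 150 bytes of memory, and it assumes a 128 MiB block size. The function name `namenode_bytes` and the 1 TiB workload are hypothetical and chosen only for illustration.

```python
# Rule-of-thumb estimate of NameNode memory consumed by file metadata.
# Assumption: ~150 bytes per namespace object (file or block). This is an
# approximation, not a measured value for any particular cluster.
import math

BYTES_PER_OBJECT = 150          # assumed rule of thumb
BLOCK_SIZE = 128 * 1024 ** 2    # assumed HDFS block size (128 MiB)


def namenode_bytes(num_files: int, file_size: int) -> int:
    """Estimate NameNode memory for num_files files of file_size bytes each."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * BYTES_PER_OBJECT


total_data = 1024 ** 4  # the same 1 TiB of user data in both scenarios

# Scenario A: the data is stored as 100 KiB files.
small = namenode_bytes(total_data // (100 * 1024), 100 * 1024)
# Scenario B: the data is stored as 1 GiB files.
large = namenode_bytes(total_data // (1024 ** 3), 1024 ** 3)

print(f"100 KiB files: {small / 1024 ** 2:.1f} MiB of metadata")
print(f"1 GiB files:   {large / 1024 ** 2:.3f} MiB of metadata")
```

    Under these assumptions, the same terabyte of data costs the NameNode about 3 GiB of memory when stored as 100 KiB files and a little over 1 MiB when stored as 1 GiB files, which is why keeping small data in a store such as HBase rather than as individual HDFS files can be attractive.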

    Abstract (Chinese)
    Extended Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1. Introduction
    Chapter 2. Background
        2.1 HDFS
        2.2 HBase
        2.3 The Small-Data Problem on HDFS
        2.4 A Common Interface
    Chapter 3. System Client
        3.1 Web APIs
            3.1.1 File Transfer
        3.2 Authorization
            3.2.1 Authorization Settings
        3.3 Mapping
        3.4 Support for Hadoop Ecosystem Projects
        3.5 Starting/Stopping HDS and System Parameters
    Chapter 4. System Architecture
        4.1 HTTP Server
            4.1.1 Connection limits
            4.1.2 Load Balancer
        4.2 Lock Manager
        4.3 Task Manager
        4.4 Transfer
        4.5 Metrics & Time Phase Logger
        4.6 HDFS Interface
    Chapter 5. HDS Storage Architecture
        5.1 Read and Write Flows
        5.2 HDS Directory Structure Design
        5.3 Table and Schema
        5.4 Managing Large Data on HDFS
    Chapter 6. Experiments
        6.1 Test Environment
        6.2 Overhead
        6.3 Load Balance
        6.4 Scalability and Fault Tolerance
        6.5 Transparency
    Chapter 7. Conclusion
    References

    Full text, on campus: public from 2023-12-31
    Full text, off campus: public from 2023-12-31