Author: | Hsu, Jui-Hsing (徐瑞興) |
---|---|
Thesis title: | A Transparent Approach to Run MapReduce Programs on Collaborative Hadoops (一個可將MapReduce程式透通地執行在多個Hadoop平台之方法) |
Advisor: | Shieh, Ce-Kuen (謝錫堃) |
Co-advisor: | Chang, Jyh-Biau (張志標) |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering |
Year of publication: | 2012 |
Academic year of graduation: | 100 |
Language: | English |
Pages: | 41 |
Keywords (Chinese): | 透通 (transparent), 分散式運算 (distributed computing) |
Keywords (English): | MapReduce, Hadoop, transparent, distributed computing |
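MapReduce is a processing framework for large-scale distributed computation. With the rise of distributed computing over large volumes of data, many organizations have built their own data and computing centers to process and analyze data. Hadoop is an open-source implementation of MapReduce, and many organizations now use it to build their own computing and data centers and to develop applications such as web indexing and data mining.
In some cases, federating the Hadoop resources of several organizations brings many benefits. For example, combining the computing resources of several organizations can shorten the overall execution time. Existing approaches to federating Hadoop clusters across organizations, however, can cause a number of problems. For example, users must redesign a MapReduce program specifically for a multi-Hadoop environment, or must reconfigure the whole system environment whenever the computing demand changes. Both are inconvenient for users and break the simplicity of MapReduce.
We propose an approach that lets an existing program run on multiple Hadoop environments without requiring any additional program. In our system, users run the original MapReduce program on an upper-level Hadoop, and the system automatically federates the lower-level Hadoop clusters, including job dispatching and data transfer, without modifying the MapReduce program. Experimental results show a 23% performance improvement in the WordCount 5G case.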
MapReduce is a programming model for data-intensive applications that preserves the simplicity of parallel programming. With the rapid growth of data-intensive applications in distributed computing, many organizations have built clusters with computing resources to store or analyze data. Hadoop is an open-source implementation of MapReduce that has been widely used for applications such as web indexing and data mining.
In some cases, it is favorable to aggregate the resources of several Hadoop clusters. For example, integrating computing nodes from outside the local cluster provides more computing resources and can reduce job execution time. However, existing solutions for aggregating Hadoop clusters have several problems. For example, users need to redesign each application's program for collaborative use, or they need to reconfigure the environments whenever the computation demand changes. Both cause inconvenience for users and thus break the simplicity of MapReduce.
We propose a transparent approach that lets collaborative Hadoop clusters work together without redesigning the program for each application. In our system, users execute jobs through a cloud portal as they would on a single Hadoop cluster, and the system runtime automatically handles the remaining work, including job dispatching, data transfer, program modification, and program execution. The experimental results show that our system provides a 23% performance gain in the WordCount 5G case.
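The following is a minimal sketch, not the thesis's runtime, of why a job such as WordCount can be spread over several clusters without changing the per-cluster program: each cluster counts words in its own share of the input, and the partial counts are summed afterwards. The function names (`local_wordcount`, `dispatch`, `collaborative_wordcount`) and the round-robin split are hypothetical and chosen for illustration. The clusters are simulated in a single Python process, whereas a real system would dispatch Hadoop jobs and move data between clusters.

```python
from collections import Counter
from typing import Iterable, List


def local_wordcount(lines: Iterable[str]) -> Counter:
    """Stand-in for an unmodified WordCount job on one lower-level cluster."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def dispatch(lines: List[str], n_clusters: int) -> List[List[str]]:
    """Hypothetical dispatcher: split the input round-robin across clusters."""
    return [lines[i::n_clusters] for i in range(n_clusters)]


def collaborative_wordcount(lines: List[str], n_clusters: int) -> Counter:
    """Run the same local job on every partition, then merge the partial counts."""
    total: Counter = Counter()
    for partition in dispatch(lines, n_clusters):
        total += local_wordcount(partition)
    return total


if __name__ == "__main__":
    data = ["a b a", "b c", "a"]
    print(collaborative_wordcount(data, n_clusters=2))
```

Because word counts are additive, the merged result is the same as running the job once over the whole input. Jobs whose reduce step is not a simple sum would need a different merge step.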