| Graduate student: | 黃虹橋 Huang, Horng-Chyau |
|---|---|
| Thesis title: | 應用在提昇Hadoop異質環境效能的一種創新方法 IDP: An Innovative Data Placement Strategy for Hadoop in Heterogeneous Environments |
| Advisor: | 謝孫源 Hsieh, Sun-Yuan |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Institute of Medical Informatics |
| Year of publication: | 2014 |
| Academic year of graduation: | 102 (ROC calendar) |
| Language: | English |
| Number of pages: | 56 |
| Keywords (Chinese): | MapReduce, Hadoop, 異質環境 (heterogeneous environments), 資料放置 (data placement) |
| Keywords (English): | MapReduce, Hadoop, Heterogeneous Environments, Data Placement |
Cloud computing, a form of parallel distributed computing, has become a popular way to process massive data sets in recent years. MapReduce is an important programming model for large-scale data-parallel applications and plays a central role in cloud computing. Hadoop is an open-source implementation of the MapReduce model and is commonly used for data-intensive applications such as data mining and web indexing. The current Hadoop implementation assumes that every node in a cluster has the same computing capability and that tasks are data-local. However, these homogeneity and data-locality requirements cannot be satisfied in private clouds or virtualized data centers, which can lead to excessive network traffic and ultimately degrade MapReduce performance. In this thesis, we propose a data placement strategy that addresses the imbalanced workload on DataNodes caused by these default assumptions. The strategy assigns each DataNode an amount of data that corresponds to its computing capability in a heterogeneous Hadoop cluster, so that the cost of transferring data between nodes is reduced and overall Hadoop performance improves. Experimental results show that the proposed strategy reduces execution time and thus improves Hadoop performance in a heterogeneous cluster.
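The core idea in the abstract, giving each DataNode an amount of data that matches its computing capability, can be illustrated with a small sketch. This is not the thesis's IDP algorithm, whose details are not given in this record. It is a minimal example of capability-proportional allocation, assuming that each node has a single numeric capability score and that data is counted in whole blocks. The function name `allocate_blocks`, the node names, and the scores are all hypothetical.

```python
from fractions import Fraction


def allocate_blocks(total_blocks, capabilities):
    """Split total_blocks among nodes in proportion to their capability scores.

    capabilities maps a node name (str) to a positive score (hypothetical
    measure of computing capability). Returns a dict of node name -> blocks.
    """
    if total_blocks < 0:
        raise ValueError("total_blocks must be non-negative")
    if not capabilities or any(c <= 0 for c in capabilities.values()):
        raise ValueError("capabilities must be non-empty and positive")

    total_capability = sum(Fraction(c) for c in capabilities.values())

    # Exact fractional share of the blocks for each node.
    shares = {
        node: Fraction(c) * total_blocks / total_capability
        for node, c in capabilities.items()
    }

    # Give every node the whole-number part of its share first.
    allocation = {node: int(share) for node, share in shares.items()}
    leftover = total_blocks - sum(allocation.values())

    # Hand the remaining blocks to the nodes with the largest fractional
    # remainders (ties broken by node name) so the total is preserved.
    by_remainder = sorted(
        shares, key=lambda node: (-(shares[node] - allocation[node]), node)
    )
    for node in by_remainder[:leftover]:
        allocation[node] += 1

    return allocation


if __name__ == "__main__":
    # Hypothetical cluster: "fast" is three times as capable as "slow".
    print(allocate_blocks(12, {"fast": 3, "medium": 2, "slow": 1}))
    # prints {'fast': 6, 'medium': 4, 'slow': 2}
```

Under this assumption a faster node holds more blocks locally, so it is less likely to finish its local data early and pull blocks from slower nodes over the network. The largest-remainder rounding is a design choice of this sketch that keeps the block total exact when the shares are not whole numbers.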