Graduate student: | 陳仲毅 Chen, Jhong-Yi |
---|---|
Thesis title: | 使用多重任務佇列以改善Hadoop中之資料地域性 Using Multi-Task Queues to Improve Data Locality in Hadoop |
Advisor: | 謝孫源 Hsieh, Sun-Yuan |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
Year of publication: | 2013 |
Academic year of graduation: | 101 |
Language: | English |
Number of pages: | 50 |
Keywords (Chinese): | 資料地域性 (data locality), Hadoop, 任務排程演算法 (task scheduling algorithm) |
Keywords (English): | data locality, Hadoop, task assignment scheduling |
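Distributed computing systems have become increasingly widely used in recent years, mainly to process the huge volumes of data produced by the rapid development of communication technology. Among the many distributed computing systems, the best known and most widely discussed is Hadoop, developed by Apache. Under Hadoop's default task scheduling, however, the job manager is fairly likely to pick a task for a worker node whose required data does not reside on that node. The worker node must then fetch the data from other worker nodes, and these extra data transfers are likely to lengthen the completion time of the whole job and can even turn the racks that interconnect the system's network into bandwidth bottlenecks or cause network congestion. We therefore propose a scheduling algorithm that considers task locality globally and balances the workload. First, we give every worker node its own task queue and place each task in the queue with the best locality for it. Then we compute a balanced processing-time limit and, based on this limit, reassign tasks at the lowest cost. Compared with the default scheduling, our method effectively reduces the amount of data transferred and improves the computing performance of the system.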
Distributed computing systems such as cloud computing are becoming more and more popular, and Hadoop is one of the best-known such systems. Because Hadoop's default task scheduling is first-come, first-served (FCFS), which is inefficient, the master is likely to assign slaves tasks that lack data locality. This causes many unnecessary data transfers among slaves, which directly increase job execution times and make racks the network bandwidth bottleneck. In this paper, we present a scheduling algorithm that globally considers both the data locality of tasks and load balance. First, we create multiple queues, one for every slave, and put each task into the queue with the best data locality. Next, we compute a time-balancing limit and shift tasks according to this limit to obtain a better task assignment. Compared with the default scheduling, the proposed method consistently requires less data transfer and improves the performance of the computing system.
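The two steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the thesis's implementation: every task is assumed to take one unit of time, the time-balancing limit is assumed to be the ceiling of the number of tasks divided by the number of slaves, and a simple greedy shift stands in for the lowest-cost reassignment. The names `assign_tasks` and `remote_tasks` and the sample input are hypothetical.

```python
from math import ceil


def assign_tasks(tasks, slaves):
    """Return {slave: [task, ...]}.

    tasks  : dict mapping a task id to the set of slaves holding its input data
    slaves : list of slave names
    """
    queues = {s: [] for s in slaves}

    # Step 1: one queue per slave; put each task in the shortest queue
    # among the slaves that hold its data (any slave if none does).
    for task, replicas in tasks.items():
        local = [s for s in slaves if s in replicas]
        candidates = local or slaves
        target = min(candidates, key=lambda s: len(queues[s]))
        queues[target].append(task)

    # Step 2 (assumption): with unit-time tasks, the balancing limit is
    # the ceiling of the average queue length.
    limit = ceil(len(tasks) / len(slaves))

    # Step 3 (greedy stand-in): shift tasks out of queues above the limit,
    # preferring moves that keep the task local (cost 0) over remote (cost 1).
    for src in slaves:
        while len(queues[src]) > limit:
            best = None
            for task in queues[src]:
                for dst in slaves:
                    if dst == src or len(queues[dst]) >= limit:
                        continue
                    cost = 0 if dst in tasks[task] else 1
                    if best is None or cost < best[0]:
                        best = (cost, task, dst)
            if best is None:
                break
            _, task, dst = best
            queues[src].remove(task)
            queues[dst].append(task)
    return queues


def remote_tasks(queues, tasks):
    """Count tasks queued on a slave that does not hold their data."""
    return sum(1 for s, q in queues.items() for t in q if s not in tasks[t])


if __name__ == "__main__":
    sample = {"t1": {"A", "B"}, "t2": {"A"}, "t3": {"A"}, "t4": {"A"}}
    result = assign_tasks(sample, ["A", "B"])
    print(result, "remote tasks:", remote_tasks(result, sample))
```

In this sample all four tasks start in the queue of slave A; the shift step first moves `t1`, which is also local to B, and only then moves a task that must read its data remotely.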