
Graduate Student: Tai, Tzu-Li (戴資力)
Thesis Title: FedRDD: Federated Resilient Distributed Datasets for Multicluster Computing on Apache Spark
Advisor: Shieh, Ce-Kuen (謝錫堃)
Co-advisor: Chang, Jyh-Biau (張志標)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of Publication: 2016
Academic Year of Graduation: 104 (2015-2016)
Language: English
Number of Pages: 37
Keywords: Apache Spark, Multicluster Computing Systems, Federated Resilient Distributed Datasets
Apache Spark is a distributed in-memory cluster computing system that has recently seen very wide adoption; its ability to perform iterative computations far surpasses that of the more traditional MapReduce framework. Spark's performance advantage stems mainly from its Resilient Distributed Dataset (RDD) abstraction, which lets users easily express multi-stage, complex applications as sequences of RDD transformations. However, like MapReduce, Spark and its RDD abstraction were originally designed for computing on a single cluster. The data sources of many of today's "big data" applications are generated, stored, and maintained across different datacenters, a usage scenario that the original single-cluster design does not fit. Such data are called federated datasets, and to process them efficiently, both academia and industry have begun research and development on multicluster computing systems. Although this research area has thoroughly explored multicluster systems built on the MapReduce framework, RDD-based multicluster systems remain a very new and undeveloped topic. We therefore propose Federated Resilient Distributed Datasets (FedRDD), an extended RDD-based abstraction designed for multicluster computing. We focus on two main goals: 1) user transparency, and 2) optimized data aggregation. To achieve these goals, we design FedRDD as a master-slave hierarchy in which multiple slave RDDs are maintained and controlled by a single master FedRDD. Under this design, a global transformation on a FedRDD can be executed as multiple local regional transformations, while users continue to program against the original RDD interface. In addition, to minimize the volume of cross-cluster data transfer during aggregation operations, we also propose a transformation execution-flow optimization method for FedRDD.

Apache Spark, an in-memory cluster computing system, has grown considerably in popularity and industrial acceptance due to its ability to perform iterative computations far more efficiently than MapReduce. Spark's success lies mainly in its Resilient Distributed Dataset (RDD) abstraction, which allows cluster application designers to flexibly express multi-level, complex workflows as RDD transformations. However, like MapReduce, Spark and the RDD abstraction were originally designed for cluster computing on a single set of tightly coupled nodes. This initial design fails to meet an emerging genre of applications in which input data are generated, stored, and maintained at multiple independent datacenters. Multicluster systems have been designed to efficiently process such federated datasets. While this area of research has thoroughly explored MapReduce-based solutions, RDD-based approaches remain a new and unexplored topic. We propose Federated RDD (FedRDD), an extension of the RDD abstraction for multicluster computing. We focus on two aims: 1) user transparency and 2) optimized data aggregation. To meet these challenges, we design FedRDD as a master-slave hierarchy in which a single master FedRDD maintains multiple slave RDDs. Under this design, a FedRDD global transformation can be executed as regional transformations on the slave RDDs while still being exposed to the user through the original RDD interface. Moreover, we propose a transformation sequence refactoring formulation based on this design that minimizes inter-cluster data transmission during aggregation. As a first effort to explore RDD-based multicluster system solutions, we believe our research provides valuable insight for further work.
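
To make the abstract's two design points concrete, the following minimal Scala sketch illustrates the master-slave idea. It is our illustration under stated assumptions, not code from the thesis: the FedRDD class shown here, its method names, and the use of plain in-memory sequences as stand-ins for per-cluster slave RDDs are all hypothetical simplifications.

```scala
// Hypothetical sketch of the FedRDD master-slave hierarchy.
// Plain Scala sequences stand in for the per-cluster slave RDDs;
// a real system would hold handles to RDDs in remote Spark clusters.
final case class FedRDD[T](slaves: Seq[Seq[T]]) {

  // A global transformation is executed as regional (per-slave)
  // transformations, while the user-facing interface stays RDD-like.
  def map[U](f: T => U): FedRDD[U] =
    FedRDD(slaves.map(_.map(f)))

  // Aggregation optimization: reduce within each cluster first, so only
  // one partial result per cluster crosses the inter-cluster link, then
  // merge the partials at the master. Assumes non-empty slave datasets.
  def reduce(f: (T, T) => T): T =
    slaves.map(_.reduce(f)).reduce(f)
}

object FedRDDDemo extends App {
  // Two "clusters", each holding a regional slice of a federated dataset.
  val fed = FedRDD(Seq(Seq(1, 2, 3), Seq(4, 5, 6)))

  // A global map + reduce, written exactly as a user would against a
  // single ordinary RDD (user transparency).
  val sumOfSquares = fed.map(x => x * x).reduce(_ + _)
  println(sumOfSquares) // 91
}
```

The reduce method mirrors the data-aggregation goal described above: because each cluster collapses its regional data to a single partial result before anything is shipped to the master, inter-cluster traffic is bounded by the number of clusters rather than the size of the federated dataset.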

Chapter 1. Introduction --- p.1
Chapter 2. Background --- p.4
  2.1 Apache Spark Overview --- p.4
  2.2 RDD Internals --- p.5
  2.3 Transformation Dependencies and Staging --- p.6
Chapter 3. Federated Resilient Distributed Dataset (FedRDD) --- p.7
  3.1 Design Issues --- p.7
  3.2 High-Level Abstraction --- p.9
  3.3 Internal Master-Slave Hierarchy --- p.10
  3.4 Transformation Sequence Refactoring --- p.12
Chapter 4. Implementation --- p.17
  4.1 Runtime System Overview --- p.17
    4.1.1 FedRDD Global Stage Scheduler (GSS) --- p.18
    4.1.2 FedRDD Global Task Scheduler (GTS) --- p.19
    4.1.3 Federated Actor System --- p.20
  4.2 FedRDD Runtime Initialization --- p.20
  4.3 FedRDD Task Assignment Flow --- p.22
Chapter 5. Experimental Evaluation --- p.24
  5.1 Environments and Dataset --- p.24
  5.2 Evaluation 1: Multicluster Performance --- p.25
  5.3 Evaluation 2: Refactoring Effectiveness --- p.28
Chapter 6. Related Works --- p.30
  6.1 MapReduce-based Multicluster Computing Systems --- p.30
  6.2 In-memory Cluster Computing --- p.32
Chapter 7. Conclusion --- p.33
References --- p.34

