| Graduate student: | 施韋銨 Shih, Wei-An |
|---|---|
| Thesis title: | Hadoop分散式R運算服務之智慧及動態資源配置 Intelligent, Adaptive Resource Allocation for Distributed R Computing Service over Hadoop |
| Advisor: | 蕭宏章 Hsiao, Hung-Chang |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2018 |
| Academic year of graduation: | 106 |
| Language: | Chinese |
| Number of pages: | 31 |
| Keywords (Chinese): | Hadoop YARN, R, 分散式 (distributed) |
| Keywords (English): | Hadoop YARN, R, distributed |
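R is one of the scripting languages most commonly used today for statistics and plotting, but by design it runs on a single thread. Many parallelization packages now exist, but they parallelize computation at the granularity of CPU cores, so R users facing large volumes of data need a server-class environment to solve their problems. Distributed R Service (DRS), as the name suggests, is a service that distributes R programs. It is a distributed computing framework proposed by researchers to address the problem R programs encounter when processing large data on a single machine: insufficient resources lead to poor computational efficiency or to job failure. To use DRS, an R user only needs to know how to define a DRS job. Because DRS hides the details of distributed execution, users do not have to write anything beyond their original R program, which lowers the learning barrier. DRS is an application service built on Hadoop YARN. Like Spark and MapReduce (MR), it relies on YARN's cluster resource management and allocation, and on top of that it builds a workflow suited to running R. It also provides distributed support features such as dynamic task assignment, task scheduling, error recovery, and user-defined R functions.
This thesis explains how, for a job with a fixed amount of resources, to solve the problem of jobs failing because R tasks run out of memory, as well as the poor resource utilization of the original design. With these designs, DRS can be used more flexibly, and users can adjust the corresponding settings to suit the needs of their own environment.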
R is a programming language for statistical computing and graphics. It is widely used among statisticians and data miners for developing statistical software and for data analysis. The Distributed R Service (DRS) is a service that distributes R programs on the Hadoop computing platform. Unlike R-Hadoop, SparkR, and Distributed R, DRS is friendly to use: it hides the distributed implementation from users, so they do not need to modify the logic of their R code. Users only set up the configuration and supply their R code to get the benefit of distributed execution.
In this thesis, we show how to schedule the fixed resources of a DRS job and how to handle tasks that fail because of insufficient memory. With our new design, DRS gains more elasticity in its use of resources. DRS users can configure resource-related settings; otherwise, DRS itself adjusts the resources to an appropriate combination of containers. Finally, we present the performance improvements that DRS achieves.
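The abstract does not give the allocation algorithm itself. As a rough illustration of the general idea only, the following Python sketch shows one way a fixed memory budget could be re-divided into fewer, larger containers after a task fails for lack of memory. The names `ContainerPlan`, `plan_containers`, and `replan_after_oom`, the equal-size layout, and the doubling policy are all assumptions made for this example, not the method used in DRS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerPlan:
    """Hypothetical layout: `count` containers of `memory_mb` each."""
    count: int
    memory_mb: int


def plan_containers(budget_mb: int, memory_mb: int) -> ContainerPlan:
    """Fit as many equally sized containers as the fixed budget allows."""
    if memory_mb <= 0 or memory_mb > budget_mb:
        raise ValueError("container size must be positive and fit in the budget")
    return ContainerPlan(count=budget_mb // memory_mb, memory_mb=memory_mb)


def replan_after_oom(plan: ContainerPlan, budget_mb: int,
                     growth: float = 2.0) -> ContainerPlan:
    """After an out-of-memory failure, grow the container size (capped at
    the budget) and recompute how many containers still fit.

    The doubling default is an assumed policy chosen for illustration.
    """
    if growth <= 1.0:
        raise ValueError("growth factor must be greater than 1")
    bigger = min(int(plan.memory_mb * growth), budget_mb)
    if bigger == plan.memory_mb:
        raise RuntimeError("container already uses the whole budget")
    return plan_containers(budget_mb, bigger)


if __name__ == "__main__":
    budget = 16384  # total memory granted to the job, in MB (made-up value)
    plan = plan_containers(budget, 2048)
    print(plan)
    # Simulate one task failing for lack of memory, then re-plan.
    print(replan_after_oom(plan, budget))
```

The total budget never changes in this sketch; only the trade-off between parallelism (container count) and memory per task moves, which is one plausible reading of "an appropriate combination of containers".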