| Graduate Student | 謝亦凡 (Xie, Yi-Fan) |
|---|---|
| Thesis Title | 基於Hadoop之GPU叢集的大資料Python平行運算 (Massively Data Parallel Computation with Python over GPU-Enabled Hadoop) |
| Advisor | 蕭宏章 (Hsiao, Hung-Chang) |
| Degree | Master |
| Department | College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering |
| Year of Publication | 2017 |
| Graduation Academic Year | 105 |
| Language | Chinese |
| Number of Pages | 30 |
| Chinese Keywords | Hadoop YARN, Python, GPU |
| Foreign Keywords | Hadoop YARN, Python, GPU |
DRS (Distributed R Service) is a framework for executing R programs in a distributed fashion. By distributing the computation it overcomes the performance bottleneck of a single-machine environment, particularly for computations over large amounts of data. In production environments Python is also a scripting language widely used by data analysts, so whether the DRS distributed computing framework can also serve Python programs becomes a question worth addressing. From our earlier experience parallelizing R programs we know that parallel execution improves performance; at the same time, recent breakthroughs in applying GPUs to high-speed computation (for example, building artificial-intelligence models) lead us to a further question: can GPUs bring an additional performance gain to analytic computations running on a DRS-based distributed environment?
In this work, building on a study of YARN resource monitoring, we extend the DRS framework to provide a cluster computing service for parallel Python programs. The service includes modules for GPU resource management and GPU resource monitoring. By porting Indicator, a statistics program used in a production environment, to the platform we developed, we investigate the complexity of rewriting the Indicator program in Python on a cluster equipped with GPU computing resources, examine how such a program can effectively exploit the advantages of GPU computation, and discuss the performance obstacles that may arise.
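The monitoring module's code is not reproduced in this record; purely as a hedged illustration, the sketch below shows one common way such a GPU-monitoring module could poll per-device utilization and memory by parsing the CSV output of NVIDIA's nvidia-smi tool. The function name `query_gpu_status` and the returned dictionary layout are assumptions for illustration, not the thesis's actual interface.

```python
import subprocess

def query_gpu_status():
    """Poll per-GPU utilization and memory by parsing nvidia-smi CSV output.

    Illustrative sketch only; the return format is an assumption.
    """
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=index,utilization.gpu,memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ]).decode("utf-8")

    gpus = []
    for line in out.strip().splitlines():
        idx, util, used, total = (field.strip() for field in line.split(","))
        gpus.append({
            "gpu_index": int(idx),          # device ordinal
            "utilization_pct": int(util),   # GPU core utilization in percent
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
        })
    return gpus
```

A resource-management module could periodically run such a probe on each node and report free GPUs back to the Application Master when deciding where to launch containers.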
Python is one of the most popular languages in the world, and more and more data analysts choose it as their tool for data analysis. Many big-data processing frameworks, such as Spark and Hadoop Streaming, let users drive them with Python. Although this is very convenient, it still requires the developer to have basic knowledge of distributed systems.
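As background for the claim above (not part of the thesis itself), a minimal word-count mapper shows how Hadoop Streaming lets a plain Python script take part in a MapReduce job: the framework pipes input records to the script's standard input and reads tab-separated key/value pairs from its standard output.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper: read lines from stdin and
# emit "word<TAB>1" pairs on stdout for a reducer to aggregate.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
```

Such a script is submitted with the hadoop-streaming JAR via the -input, -output, -mapper and -reducer options; the developer still has to reason about how records are partitioned and shuffled, which is the kind of distributed-system detail referred to above.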
DRS (Distributed R Service) is a distributed data processing framework based on Hadoop YARN. It has three main components: the Application Master, the Client, and the Container. In this paper, we summarize the design, current state, and implementation of our application, which provides a distributed Python service and GPU resource management on top of DRS. Our approach offers another distributed solution for Python users and explores what must be prepared when building a GPU cluster. To provide a simple user interface and hide distributed-system issues from the user, we extend the functionality of the Application Master and the Container, so that the AM can allocate and manage GPU resources and the Container can execute Python programs. We then present the CPU algorithm of Indicator and design a GPU version of the Indicator algorithm.
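The Indicator program itself is not reproduced here; purely as an illustrative sketch of the kind of GPU offloading described, the hypothetical PyCUDA fragment below moves an element-wise computation onto the GPU. The kernel name and the scaling operation are placeholders, not the actual Indicator algorithm.

```python
import numpy as np
import pycuda.autoinit            # creates a CUDA context on the default GPU
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# Placeholder kernel: scale every element of an array by a constant factor.
mod = SourceModule("""
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}
""")
scale = mod.get_function("scale")

data = np.random.randn(1 << 20).astype(np.float32)
gpu_buf = cuda.mem_alloc(data.nbytes)
cuda.memcpy_htod(gpu_buf, data)                      # host -> device

threads = 256
blocks = (data.size + threads - 1) // threads
scale(gpu_buf, np.float32(2.0), np.int32(data.size),
      block=(threads, 1, 1), grid=(blocks, 1))

cuda.memcpy_dtoh(data, gpu_buf)                      # device -> host
```

A Container extended as described would launch a Python process of this kind after the Application Master has assigned it a GPU.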
The resulting design is evaluated through performance experiments, and the results are in line with our expectations.