
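Graduate student: 黃冠傑
Huang, Kuan-Chieh
Thesis title: 聯邦Hadoop疊代運算機制之研究
FedLoop: HaLooping on Federated Hadoops
Advisor: 謝錫堃
Shieh, Ce-Kuen
Co-advisor: 張志標
Chang, Jyh-Biau
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of publication: 2013
Academic year of graduation: 101 (ROC calendar)
Language: English
Number of pages: 37
Keywords (translated from Chinese): Cloud Computing, Iterative Computing, Federated Computing, MapReduce, Hadoop
Keywords (English): Cloud Computing, Iterative computing, MapReduce, Hadoop, Federated Hadoop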
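  • With the rapid development of mobile devices and network bandwidth, the volume of data generated and stored over networks has grown extremely quickly, which has spurred the invention of many large-scale data processing systems. The Apache Hadoop project implements a parallel distributed computing framework, "MapReduce", and a distributed file system, the Hadoop Distributed File System (HDFS), so that users can quickly and easily deploy computing clusters and write programs to process large amounts of data.
    However, iterative computation accounts for a significant share of today's workloads, and the MapReduce framework was not designed with iterative computation in mind, so performance is unsatisfactory when users run iterative computations on it. In addition, some computing centers and research institutes that deploy Hadoop clusters cannot adjust the cluster size and environment dynamically on demand; when resources are insufficient, processing very large datasets takes too long. In such cases, federating the resources of multiple clusters can solve the problem effectively and promptly.
    In this thesis, we propose a hybrid MapReduce computing system called "FedLoop". With it, users can easily federate multiple clusters to run iterative computations, while the system retains the ability to run single-round jobs and single-cluster jobs. Users can apply the system in different environments to solve problems quickly and effectively, without modifying their original programs. Experimental results show that, for the PageRank application on 11 GB of data, using four times the computing resources yields a 1.9x performance improvement.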

    With the rapid development of mobile devices and network bandwidth, the amount of data produced over networks has grown rapidly, prompting the appearance of several data-intensive processing systems. Hadoop implements a parallel (distributed) computing framework and a distributed file system, known as the “MapReduce programming model” and the “Hadoop Distributed File System”, respectively. Users can quickly and easily deploy their own clusters to process large data sets.
    However, iterative applications account for an important class of all applications. MapReduce lacks built-in support for iterative applications, which causes poor performance when users run iterative programs on it. In addition, some organizations deploy their own Hadoop clusters at a moderate scale, and computation takes too long when resources are insufficient for very large amounts of data. In this case, combining the resources of multiple clusters can effectively solve the problem.
    In this thesis we propose a hybrid system called “FedLoop”. The system makes it easy to combine the resources of multiple clusters to execute iterative applications, while retaining single-round and single-cluster computing capability. With our system, users do not need to redesign their programs when executing across multiple clusters. The experimental results show that our system provides a 1.9x performance gain when using 4 times the resources in the 11 GB PageRank case.
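    The iterative pattern the abstract refers to can be illustrated with PageRank written as repeated map and reduce rounds. The following is a minimal single-process sketch in Python, not the thesis's implementation: the function names, the damping factor of 0.85, the fixed iteration count, and the toy graph are all assumptions made for illustration, and every page is assumed to have at least one outgoing link. It shows one commonly cited reason iterative jobs are costly on plain MapReduce: each pass needs the unchanged link structure again.

```python
from collections import defaultdict


def map_phase(graph, ranks):
    """Emit (target, contribution) pairs, one per outgoing link."""
    for page, links in graph.items():
        for target in links:
            yield target, ranks[page] / len(links)


def reduce_phase(pairs, pages, damping=0.85):
    """Sum the contributions per page and apply the damping factor."""
    sums = defaultdict(float)
    for target, contribution in pairs:
        sums[target] += contribution
    n = len(pages)
    return {p: (1 - damping) / n + damping * sums[p] for p in pages}


def pagerank(graph, iterations=20):
    """Run a fixed number of map/reduce rounds (hypothetical driver loop)."""
    ranks = {p: 1.0 / len(graph) for p in graph}
    for _ in range(iterations):
        # Each round corresponds to one MapReduce job. The link structure
        # in `graph` is loop-invariant, yet every round reads it again.
        ranks = reduce_phase(map_phase(graph, ranks), graph.keys())
    return ranks


# Toy graph (assumption): every page has at least one outgoing link.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
```

    In this sketch the driver loop runs in one process; on a real cluster, each round would be submitted as a separate job, which is where per-iteration startup and data-reloading costs arise.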

    CHAPTER 1. INTRODUCTION
    CHAPTER 2. BACKGROUNDS AND RELATED WORKS
      2.1 BACKGROUNDS
        2.1.1 MapReduce
        2.1.2 Hadoop Architecture
      2.2 RELATED WORKS
        2.2.1 Iterative Hadoop Frameworks
        2.2.2 Hierarchical Hadoop Frameworks
    CHAPTER 3. SYSTEM DESIGN
      3.1 SYSTEM OVERVIEW
      3.2 SYSTEM ARCHITECTURE
      3.3 DESIGN ISSUE
    CHAPTER 4. IMPLEMENTATION
      4.1 JOB DISPATCHER
      4.2 SYNCHRONIZER MECHANISM
      4.3 COMBINER ISSUE
      4.4 REGION CLOUD EARLY START
        4.4.1 Early-Start Execution
        4.4.2 Early-Start Transmission
    CHAPTER 5. PERFORMANCE EVALUATION
      5.1 EXPERIMENTAL ENVIRONMENT & SETUP
        5.1.1 Environment
        5.1.2 Applications
        5.1.3 Dataset
      5.2 PERFORMANCE
        5.2.1 Combiner Comparison
        5.2.2 Early-Start Comparison
    CHAPTER 6. CONCLUSION AND FUTURE WORK
    REFERENCES


    Full-text availability: on campus, public from 2018-08-30
    Off campus, public from 2018-08-30