
Graduate student: 黃子恩 (Huang, Tzu-En)
Thesis title: iFedMR: A Federated Hadoop System for Iterative MapReduce Applications (iFedMR:支持迭代MapReduce應用的聯邦式Hadoop系統)
Advisor: 謝錫堃 (Shieh, Ce-Kuen)
Co-advisor: 張志標 (Chang, Jyh-Biau)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of publication: 2016
Academic year of graduation: 104 (ROC calendar)
Language: English
Number of pages: 39
Keywords: Cross-region Computing, Iterative MapReduce Framework, Data Partitioning
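  • In cross-cluster big-data computing systems, supporting iterative MapReduce computation (machine learning, data mining, and so on) is an important issue. The main approach today is to transfer and collect the raw data directly into a single cluster and then run the iterative MapReduce computation there. This approach has several problems. First, computing on only one cluster wastes the resources of the idle clusters. Second, there is data privacy: much raw data may not be freely transferred to another cluster before it has been processed.
    In this study, we extend our earlier work FedMR, which can run non-iterative MapReduce computations but does not support iterative ones. We propose iFedMR, a system that supports cross-region iterative MapReduce computation. In addition, to shorten the overall running time of iterative MapReduce, iFedMR adjusts the data partitioning strategy used when data is passed between Map and Reduce. In a cross-cluster environment, iFedMR balances the data partition based on the file transfer speed between clusters and the computing speed of each cluster. Experimental results show that, compared with the default even-partition strategy, iFedMR reduces computation time by up to 18%.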

    In data-intensive cross-region computing systems, supporting iterative applications such as data mining and machine learning is an emerging and important challenge. The dominant approach is to aggregate all the data into a single cluster and then run the iterative MapReduce application there. However, this wastes the resources of the remaining clusters, and concentrating data in a single cluster may raise privacy concerns.
    In this paper, we extend our previous work FedMR, a cross-region computing framework that coordinates MapReduce tasks executed on multiple clusters and aggregates the results into a single cloud. For iterative applications in FedMR, however, the outcomes must be manually redistributed before the next run. To better support repetitive computing, we present iFedMR, a system that supports cross-region iterative MapReduce applications. Moreover, iFedMR reduces execution time by adjusting the strategy used in the intermediate data partitioning phase. In a heterogeneous inter-cluster environment, iFedMR balances the partition of intermediate data assigned to each cluster, taking into account the transfer speed between clusters and the computing speed of each cluster. An evaluation across three clusters in a simulated network environment shows that iFedMR lowers execution time by up to 18% compared to the default approach.
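    The idea of balancing intermediate data by transfer speed and computing speed can be illustrated with a small Python sketch. This is not the thesis's actual partitioner, whose cost model is not given in this record. It assumes a hypothetical model in which a cluster first receives its share at a fixed transfer rate and then processes it at a fixed compute rate; the names balanced_shares, even_shares, and finish_time, and all rates below, are made up for illustration.

```python
def finish_time(share, transfer_rate, compute_rate):
    """Estimated seconds for one cluster to receive and then process `share` bytes.

    Assumed (hypothetical) cost model: transfer and compute happen one after
    the other, each at a constant rate in bytes per second.
    """
    return share / transfer_rate + share / compute_rate


def even_shares(total_bytes, clusters):
    """Baseline: give every cluster the same amount of intermediate data."""
    return {c: total_bytes / len(clusters) for c in clusters}


def balanced_shares(total_bytes, transfer_rate, compute_rate):
    """Split `total_bytes` so that all clusters have the same estimated finish time.

    Under the assumed model, cluster c handles data at an effective rate of
    1 / (1/transfer_rate[c] + 1/compute_rate[c]); shares proportional to that
    rate equalize the finish times.
    """
    effective = {}
    for c in transfer_rate:
        if transfer_rate[c] <= 0 or compute_rate[c] <= 0:
            raise ValueError(f"rates for cluster {c!r} must be positive")
        effective[c] = 1.0 / (1.0 / transfer_rate[c] + 1.0 / compute_rate[c])
    total_rate = sum(effective.values())
    return {c: total_bytes * r / total_rate for c, r in effective.items()}


if __name__ == "__main__":
    # Made-up rates (bytes per second) for three clusters.
    transfer = {"A": 100.0, "B": 50.0, "C": 25.0}
    compute = {"A": 100.0, "B": 50.0, "C": 25.0}
    total = 875.0

    for name, shares in (
        ("even", even_shares(total, list(transfer))),
        ("balanced", balanced_shares(total, transfer, compute)),
    ):
        # The slowest cluster determines when the next iteration can start.
        makespan = max(
            finish_time(shares[c], transfer[c], compute[c]) for c in shares
        )
        print(name, shares, "makespan:", round(makespan, 3))
```

    In an iterative job the slowest cluster delays every iteration, so under this simplified model an even split is limited by the slowest cluster, while the proportional split lets all clusters finish together.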

    Chapter 1: Introduction
    Chapter 2: Backgrounds & Related Works
        2.1 Backgrounds
            2.1.1 Federated HDFS
            2.1.2 MapReduce Programming Model
        2.2 Related Works
            2.2.1 Federated MapReduce
            2.2.2 Optimizing Grouped Aggregation in Geo-Distributed Streaming Analytics [6]
            2.2.3 Low Latency Geo-Distributed Data Analytics [7]
            2.2.4 Joint Scheduling of Data and Computation in Geo-Distributed Cloud Systems [8]
    Chapter 3: System Design
        3.1 FedJobManager
            3.1.1 XML File Configuration
        3.2 CloudManager
        3.3 Proxy Mapper/Proxy Reducer
            3.3.1 Direct Remote HDFS Transfer
            3.3.2 Intermediate Data Serialization
            3.3.3 User-defined Key/Value Format
        3.4 Partitioner
            3.4.1 Adaptive Partitioner
        3.5 InfoCollector
        3.6 Iterative Federated Map Reduce Job
    Chapter 4: Partition Strategy
        4.1 Considering Map Computation
    Chapter 5: Experiment
        5.1 Without Considering Map Computation
            5.1.1 K-means
        5.2 Partition Strategy with Considering Map Computation
            5.2.1 PageRank
            5.2.2 SimRank
            5.2.3 Genetic Algorithm
    Chapter 6: Conclusion & Future Work
    Chapter 7: References

    [1] Hsu et al., "A Similarity-based P2P Botnet Detection Algorithm for Inter-Domain NetFlow Analysis." Unpublished manuscript, National Cheng Kung University.
    [2] NetFlow. http://www.cisco.com/c/dam/en/us/products/collateral/ios-nx-os-software/ios-netflow/prod_case_study0900aecd80311fc2.pdf
    [3] Chun-Yu Wang et al., "Federated MapReduce to Transparently Run Applications on Multicluster Environment." 2014 IEEE International Congress on Big Data.
    [4] Liu et al., "FedHDFS+." Unpublished manuscript, National Cheng Kung University.
    [5] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters." Communications of the ACM, vol. 51, pp. 107-113, 2008.
    [6] Benjamin Heintz et al., "Optimizing Grouped Aggregation in Geo-Distributed Streaming Analytics." Tsinghua Science and Technology, ISSN 1007-0214, 01/10, vol. 21, no. 2, pp. 125-135, April 2016.
    [7] Qifan Pu et al., "Low Latency Geo-distributed Data Analytics." SIGCOMM '15: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, pp. 421-434.
    [8] Lingyan Yin et al., "Joint Scheduling of Data and Computation in Geo-Distributed Cloud Systems." 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 657-666.
    [9] Apache Hadoop. hadoop.apache.org
    [10] Weizhong Zhao et al., "Parallel K-Means Clustering Based on MapReduce." CloudCom 2009, LNCS 5931, pp. 674-679, 2009.
    [11] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web." Technical report, Stanford University, 1998.
    [12] Bahman Bahmani, Kaushik Chakrabarti, and Dong Xin, "Fast Personalized PageRank on MapReduce." SIGMOD '11, June 12-16, 2011, Athens, Greece.
    [13] Glen Jeh and Jennifer Widom, "SimRank: A Measure of Structural-Context Similarity." KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 538-543.
    [14] Y. Luo, Z. (G.) Guo, Y. Sun, B. Plale, J. Qiu, and W. Li, "A Hierarchical Framework for Cross-Domain MapReduce Execution." Proceedings of the Second International Workshop on Emerging Computational Methods for the Life Sciences, June 2011.
    [15] K. F. Man, K. S. Tang, and S. Kwong, "Genetic Algorithms: Concepts and Applications." IEEE Transactions on Industrial Electronics, vol. 43, no. 5, pp. 519-534, October 1996.

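    On campus: public from 2021-07-01
    Off campus: not public
    The electronic thesis has not yet been authorized for public release; for the print copy, please check the library catalog.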