| 研究生: |
蔡孟軒 Tsai, Meng-Hsuan |
|---|---|
| 論文名稱: |
在大數據平台上運用K-core到Graphx函式庫去尋找節點群的中心點 Using K-core Decomposition to find cluster center for K-mean algorithm in Graphx on the Spark |
| 指導教授: |
鄭憲宗
Cheng, Sheng-Tzong |
| 學位類別: |
碩士 Master |
| 系所名稱: |
電機資訊學院 - 資訊工程學系 Department of Computer Science and Information Engineering |
| 論文出版年: | 2016 |
| 畢業學年度: | 104 |
| 語文別: | 英文 |
| 論文頁數: | 49 |
| 中文關鍵詞: | 雲端運算 、Graphx 、Spark 、k核心分解 、基於圖之k平均分群演算法 |
| 外文關鍵詞: | Cloud computing, Graphx, Spark, k-core decomposition, graph-based k-means |
| 相關次數: | 點閱:116 下載:0 |
| 分享至: |
| 查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
近年來巨量資料分析越來越重要,以往常見的分散式平台Hadoop與MapReduce架構在資料分析方面有其不完備的地方:社群網路中資料都以圖為數據結構的分析有比較多的關聯性,以Hadoop的平行運算方式執行的效率不好且不夠直觀,且為了處理大量資料而大量使用硬碟溝通的方式也對執行效率造成影響。
近年以In-memory架構的Spark於2014的發表中它透過將資料存放在記憶體中重複運算解決Hadoop對硬體過度存取而導致運算時間較長的問題,在使用與Hadoop相同的環境和實驗中以更短時間完成工作。而Graphx是Spark提供的一個處理圖形的介面,它可以讓圖處理在Spark上更加精簡並且有效率。
本論文提出透過k核心分解作為尋找群中心點的方法改良圖形結構的分群演算法,由於k核心分解是尋找出一個最緊密的一個子群,透過這樣的性質從中尋找出一個最為可能的中心點。我們將其實作到Graphx並利用此演算法來得到更好的分群結果。
With Big Data analysis became more important in recent years, the common platform such as Hadoop and MapReduce distributed architecture has its incompleteness place in data analysis. In social network analysis, many data are the graph structure, which has more relevance in structure, the efficiency of data-parallel computing likes Hadoop is not good and not intuitive, and in order to handle large amounts of data, using many data access also impact on efficiency.
In 2014, In-memory Computing architecture called Spark. Through reusing of the data in memory to solve the long computation time issue for hardware over access in Hadoop, Spark finished the task with shorter time under the same environment with Hadoop using. Graphx is a Spark API that provided a graphics and makes graph computation simplify and efficient. This study presents an improved graph clustering method by using k-core decomposition, which is a popular algorithm in community detection to find the center in each cluster. We implement the clustering algorithm with Graphx and use this to get better results.
[1] “Hadoop” 2014, http://hadoop.apache.org/ [Jun. 30, 2016].
[2] “Google MapReduce” 2011, http://research.google.com/archive/mapreduce.html [Jun. 30, 2016].
[3] M. Zaharia, et al. “Spark: cluster computing with working sets” HotCloud10 (2010): 10-10.
[4] M. Zaharia et al., “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing,” in NSDI, 2012.
[5] “GraphX: A Resilient Distributed Graph System on Spark”, AMPLab, EECS, UC Berkeley 2013
[6] “Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud”, 2012
[7] Simmhan, Yogesh, et al. “Goffish: A sub-graph centric framework for large-scale graph analytics” European Conference on Parallel Processing. Springer International Publishing, 2014.
[8] Malewicz, Grzegorz, et al. “Pregel: a system for large-scale graph processing” Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 2010.
[9] Shivaram Venkataraman, Erik Bodzsar, Indrajit Roy, Alvin AuYoung, and Robert S Schreiber. “Presto: distributed machine learning and graph processing with sparse matrices” In Proceedings of the 8th ACM European Conference on Computer Systems, 2013.
[10] A. Montresor, F. D. Pellegrini, and D. Miorandi, “Distributed kcore decomposition” IEEE Trans. Parallel Distrib. Syst., vol. 24, no. 2, pp. 288–300, Feb. 2013.
[11] V. Batagelj and M. Zaversnik, “An o(m) algorithm for cores decomposition of networks,” CoRR, vol. cs.DS/0310049, 2003.
[12] “Stanford Large Network Dataset Collection” http://snap.stanford.edu/data/index.html [Jul, 6, 2016]
[13] J. Hartigan and M. Wong. “A k-means clustering algorithm.” Applied Statistics, 28:100–108, 1979.
[14] http://stackoverflow.com/questions/23700124/how-to-get-sssp-actual-path-by-apache-spark-graphx [Jul. 30, 2016].
[15] R. Bellman, “On a Routing Problem,” Quarterly of Applied Mathematics, Vol. 16, No. 1, 1958, pp. 87-90.
[16] https://github.com/ankurdave/spark/blob/vldb/graphx/src/main/scala/org/apache/spark/graphx/lib/KCoreFast.scala [Jul. 30, 2016].
[17] D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson, Behavioral Ecology and Sociobiology 54, 396-405 (2003).
[18] W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of Anthropological Research 33, 452-473 (1977).
[19] J. McAuley and J. Leskovec. “Learning to Discover Social Circles in Ego Networks.” NIPS, 2012.
[20] J. Leskovec, J. Kleinberg and C. Faloutsos. “Graph Evolution: Densification and Shrinking Diameters.” ACM Transactions on Knowledge Discovery from Data (ACM TKDD), 1(1), 2007.
[21] M. Ripeanu and I. Foster and A. Iamnitchi. “Mapping the Gnutella Network: Properties of Large-Scale Peer-to-Peer Systems and Implications for System Design.” IEEE Internet Computing Journal, 2002.
[22] M. NEWMAN. “Modularity and community structure in networks.” Proc. Natl. Acad. Sci. USA, 2006, 103(23): 8577-8582.
[23] M. NEWMAN, M. GIRVAN. “Finding and evaluating community structure in networks.” Phys. Rev. E, 2004, 69(2): 026113.
[24] D. Lusseau and M. E. J. Newman. “Identifying the role that individual animals play in their social network.” PROC.R.SOC.LONDON B, 271:S477, 2004.
校內:2018-09-01公開