| Graduate student: | 陳建廷 Chen, Jian-Ting |
|---|---|
| Thesis title: | 大數據即時平台上的重贅節省資料傳輸方法 Fast Deduplication Data Transmission Scheme on a Big Data Real-time Platform |
| Advisor: | 鄭憲宗 Cheng, Sheng-Tzong |
| Degree: | 碩士 Master |
| Department: | 電機資訊學院 - 資訊工程學系 Department of Computer Science and Information Engineering |
| Year of publication: | 2016 |
| Academic year of graduation: | 104 |
| Language: | English |
| Number of pages: | 49 |
| Keywords (Chinese): | 大數據, 資料重複刪除技術, In-memory Computing, Spark |
| Keywords (English): | Big Data, Deduplication, In-memory Computing, Spark |
With the arrival of the big data era, exploiting and processing massive amounts of data efficiently has become a major challenge. Processing more data in less time, or even in real time, is beyond what the earlier distributed computing framework MapReduce can deliver. In-memory computing (IMC) was proposed to address this limitation of Hadoop MapReduce.
As its name suggests, IMC performs computation in memory, which avoids the cost of MapReduce's excessive disk access and allows distributed iterative computations to run efficiently. However, distributed IMC still cannot escape one bottleneck: network bandwidth, which limits the speed of both receiving data from the source and distributing it to each node. According to observation, some data from sensor devices is duplicated because of temporal or spatial dependence. Deduplication, which eliminates the duplicate portions of the data, is therefore a good solution for improving data transmission efficiency.
This study proposes a deduplication data transmission scheme to optimize Spark Streaming, a distributed real-time IMC platform. The scheme uses deduplication to eliminate possible duplicate blocks from the source data. It is expected to reduce redundant data transmission so that, under the same bandwidth, more data can be transmitted and the throughput of Spark Streaming improves.
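The idea of eliminating duplicate blocks before transmission can be illustrated with a minimal Python sketch. It is not the scheme implemented in the thesis, whose details the abstract does not give. The fixed-size chunking, the 64-byte chunk size, the choice of SHA-1 as fingerprint, and the names `encode`, `decode`, and `wire_size` are all assumptions made for illustration.

```python
import hashlib

CHUNK_SIZE = 64  # bytes per chunk; hypothetical value chosen for illustration


def encode(data: bytes, seen: set) -> list:
    """Sender side: split data into fixed-size chunks and replace any chunk
    whose fingerprint was already sent with the fingerprint alone."""
    messages = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha1(chunk).digest()  # 20-byte fingerprint
        if digest in seen:
            messages.append(("ref", digest))   # duplicate: send reference only
        else:
            seen.add(digest)
            messages.append(("raw", chunk))    # first occurrence: send the bytes
    return messages


def decode(messages: list, store: dict) -> bytes:
    """Receiver side: rebuild the stream, caching raw chunks by fingerprint
    so that later references can be resolved."""
    out = bytearray()
    for kind, payload in messages:
        if kind == "raw":
            store[hashlib.sha1(payload).digest()] = payload
            out += payload
        else:
            out += store[payload]
    return bytes(out)


def wire_size(messages: list) -> int:
    """Payload bytes that would cross the network (message framing ignored)."""
    return sum(len(payload) for _, payload in messages)
```

In this sketch a duplicate chunk costs 20 bytes on the wire instead of 64, so the saving depends on the chunk being larger than its fingerprint and on how often chunks repeat. Fixed-size chunking only detects duplicates that fall on chunk boundaries, which is a limitation of the sketch.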