Graduate student: Huang, Yi-Wei (黃奕崴)
Thesis title: Reduction Scheme for Sensor-Data Transmission on a Big Data Streaming Platform (大數據串流平台上降低感測資料傳輸的方法)
Advisor: Cheng, Sheng-Tzong (鄭憲宗)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2017
Academic year of graduation: 105 (ROC calendar)
Language: English
Number of pages: 52
Keywords (translated from Chinese): Big Data, Dynamic Transmission, Data Compression Technique, In-memory Computing, Spark Streaming
Keywords (English): Big Data, In-memory Computing, Spark Streaming, Resilient Distributed Datasets, Data Compression Technique
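  • With advances in sensing technology, sensors deployed in many environments generate massive amounts of data, and making good use of this data has become a new business model. The challenge is to process more data in a short time, and even to support real-time analysis. MapReduce, the established distributed computing architecture, no longer meets real-time requirements in some cases, such as machine learning or multi-stage iterative computation, mainly because it lacks one important element: efficient resource sharing. The concept of in-memory computing (IMC) was proposed to solve this kind of problem.
    As the name suggests, IMC keeps intermediate results in memory instead of frequently accessing the hard disk, which removes the disk I/O performance bottleneck. A classic recent application is Apache Spark, an open-source cluster computing framework that, as the data volume grows, can run tens of times faster than MapReduce. However, it still cannot resolve one bottleneck: bandwidth. Sensor data arrives from many nodes, and sensors are constrained by resources such as memory, energy, and bandwidth. According to our observations, sensor data contains similar sequences because of spatial or temporal dependence. Data compression is therefore a good solution: a smaller amount of data represents a larger one, which eases the sensors' resource constraints and also improves Spark's data utilization.
    This study proposes a method of reducing sensor-data transmission to optimize Spark Streaming, an IMC parallel stream-computing platform. Preprocessing increases the similarity of the data so that the compression technique can replace longer patterns. In addition, compression is combined with dynamic transmission to achieve real-time operation together with a high compression ratio. This reduces the energy that sensors consume and extends their lifetime. At the same time, more data can be transmitted over the same bandwidth, which increases the processing capacity of the computing platform.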

    Recent advances in sensor technology have led to the availability of a multitude of sensors, e.g., for sound, luminosity, and humidity. The huge volume of raw data they produce is difficult to exploit and process efficiently. Hadoop MapReduce has been used to address this issue, but it handles operations that require iteration inefficiently. Hence, the concept of in-memory computing (IMC) was proposed to resolve the Hadoop I/O bottleneck.
    In in-memory computing, data is processed in parallel in random access memory (RAM) instead of on slow disk drives. IMC techniques make it practical to train on patterns and analyze large data sets frequently. However, the IMC platform does not provide an effective transmission-reduction scheme for real-time systems, which may limit applications such as wireless sensor networks. Transmitting all of the data from each sensor node may be impractical because nodes have limited resources such as CPU, memory, and power. Compressing data before sending it is an effective way to make good use of a sensor node's limited power supply and to extend the sensor's lifetime. According to our observations, most sensor data exhibits similar patterns due to temporal and spatial dependence. We can therefore exploit these characteristics to improve compression efficiency.
    This study presents an effective transmission-reduction scheme on Spark Streaming, a distributed real-time IMC platform used to collect data in real time. We describe the design and implementation of the whole system, which provides a high compression ratio on small batches of data from the source. The scheme is expected to reduce data transmission with only a small added delay in a soft real-time system.
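
The idea of raising data similarity before compression can be illustrated with a minimal Python sketch. The thesis outline names the Lempel-Ziv-Welch (LZW) algorithm as its compression background, but this record does not describe the actual preprocessing step. The delta encoding used here (`delta_bytes` and `undelta_bytes`, both hypothetical helpers) is therefore an assumption chosen only for illustration, as is the restriction to integer readings in the range 0 to 255. It is not the author's implementation.

```python
def lzw_encode(data: bytes) -> list[int]:
    """Encode bytes into a list of LZW dictionary codes."""
    # Start with one dictionary entry per possible byte value.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    codes = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            # Keep extending the current phrase while it is known.
            current = candidate
        else:
            # Emit the longest known phrase and learn the new one.
            codes.append(table[current])
            table[candidate] = next_code
            next_code += 1
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def lzw_decode(codes: list[int]) -> bytes:
    """Rebuild the original bytes from a list of LZW codes."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # The code refers to the phrase that is being defined right now.
            entry = previous + previous[:1]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        output.append(entry)
        table[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(output)


def delta_bytes(readings: list[int]) -> bytes:
    """Hypothetical preprocessing: store differences between readings (mod 256).

    Assumes every reading is an integer in the range 0..255.
    """
    previous = 0
    out = bytearray()
    for reading in readings:
        out.append((reading - previous) % 256)
        previous = reading
    return bytes(out)


def undelta_bytes(data: bytes) -> list[int]:
    """Invert delta_bytes by accumulating the differences (mod 256)."""
    previous = 0
    readings = []
    for delta in data:
        previous = (previous + delta) % 256
        readings.append(previous)
    return readings


if __name__ == "__main__":
    # A steadily rising series, e.g. a slowly warming temperature sensor.
    readings = list(range(20, 120))
    raw_codes = lzw_encode(bytes(readings))
    delta_codes = lzw_encode(delta_bytes(readings))
    print(len(raw_codes), len(delta_codes))
```

On a steadily rising series like this one, the raw bytes are all distinct, so LZW finds nothing to reuse. The delta-encoded stream is a long run of identical values that collapses into a small number of dictionary codes.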

    Abstract (Chinese)
    Abstract
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1. Introduction and Motivation
        1.1. Introduction
        1.2. Motivation
        1.3. Thesis Overview
    Chapter 2. Backgrounds
        2.1. Spark
            2.1.1. Spark Core
            2.1.2. Spark Streaming
        2.2. Message Queue Telemetry Transport Protocol
        2.3. Lempel-Ziv-Welch Algorithm
            2.3.1. Encode
            2.3.2. Decode
    Chapter 3. System Design
        3.1. Problem Description
        3.2. System Design
            3.2.1. System Architecture
            3.2.2. Preprocess
            3.2.3. Mapper
            3.2.4. Encoder
            3.2.5. Communication
            3.2.6. Transformation
            3.2.7. Decoder & Re-Constructor
    Chapter 4. Implementation and Experiment
        4.1. Experiment Environment and Settings
        4.2. Implementation
        4.3. Experiment Result
            4.3.1. Scheme Performance
            4.3.2. Dictionary Code Length
            4.3.3. Dictionary Rebuild
            4.3.4. Output Time Delay Formula
    Chapter 5. Conclusion and Future Work
    References


    Full-text availability: on campus from 2020-08-01; off campus from 2020-08-01.