Author: 黃聖維 (Huang, Sheng-Wei)
Thesis title: 大數據平台效能改善之研究 (Study on Performance Improvement for Big Data Platforms)
Advisor: 謝錫堃 (Shieh, Ce-Kuen)
Degree: Doctor
Department: College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of publication: 2016
Academic year of graduation: 104 (ROC calendar)
Language: English
Number of pages: 57
Keywords: Big Data, Batch Processing, Real-time Processing, Apache Storm, Apache Hadoop, Apache HBase
  • Big Data applications now pervade daily life, from Google's services and product recommendations on shopping sites to Industry 4.0, which the government is actively promoting. Big Data processing falls into two categories: real-time processing and batch processing. Storm is one of the most widely used and studied real-time processing platforms, and Apache Hadoop is the best-known batch processing platform. Both kinds of platform use distributed computing techniques to process volumes of data that traditional systems cannot handle. Although these platforms are widely used, their system performance still has room for improvement. This thesis studies the performance of two components: the scaling mechanism of Storm, and HBase, the database system built on the Hadoop file system. For Storm, we examine the shortcomings of its scaling mechanism and propose a scaling method that operates at the granularity of a topology. For HBase, we propose a data analysis system that performs aggregation in the reduce phase (reduce-phase aggregation) and adopts an inverted data model to improve query performance. The experimental results show that the proposed methods effectively address the corresponding performance problems.

    Contents
    Illustrations
    Chapter 1 Introduction
    Chapter 2 Background and Related works
      2.1 Real-time processing system
        2.1.1 Storm
        2.1.2 The rebalance command: Scalability of Storm
        2.1.3 Related works of performance improvements for Storm
      2.2 Batch processing systems
        2.2.1 MapReduce Programming Model
        2.2.2 Apache Hadoop
        2.2.3 Apache HBase
        2.2.4 Target application: OLAP operation processing
    Chapter 3 Topology-based Scaling Mechanism for Storm
      3.1 Real-time Processing System Scaling Mechanism
        3.1.1 System Overview
        3.1.2 System Operation
      3.2 Implementation of Real-time Processing System
        3.2.1 Integration of Storm and Kafka
        3.2.2 Monitor virtual topologies at run time
        3.2.3 Create new topics in Kafka / Start new virtual topology in the cluster
        3.2.4 Distribute data to Kafka topics
        3.2.5 Add worker nodes to the cluster
    Chapter 4 Reduce-Phase Aggregation with Inverted Data Model on HBase
      4.1 Batch System Design for Multidimensional Query
        4.1.1 Data Model Constructor
        4.1.2 Query Analyzer
        4.1.3 Algebra Execution Algorithm
      4.2 Implementation of Batch Processing System
        4.2.1 Query Analyzer
        4.2.2 Data Model Constructor
        4.2.3 Algebra Execution Algorithms
    Chapter 5 Experimental Results and Discussion
      5.1 Experimental Setup
      5.2 Real-time Processing System
        5.2.1 Different numbers of virtual topologies
        5.2.2 Dynamic scaling results
        5.2.3 Comparison with storm rebalance
        5.2.4 The topology substitution method
      5.3 Batch Processing System
        5.3.1 Experimental results
        5.3.2 The Overhead of Creating Inverted Data Model
    Chapter 6 Conclusions and Future Works
    References

    [1] Susan Gunelius, “The Data Explosion in 2014 Minute by Minute,” http://aci.info/2014/07/12/the-data-explosion-in-2014-minute-by-minute-infographic/
    [2] Apache Hadoop, http://hadoop.apache.org/
    [3] Jeffrey Dean and Sanjay Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
    [4] Apache Storm, https://storm.apache.org
    [5] Dhruba Borthakur, “HDFS architecture guide,” Hadoop Apache Project, http://hadoop.apache.org/common/docs/current/hdfs_design.pdf (2008).
    [6] Apache HBase. Available from: http://hbase.apache.org/
    [7] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes and Robert E. Gruber, “Bigtable: A distributed storage system for structured data,” ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 4.
    [8] Chaudhuri, Surajit, and Umeshwar Dayal. “An overview of data warehousing and OLAP technology,” ACM SIGMOD record 26.1 (1997): 65-74.
    [9] Han, Jiawei, and Micheline Kamber. “Data Mining, Southeast Asia Edition: Concepts and Techniques,” Morgan Kaufmann, 2006.
    [10] Jing-hua, Zhao, Song Ai-mei, and Song Ai-bo, “OLAP Aggregation Based on Dimension-oriented Storage,” Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International. IEEE, 2012.
    [11] Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang and Zhiwei Xu, “RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems,” Data Engineering (ICDE), 2011 IEEE 27th International Conference on. IEEE, 2011.
    [12] P. Taylor Goetz and Brian O'Neill, “Storm Blueprints: Patterns for Distributed Real-time Computation,” Packt Publishing, 2014.
    [13] Guaranteeing message processing (Storm). Available from: http://storm.apache.org/documentation/Guaranteeing-message-processing.html
    [14] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu and Raghotham Murthy, “Hive – A Petabyte Scale Data Warehouse Using Hadoop,” Data Engineering (ICDE), 2010 IEEE 26th International Conference on. IEEE, 2010.
    [15] Apache ZooKeeper. Available from: http://zookeeper.apache.org/
    [16] L. Aniello, R. Baldoni and L. Querzoni, “Adaptive online scheduling in Storm,” in Proceedings of ACM DEBS’2013.
    [17] Jielong Xu, Zhenhua Chen, Jian Tang and Sen Su, “T-Storm: Traffic-aware Online Scheduling in Storm,” IEEE 34th International Conference on Distributed Computing Systems, 2014.
    [18] Ivan Bedini, Sherif Sakr, Bart Theeten, Alessandra Sala and Peter Cogan, “Modeling performance of a parallel streaming engine: bridging theory and costs,” the 4th ACM/SPEC International Conference on Performance Engineering, pp. 173-184, 2013.
    [19] O'Neil, Patrick, et al. “The star schema benchmark,” 2009.
    [20] Transaction Processing Performance Council, “TPC-H benchmark specification,” Published at http://www.tpc.org/hspec.html (2008).
    [21] Jay Kreps, Neha Narkhede, and Jun Rao, “Kafka: a distributed messaging system for log processing,” ACM SIGMOD Workshop on Networking Meets Databases, Athens, Greece, 2011.
    [22] Apache Software Foundation, Thrift. Available from: http://thrift.apache.org/
    [23] Owen O’Malley, Kan Zhang, Sanjay Radia, Ram Marti, and Christopher Harrell, “Hadoop Security Design,” Technical Report, 2009.

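    Full-text availability: on campus, open from 2021-09-01; off campus, not open.
    The electronic thesis has not yet been authorized for public release; for the print copy, please consult the library catalog.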