National Cheng Kung University Master's and Doctoral Thesis System

Author:	黃聖維 Huang, Sheng-Wei
Thesis title:	大數據平台效能改善之研究 Study on Performance Improvement for Big Data Platforms
Advisor:	謝錫堃 Shieh, Ce-Kuen
Degree:	Doctor
Department:	College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of publication:	2016
Academic year of graduation:	104 (ROC calendar)
Language:	English
Number of pages:	57
Keywords:	Big Data, Batch Processing, Real-time Processing, Apache Storm, Apache Hadoop, Apache HBase

Big Data applications are now part of everyday life: Google's services, product recommendations on shopping websites, and the Industry 4.0 initiative promoted by the government all depend directly on Big Data. Big Data processing falls into two categories: real-time processing and batch processing. Storm is one of the most widely used and studied real-time processing platforms, and Apache Hadoop is the best-known batch processing platform. Both kinds of platform rely on distributed computing to process volumes of data that traditional systems cannot handle. Although these platforms are widely deployed, their performance still leaves room for improvement. This thesis studies two aspects of that performance: the scaling mechanism of Storm, and HBase, the database system that runs on the Hadoop file system. For Storm, we examine the shortcomings of its scaling mechanism and propose a scaling method that operates at the granularity of whole topologies. For HBase, we propose a data analysis system that performs reduce-phase aggregation and adopts an inverted data model to improve query performance. The experimental results show that the proposed methods effectively alleviate the corresponding performance problems.

Contents
Illustrations
Chapter 1 Introduction
Chapter 2 Background and Related works
1 Real-time processing system
1.1 Storm
1.2 The rebalance command: Scalability of Storm
1.3 Related works of performance improvements for Storm
2 Batch processing systems
2.1 MapReduce Programming Model
2.2 Apache Hadoop
2.3 Apache HBase
2.4 Target application: OLAP operation processing
Chapter 3 Topology-based Scaling Mechanism for Storm
1 Real-time Processing System Scaling Mechanism
1.1 System Overview
1.2 System Operation
2 Implementation of Real-time Processing System
2.1 Integration of Storm and Kafka
2.2 Monitor virtual topologies at run time
2.3 Create new topics in Kafka / Start new virtual topology in the cluster
2.4 Distribute data to Kafka topics
2.5 Add worker nodes to the cluster
Chapter 4 Reduce-Phase Aggregation with Inverted Data Model on HBase
1 Batch System Design for Multidimensional Query
1.1 Data Model Constructor
1.2 Query Analyzer
1.3 Algebra Execution Algorithm
2 Implementation of Batch Processing System
2.1 Query Analyzer
2.2 Data Model Constructor
2.3 Algebra Execution Algorithms
Chapter 5 Experimental Results and Discussion
1 Experimental Setup
2 Real-time Processing System
2.1 Different numbers of virtual topologies
2.2 Dynamic scaling results
2.3 Comparison with storm rebalance
2.4 The topology substitution method
3 Batch Processing System
3.1 Experimental results
3.2 The Overhead of Creating Inverted Data Model
Chapter 6 Conclusions and Future Works
References
                                    


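Off-campus access: not available. Neither the electronic thesis nor the print copy has been authorized for public release.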
