| Graduate student: | 黃聖維 Huang, Sheng-Wei |
|---|---|
| Thesis title: | 大數據平台效能改善之研究 Study on Performance Improvement for Big Data Platforms |
| Advisor: | 謝錫堃 Shieh, Ce-Kuen |
| Degree: | Doctor |
| Department: | College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering |
| Year of publication: | 2016 |
| Academic year of graduation: | 104 (ROC calendar) |
| Language: | English |
| Number of pages: | 57 |
| Keywords: | Big Data, Batch Processing, Real-time Processing, Apache Storm, Apache Hadoop, Apache HBase |
Big Data applications now pervade daily life, from Google's services and product recommendations on shopping sites to the Industry 4.0 initiative actively promoted by the government. Big Data processing falls into two categories: real-time processing and batch processing. Among real-time platforms, Storm is one of the most widely used and studied, while Apache Hadoop is the best-known batch processing platform. Both kinds of platform use distributed computing techniques to process volumes of data that traditional systems cannot handle. Although these platforms are widely deployed, their system performance still has room for improvement. This thesis studies the performance of two components: the scaling mechanism of Storm, and HBase, the database system built on the Hadoop file system. For Storm, we examine the shortcomings of its scaling mechanism and propose a scaling method that operates at the granularity of a topology. For HBase, we propose a data analysis system that performs reduce-phase aggregation and adopts an inverted data model to improve query performance. Experimental results show that the proposed methods effectively improve performance in both cases.
On-campus access: publicly available from 2021-09-01