
Author: Cho, Feng-Min (卓峰民)
Thesis title: A Big Data Cleansing Approach using Representative Feature to Reduce Computation Cost (採用代表性特徵以降低運算成本之大數據清理方法)
Advisor: Shieh, Ce-Kuen (謝錫堃)
Co-advisor: Chang, Jyh-Biau (張志標)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of publication: 2018
Academic year of graduation: 106
Language: English
Pages: 39
Keywords (translated from Chinese): Big Data Processing, Data Cleansing, Distributed Systems, MapReduce Framework, NetFlow, Botnet
Keywords (English): Big Data, Data Cleansing, Hadoop, MapReduce Framework, Botnet, NetFlow
  • Big data analytics is critical for discovering potential value, but it usually requires pre-processing to reduce computation cost. Long-term, high-volume network traffic data contains many redundant and repeated network behaviors, which impose a heavy computational burden on subsequent analysis. How to prune such duplicates is therefore an important issue in data cleansing.
    In this paper, we empirically propose a big data cleansing approach that uses representative features to identify and clean up repeated behaviors in the pre-processing stage while retaining enough information to discover potential behavior patterns. Verification with our algorithm shows that the method helps analyze longer-term data and achieves better performance and results than previous filtering methods. The maximum span of NetFlow logs in an experiment grows from two weeks to two months, and the input data for one experiment grows from about 64 GB to about 217 GB. Moreover, the method has little impact on the original detection results: the intersection ratio with the original detection results can reach 99.88%.
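The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the thesis's actual method. It assumes a hypothetical flow record layout, assumes that the "representative feature" is the tuple of endpoints, destination port, and protocol, and assumes that the intersection ratio is the share of originally detected items that are still detected after cleansing. All function names are illustrative.

```python
from typing import Hashable, Iterable

# Hypothetical flow record: (src_ip, dst_ip, dst_port, protocol, byte_count).
Flow = tuple[str, str, int, str, int]


def representative_key(flow: Flow) -> Hashable:
    """Assumed representative feature: endpoints, port, and protocol."""
    src_ip, dst_ip, dst_port, protocol, _byte_count = flow
    return (src_ip, dst_ip, dst_port, protocol)


def cleanse(flows: Iterable[Flow]) -> list[Flow]:
    """Keep only the first flow seen for each representative key."""
    seen: set[Hashable] = set()
    kept: list[Flow] = []
    for flow in flows:
        key = representative_key(flow)
        if key not in seen:
            seen.add(key)
            kept.append(flow)
    return kept


def intersection_ratio(original: set[str], cleansed: set[str]) -> float:
    """Assumed metric: fraction of original detections kept after cleansing."""
    if not original:
        return 1.0
    return len(original & cleansed) / len(original)


flows = [
    ("10.0.0.1", "10.0.0.2", 6881, "UDP", 120),
    ("10.0.0.1", "10.0.0.2", 6881, "UDP", 98),   # repeated behavior
    ("10.0.0.3", "10.0.0.2", 80, "TCP", 512),
]
kept = cleanse(flows)
ratio = intersection_ratio({"a", "b", "c", "d"}, {"a", "b", "c"})
```

Here `kept` holds two flows, because the second record shares its key with the first, and `ratio` is 0.75. In a MapReduce setting the same grouping could be done by emitting the representative key as the map output key, but that mapping is likewise an assumption rather than something the record states.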

    Chapter 1: Introduction
    Chapter 2: Backgrounds and Related Work
      2.1 Backgrounds
        2.1.1 Botnet Architecture
        2.1.2 BotCluster
      2.2 Related Work
    Chapter 3: Methodology Design
      3.1 Overview
      3.2 Duplicates Definition and Problem Description
        3.2.1 Duplicates Definition
        3.2.2 Problem Description
      3.3 The Effects and Challenge of Data Cleansing Approach
      3.4 Design Principles
    Chapter 4: Implementation
      4.1 Implementation Process
      4.2 Sampling Points
    Chapter 5: Experiment
      5.1 Experimental Environment
      5.2 Network Trace
      5.3 Experiment Results
        5.3.1 Experiment 1 – Performance Comparison
        5.3.2 Experiment 2 – Results Comparison
        5.3.3 Experiment 3 – Sampling Points
        5.3.4 Experiment 4 – Concept Verification
    Chapter 6: Conclusion and Future Work
    Chapter 7: References

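    Full-text availability: on campus, open from 2023-08-01; off campus, not open.
    The electronic thesis has not yet been authorized for public release; for the print copy, consult the library catalog.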