Graduate student: 程祺鈞 (Cheng, Qi-Jun)
Thesis title: Using principal component analysis to improve the performance degradation caused by feature variance for Botcluster
Original title (Chinese): 使用主成分分析方法改善特徵變動對Botcluster所造成的性能影響
Advisor: 謝錫堃 (Shieh, Ce-Kuen)
Co-advisor: 張志標 (Chang, Jyh-Biau)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of publication: 2020
Academic year of graduation: 108 (ROC calendar)
Language: English
Number of pages: 40
Keywords (Chinese): 點對點殭屍網路, 網路流, 主成份分析, 特徵工程, 對應歸納
Keywords (English): P2P botnet, NetFlow, PCA, Feature engineering, MapReduce
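  • (Translated from the Chinese abstract.) This study applies Principal Component Analysis (PCA) to BotCluster to obtain a suitable feature set for botnet detection. In previous work, features were selected according to the Information Gain of each feature. With that feature set, BotCluster reaches about 97% precision on the 2016 dataset. When the same feature set is used on the 2020 dataset, however, precision drops noticeably. We suspect that the feature selection has become outdated. Variants and new botnets appear every year, and the key features of different botnets may differ, so the feature set needs to be updated. Information Gain can only be computed on labeled datasets, which makes it hard to update the feature set for every dataset. This thesis replaces Information Gain with PCA to obtain a suitable feature set. PCA is an unsupervised method, so it can extract features for any dataset. PCA finds the principal components of a given dataset, where each principal component is a linear combination of the original features and the components are mutually uncorrelated. Because PCA is unsupervised, it can be run directly after sessions have been extracted from any dataset. Combining PCA with BotCluster improves performance significantly. On the 2019/07 dataset, average precision rises from 77.22% to 94.22%, and Direct IP rises from 190 to 316.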

    In this thesis, we apply Principal Component Analysis (PCA) to BotCluster [1] to extract a suitable feature set for its grouping phase. In previous work, features were selected according to the Information Gain of each feature. With that feature set, BotCluster achieves about 97% precision on the 2016 dataset. However, precision drops noticeably when the same feature set is applied to the 2020 dataset. We suspect that the feature selection is out of date. Mutations and new types of botnets emerge every year [13], and the critical features may differ from one botnet variety to another [6], so the feature set must be kept up to date. Information Gain requires a labeled dataset, so recomputing the feature set with Information Gain for each new dataset is impracticable. The proposed method replaces Information Gain with PCA to obtain an appropriate feature set. PCA is an unsupervised method that extracts new features for each dataset. It finds the Principal Components (PCs) of the specified dataset, where each PC is a linear combination of the original features and the PCs are uncorrelated with each other. Because PCA is unsupervised, we can perform it directly after the session extraction phase for each dataset. Combining PCA with BotCluster improves performance notably. On the 2019/07 dataset, the average precision increases from 77.22% to 94.22%, and Direct IP increases from 190 to 316.
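
The PCA steps named in the abstract and in the chapter outline (standardizing, covariance, principal components, selection, projection) can be illustrated with a minimal single-machine sketch. It assumes NumPy and a plain matrix of session features with one row per session. The function name `pca_project` and the 95% cumulative-variance cutoff are hypothetical choices for illustration; the thesis keywords indicate a MapReduce implementation, which this sketch does not reproduce. Note that no labels are needed at any step.

```python
import numpy as np

def pca_project(sessions, variance_to_keep=0.95):
    """Project session feature vectors onto the leading principal components.

    sessions: array-like of shape (n_sessions, n_features), unlabeled.
    variance_to_keep: hypothetical cutoff on cumulative explained variance.
    Returns (scores, components, eigenvalues) for the kept components.
    """
    X = np.asarray(sessions, dtype=float)

    # Standardizing: zero mean and unit variance per feature.
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant features at zero instead of dividing by 0
    Z = (X - X.mean(axis=0)) / std

    # Covariance matrix of the standardized features.
    cov = np.atleast_2d(np.cov(Z, rowvar=False))

    # Principal components: eigenvectors of the symmetric covariance matrix.
    # eigh returns eigenvalues in ascending order, so reverse them.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Selection of PCs: keep the fewest components reaching the cutoff.
    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(ratio, variance_to_keep)) + 1, len(eigvals))

    # Projection: each new feature is a linear combination of the originals.
    components = eigvecs[:, :k]
    return Z @ components, components, eigvals[:k]
```

Each column of the returned scores is a linear combination of the original features, and the columns are mutually uncorrelated, which is the property the abstract relies on. A different selection rule (for example, a fixed number of PCs) could be swapped in without changing the other steps.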

    Chapter 1: Introduction
    Chapter 2: Background & Related Works
        2.1 Background
            2.1.1 BotCluster
            2.1.2 Principal Component Analysis
        2.2 Related Works
    Chapter 3: Methodology
        3.1 Method overview
        3.2 Standardizing
        3.3 Covariance
        3.4 Principal Component
        3.5 Projection
        3.6 Selection of PC
    Chapter 4: Implementation
        4.1 Overview
        4.2 Standardizing
        4.3 Covariance Matrix
        4.4 Eigenvectors
        4.5 Projection
    Chapter 5: Experiments
        5.1 Experimental Environment
        5.2 Dataset Summary
        5.3 The Covariance matrix
        5.4 Selection of PC
        5.5 Performance Comparison of PCA and Information Gain
        5.6 Process time
        5.7 Discussion
    Chapter 6: Conclusion
    References

    [1] C.-Y. Wang, C.-L. Ou, Y.-E. Zhang, F.-M. Cho, P.-H. Chen, J.-B. Chang and C.-K. Shieh, "BotCluster: A session-based P2P botnet clustering system on NetFlow," Computer Networks, vol. 145, pp. 175-189, 2018.
    [2] P. Narang, J. M. Reddy, and C. Hota, "Feature Selection for Detection of Peer-to-Peer Botnet Traffic," ACM Compute, 2013.
    [3] A. Gang, H. Raja and W. U. Bajwa, "Fast and Communication-efficient Distributed PCA," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 7450-7454.
    [4] C. Pascoal, M. R. de Oliveira, R. Valadas, P. Filzmoser, P. Salvador and A. Pacheco, "Robust feature selection and robust PCA for internet traffic anomaly detection," 2012 Proceedings IEEE INFOCOM, Orlando, FL, 2012, pp. 1755-1763.
    [5] E. Biglar Beigi, H. Hadian Jazi, N. Stakhanova and A. A. Ghorbani, "Towards effective feature selection in machine learning-based botnet detection approaches," 2014 IEEE Conference on Communications and Network Security, San Francisco, CA, 2014, pp. 247-255.
    [6] F. V. Alejandre, N. C. Cortés and E. A. Anaya, "Feature selection to detect botnets using machine learning algorithms," 2017 International Conference on Electronics, Communications and Computers (CONIELECOMP), Cholula, 2017, pp. 1-7.
    [7] A. Guerra-Manzanares, H. Bahsi and S. Nõmm, "Hybrid Feature Selection Models for Machine Learning Based Botnet Detection in IoT Networks," 2019 International Conference on Cyberworlds (CW), Kyoto, Japan, 2019, pp. 324-327.
    [8] J. Shlens, "A Tutorial on Principal Component Analysis," Educational, 51, 2014.
    [9] K. A. Pituch and J. P. Stevens, Applied Multivariate Statistics for the Social Sciences, 6th ed. Routledge, Taylor & Francis Group, 2016.
    [10] H. Abdi, L. J. Williams and D. Valentin, "Multiple factor analysis: principal component analysis for multitable and multiblock data sets," WIREs Computational Statistics, 2013, pp. 149-179.
    [11] K. Tsuda, M. Kawanabe and K. Muller, "Clustering with the Fisher score," in Advances in Neural Information Processing Systems, Neural Information Processing Systems Foundation, 2003.
    [12] M. Hubert, P. J. Rousseeuw and K. Vanden Branden, "ROBPCA: A new approach to robust principal component analysis," Technometrics, 2005, pp. 64-79.
    [13] A. Fou, "With Digital Ad Fraud, You Don't Know What You Don't Know," Forbes, June 2020, accessed July 2020. https://www.forbes.com/sites/augustinefou/2020/06/21/with-digital-ad-fraud-you-dont-know-what-you-dont-know
    [14] National Center for High-performance Computing (NCHC), Taiwan Computing Cloud. https://www.nchc.org.tw/posts/Mr4kt9kyKc/taiwania2
    [15] Apache Hadoop MapReduce. https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
    [16] Weka, 2017. https://www.cs.waikato.ac.nz/ml/weka/
    [17] VirusTotal, 2017. https://www.virustotal.com/.
    [18] TaiWan Advanced Research and Education Network (TWAREN), 2017. http://www.twaren.net/.

    Full-text availability: on campus, public from 2022-07-23
    off campus, public from 2022-07-23