Graduate student: | 程祺鈞 Cheng, Qi-Jun |
---|---|
Thesis title: | 使用主成分分析方法改善特徵變動對Botcluster所造成的性能影響 Using principal component analysis to improve the performance degradation caused by feature variance for Botcluster |
Advisor: | 謝錫堃 Shieh, Ce-Kuen |
Co-advisor: | 張志標 Chang, Jyh-Biau |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering |
Year of publication: | 2020 |
Academic year of graduation: | 108 |
Language: | English |
Number of pages: | 40 |
Keywords (Chinese): | 點對點殭屍網路 (P2P botnet), 網路流 (NetFlow), 主成份分析 (principal component analysis), 特徵工程 (feature engineering), 對應歸納 (MapReduce) |
Keywords (English): | P2P botnet, NetFlow, PCA, Feature engineering, MapReduce |
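This study applies Principal Component Analysis (PCA) to BotCluster to obtain an appropriate feature set for botnet detection. In previous work, features were selected according to the Information Gain of each feature. With this feature set, BotCluster achieves about 97% precision on the 2016 dataset. However, precision drops noticeably when the same feature set is used on the 2020 dataset. We suspect that the feature selection has become outdated. Variants and new botnets appear every year. The key features of different botnets may differ, so updating the feature set is necessary. However, Information Gain can only be used on labeled datasets, which makes it difficult to update the feature set for each dataset with Information Gain. In this thesis, PCA replaces Information Gain to obtain an appropriate feature set. PCA is an unsupervised method, so it can extract features for any dataset. PCA finds the Principal Components of a given dataset, where each Principal Component is a linear combination of the original features and the components are mutually uncorrelated. Because it is unsupervised, we can run PCA directly after extracting sessions for any dataset. Combining PCA with BotCluster improves performance significantly. On the 2019/07 dataset, the average precision increases from 77.22% to 94.22%, and Direct IP increases from 190 to 316.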
In this thesis, we apply Principal Component Analysis (PCA) to BotCluster [1] to extract a proper feature set for its grouping phase. In previous work, features were selected according to the Information Gain of each feature. With this feature set, BotCluster achieves about 97% precision on the 2016 dataset. However, precision drops noticeably when the same feature set is applied to the 2020 dataset. We suspect that the feature selection is out of date. Mutations and new types of botnets emerge every year [13]. The critical features may differ among botnet varieties [6], so updating the feature set is essential. However, Information Gain requires a labeled dataset, so updating the feature set with Information Gain for each dataset is impractical. The proposed method replaces Information Gain with PCA to obtain an appropriate feature set. PCA is an unsupervised method that extracts new features for each dataset. It finds the Principal Components (PCs) of the specified dataset, where each PC is a linear combination of the original features and the PCs are uncorrelated with each other. Because PCA is unsupervised, we can perform it directly after the session extraction phase for each dataset. Combining PCA with BotCluster improves performance notably. On the 2019/07 dataset, the average precision increases from 77.22% to 94.22%, and Direct IP increases from 190 to 316.
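The properties of PCA named in the abstract (no labels needed, each PC a linear combination of the original features, PCs mutually uncorrelated) can be illustrated with a minimal sketch. This is not the thesis's implementation: the three session features (`packets`, `bytes_kb`, `duration`), the synthetic data, and the use of power iteration with deflation are all assumptions chosen to keep the example self-contained in standard-library Python.

```python
import math
import random

def center(rows):
    """Subtract the per-feature mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenpair(matrix, iterations=500, seed=0):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    d = len(matrix)
    rng = random.Random(seed)
    vec = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the unit vector gives the eigenvalue estimate.
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d))
    return value, vec

def pca(rows, n_components):
    """Return (components, variances, scores) for unlabeled rows."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components, variances = [], []
    for _ in range(n_components):
        value, vec = top_eigenpair(cov)
        components.append(vec)
        variances.append(value)
        # Deflation: remove the direction just found so the next pass finds the next PC.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)] for i in range(d)]
    # Each score is a linear combination of the centered original features.
    scores = [[sum(r[j] * pc[j] for j in range(d)) for pc in components] for r in centered]
    return components, variances, scores

# Hypothetical, synthetic session features: packets, bytes_kb, duration.
rng = random.Random(42)
sessions = []
for _ in range(200):
    packets = rng.uniform(10, 100)
    bytes_kb = 0.5 * packets + rng.gauss(0, 2)  # strongly correlated with packets
    duration = rng.uniform(1, 30)
    sessions.append([packets, bytes_kb, duration])

components, variances, scores = pca(sessions, 2)
print("PC weights:", components)
print("PC variances:", variances)
```

No label column appears anywhere in the input, which is the contrast with Information Gain drawn in the abstract. In practice, features on different scales are usually standardized before PCA; the sketch skips that step for brevity.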
[1] C.-Y. Wang, C.-L. Ou, Y.-E. Zhang, F.-M. Cho, P.-H. Chen, J.-B. Chang and C.-K. Shieh, "BotCluster: A session-based P2P botnet clustering system on NetFlow," Computer Networks, vol. 145, pp. 175-189, 2018.
[2] P. Narang, J. M. Reddy, and C. Hota, "Feature Selection for Detection of Peer-to-Peer Botnet Traffic," ACM Compute, 2013.
[3] A. Gang, H. Raja and W. U. Bajwa, "Fast and Communication-efficient Distributed PCA," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, 2019, pp. 7450-7454.
[4] C. Pascoal, M. R. de Oliveira, R. Valadas, P. Filzmoser, P. Salvador and A. Pacheco, "Robust feature selection and robust PCA for internet traffic anomaly detection," 2012 Proceedings IEEE INFOCOM, Orlando, FL, 2012, pp. 1755-1763.
[5] E. Biglar Beigi, H. Hadian Jazi, N. Stakhanova and A. A. Ghorbani, "Towards effective feature selection in machine learning-based botnet detection approaches," 2014 IEEE Conference on Communications and Network Security, San Francisco, CA, 2014, pp. 247-255.
[6] F. V. Alejandre, N. C. Cortés and E. A. Anaya, "Feature selection to detect botnets using machine learning algorithms," 2017 International Conference on Electronics, Communications and Computers (CONIELECOMP), Cholula, 2017, pp. 1-7.
[7] A. Guerra-Manzanares, H. Bahsi and S. Nõmm, "Hybrid Feature Selection Models for Machine Learning Based Botnet Detection in IoT Networks," 2019 International Conference on Cyberworlds (CW), Kyoto, Japan, 2019, pp. 324-327.
[8] J. Shlens, "A Tutorial on Principal Component Analysis," Educational, 51, 2014.
[9] K. A. Pituch and J. P. Stevens, Applied Multivariate Statistics for the Social Sciences, 6th ed. Routledge, Taylor & Francis Group, 2016.
[10] H. Abdi, L. J. Williams and D. Valentin, "Multiple factor analysis: principal component analysis for multitable and multiblock data sets," WIREs Computational Statistics, 2013, pp. 149-179.
[11] K. Tsuda, M. Kawanabe and K.-R. Müller, "Clustering with the Fisher score," in Advances in Neural Information Processing Systems, Neural Information Processing Systems Foundation, 2003.
[12] M. Hubert, P. J. Rousseeuw and K. Vanden Branden, "ROBPCA: A new approach to robust principal component analysis," Technometrics, 2005, pp. 64-79.
[13] A. Fou, "With Digital Ad Fraud, You Don't Know What You Don't Know," Forbes, June 2020. Accessed July 2020. https://www.forbes.com/sites/augustinefou/2020/06/21/with-digital-ad-fraud-you-dont-know-what-you-dont-know
[14] National Center for High-performance Computing (NCHC), Taiwan Computing Cloud. https://www.nchc.org.tw/posts/Mr4kt9kyKc/taiwania2
[15] Apache Hadoop, MapReduce Tutorial. https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
[16] Weka, 2017. https://www.cs.waikato.ac.nz/ml/weka/
[17] VirusTotal, 2017. https://www.virustotal.com/.
[18] TaiWan Advanced Research and Education Network (TWAREN), 2017. http://www.twaren.net/.