| Graduate student: | 張少懷 Zhang, Shao-Huai |
|---|---|
| Thesis title: | 基於機器學習技術之靜態PE格式惡意程式分類之研究 A Study on Static PE Malware Type Classification Using Machine Learning Techniques |
| Advisor: | 楊竹星 Yang, Chu-Sing |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering |
| Year of publication: | 2019 |
| Academic year of graduation: | 107 (ROC academic year) |
| Language: | Chinese |
| Pages: | 54 |
| Chinese keywords: | 惡意程式分類, 靜態分析, PE格式, 機器學習, 資料探勘 |
| Foreign-language keywords: | malware classification, static analysis, PE format, machine learning, data mining |
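In recent years, the volume of malware has grown year after year and new variants appear far faster than before, posing an ever-growing threat to a society that now depends heavily on computers. Malware detection systems have traditionally relied on signature matching, but as malware proliferates and evasion techniques improve, the labor and time needed to keep signature databases up to date have risen sharply. Machine learning has therefore been introduced into malware detection: it automatically analyzes and learns latent features from known samples and uses them to make predictions about unknown programs. Although machine learning cannot yet fully replace human analysts, it can serve as an important component of malware detection. Most earlier studies that applied machine learning to malware identification focused on the binary problem of malicious versus non-malicious; more recently, a growing number have addressed multiclass classification by malware family.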
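This study uses machine learning to predict whether a program is malicious and, if so, which type of malware it is. This gives finer-grained predictions than binary classification, and its practical value is more durable than that of classification by malware family. Starting from a malware dataset of as many as 800,000 samples, this study labels each sample with a malware type via VirusTotal. A large number of features are extracted from the samples by static analysis, and the classification performance of several classifier models is compared. The Random Forest model turns out to predict slightly better than the LightGBM model used by the reference work, reaching a micro F1 score of 0.95 and a macro F1 score of 0.90, while needing only half the training time. The study goes on to analyze the Random Forest model's performance on each malware category and, using malware samples from outside the training dataset as examples, shows that the final trained model can indeed identify unknown malware in real-world settings and predict the malware type it belongs to.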
This work aims to build an efficient, reliable, and practical static malware classification system for PE format files on the Windows platform using machine learning techniques. With static analysis, feature extraction and anomaly detection can be done without executing the binary sample. With a large-scale dataset, the trained model can learn more and perform better in practice. After a variety of machine learning models are compared, the best one is chosen as the final classifier in this work. Unlike previous works, which predict only whether a sample is malicious or non-malicious, this work also predicts which type of malware it is. With this additional information about malware type, the user can estimate the risk or damage such malware may bring. Beyond predicting a single malware type, this work can also output a probability for every possible malware type, which makes it more valuable in practice.
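The abstract reports both a micro F1 and a macro F1 score. The minimal sketch below illustrates how the two averages differ on an imbalanced multiclass problem. It is not the thesis's evaluation code: the function name `f1_scores`, the class names, and the toy labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[t] += 1  # the true class gains a false negative

    def f1(tp_, fp_, fn_):
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy data: a frequent class and a rare class (assumption).
y_true = ["benign"] * 8 + ["trojan"] * 2
y_pred = ["benign"] * 8 + ["benign", "trojan"]

micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

On this toy data the single error falls on the rare class, so the macro average is pulled down further than the micro average. A gap of that kind is one possible reading of a macro F1 below the micro F1, although the abstract itself does not state the cause.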