| Graduate student: | 劉大哲 Liu, Ta-Che |
|---|---|
| Thesis title: | 基於文本處理結合有效特徵選擇之惡意程式分類方法 Malware Classification based on N-gram and TF-IDF with Efficient Feature Set Reduction |
| Advisor: | 李忠憲 Li, Jung-Shian |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering |
| Year of publication: | 2019 |
| Academic year of graduation: | 107 |
| Language: | Chinese |
| Number of pages: | 52 |
| Keywords (Chinese): | 動態分析、惡意程式分類、機器學習、深度學習、特徵選擇 |
| Keywords (English): | Dynamic Analysis, Malware Classification, Machine Learning, Deep Learning, Feature Selection |
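With the rapid development of computers and the Internet, the convenience that networks bring has been accompanied by a rapid rise in information technology crime. Traditionally, malware is identified either by matching signatures against a signature database or by manual analysis based on expert experience. However, as malware becomes increasingly varied, signature matching can lose accuracy when samples are altered by techniques such as packing, and the exponential growth in the number of malware samples keeps enlarging the signature database. For these two reasons, traditional database matching can no longer classify malware effectively and efficiently, so researchers have turned to machine learning and deep learning to address the problem. This thesis studies a large collection of malware analysis reports provided by the National Center for High-performance Computing. An automated program extracts the function calls from each report and generates a document; n-gram and term frequency-inverse document frequency (TF-IDF), algorithms used in natural language processing for text data, then quantify the extracted text as meaningful numbers; finally, feature selection removes redundant features to reduce training time. In the experiments, we compare accuracy before and after applying TF-IDF, and the performance of the selected feature subset against the original features. The results show that the proposed method achieves 87.08% accuracy while saving 87.97% of the training time, and that it performs better than related work.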
Due to the rapid development of computers and the Internet, the convenience of the Internet has been accompanied by a rapid rise in information technology crime. The traditional way to identify malware is to compare samples against a signature database to determine the type of malware, or to rely on manual analysis based on expert experience. However, as malware grows more varied, the traditional database comparison method may lose accuracy when samples are altered by techniques such as packing, and because the number of malware samples is increasing exponentially, the signature database keeps growing. For these two reasons, the traditional database comparison method can no longer classify malware effectively and efficiently. Therefore, researchers have begun to study how to solve this problem through machine learning and deep learning.
In this thesis, we study a large number of malware analysis reports provided by the National Center for High-Performance Computing. An automated program extracts the function calls from each report to generate documents; n-gram and TF-IDF, algorithms used in natural language processing for text data, then quantify the extracted text and convert it into meaningful numbers. Finally, we eliminate redundant features through feature selection to reduce the training time cost.
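The steps described above (function-call documents, n-grams, TF-IDF weighting, feature reduction) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the thesis implementation: the API-call traces are made-up examples, the IDF is the unsmoothed natural-log form, and the "keep the top-k terms by summed TF-IDF" filter is an assumed stand-in, since the abstract does not name the feature selection method used.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_rows(docs, n=2):
    """Turn each document (a list of function-call names) into {n-gram: TF-IDF}."""
    counts = [Counter(ngrams(doc, n)) for doc in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    rows = []
    for c in counts:
        total = sum(c.values())
        rows.append({
            term: (k / total) * math.log(n_docs / df[term])
            for term, k in c.items()
        })
    return rows


def select_top_k(rows, k):
    """Keep only the k n-grams with the largest summed TF-IDF weight.

    Assumed, simplified filter used for illustration only.
    """
    totals = {}
    for row in rows:
        for term, weight in row.items():
            totals[term] = totals.get(term, 0.0) + weight
    keep = set(sorted(totals, key=totals.get, reverse=True)[:k])
    return [{t: w for t, w in row.items() if t in keep} for row in rows]


# Hypothetical function-call traces, one per analysis report.
traces = [
    ["CreateFileW", "WriteFile", "CloseHandle"],
    ["CreateFileW", "WriteFile", "RegSetValueExW"],
    ["RegOpenKeyExW", "RegSetValueExW", "CloseHandle"],
]

rows = tfidf_rows(traces, n=2)
reduced = select_top_k(rows, k=3)
```

Each dictionary in `reduced` is a sparse feature vector that could be passed to a classifier. In practice, libraries such as scikit-learn provide TF-IDF vectorizers with n-gram support, usually with a smoothed IDF, so the weights would differ slightly from this sketch.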
In the experiments, we compare the accuracy before and after applying TF-IDF, and the performance of the selected feature subset against the original feature set. The results show that the proposed method achieves 87.08% accuracy and saves 87.97% of the training time, and that it outperforms related work.
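The two reported quantities can be computed as follows. This is a minimal sketch that assumes the time saving is measured relative to training on the full feature set; the abstract does not spell out the formula, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def time_saving_percent(t_full, t_reduced):
    """Relative training-time saving of the reduced feature set, in percent.

    Assumes the saving is (t_full - t_reduced) / t_full.
    """
    if t_full <= 0:
        raise ValueError("t_full must be positive")
    return 100.0 * (t_full - t_reduced) / t_full
```

For example, with purely illustrative timings, `time_saving_percent(200.0, 50.0)` gives a 75% saving: the reduced feature set trains in a quarter of the original time.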