| Graduate student: | 吳哲宇 Wu, Che-Yu |
|---|---|
| Thesis title: | 基於機器學習之自動化釣魚網頁偵測系統研究 The Study on an Automatic Phishing Webpage Detection System based on Machine Learning |
| Advisor: | 楊竹星 Yang, Chu-Sing |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering |
| Year of publication: | 2020 |
| Academic year of graduation: | 108 |
| Language: | Chinese |
| Number of pages: | 63 |
| Chinese keywords: | 釣魚網頁 (phishing webpage), 機器學習 (machine learning), 模糊邏輯 (fuzzy logic), 雲端運算 (cloud computing) |
| Foreign-language keywords: | Phishing webpage, URL analysis, Fuzzy logic, Machine learning |
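As a long-standing form of network attack, phishing has always posed a threat to society, and now that everyone depends heavily on computers and the Internet, that threat is only growing. Earlier phishing pages typically imitated well-known web services in order to steal personal data. In recent years, spear-phishing attacks aimed at specific targets have appeared, in which a series of phishing pages is tailored to deceive a particular victim. Moreover, the vast majority of advanced persistent threat (APT) attacks, which have caused heavy financial losses to many companies, begin with phishing as the first step of the attack chain. Defense against phishing pages relies mostly on browsers maintaining and updating their blacklists to protect users. However, as phishing attacks increase year by year, blacklist updates cannot keep up with the pace of new attacks, so this protection is effectively useless against zero-day phishing pages that are not yet on any blacklist. Establishing a real-time mechanism that helps Internet users analyze web pages is therefore an important part of countering phishing attacks.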
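This study collects features of phishing pages based on web page characteristics that phishing attackers cannot evade, and uses machine learning to classify pages as legitimate or phishing. In the accuracy experiments, the F1-score reached 0.96. Beyond the analysis mechanism, this study also designs a complete front-end and back-end architecture: while the user browses, the required data is sent to a cloud analysis system, and if the result indicates a phishing page, the user is warned in time that they may have fallen into a trap set by a phishing attacker. This protects users from the threat of phishing attacks.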
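For reference, the F1-score is conventionally defined as the harmonic mean of precision and recall. The standard definition is given below, where $TP$, $FP$, and $FN$ are the counts of true positives, false positives, and false negatives. The abstract does not state which class is treated as positive; taking the phishing class as positive is an assumption here.

$$
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$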
As the Internet has become an essential part of people's lives, a growing number of people enjoy the convenience it brings, while more and more attacks come from its dark side. Exploiting weaknesses of human nature, attackers design deceptive phishing pages that entice web users to disclose their private and sensitive information themselves.
In this study, we propose a URL-based detection system that combines the URL of the web page and the URLs in the web page's source code as features, uses Levenshtein distance as the algorithm for calculating string similarity, and is supplemented by a machine learning architecture. The system is designed to provide detection results with high accuracy and a low false positive rate for unknown phishing pages.
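As an illustration of the string-similarity step, the following is a minimal Python sketch of Levenshtein distance computed with the classic dynamic-programming recurrence. It is not the thesis's implementation: the normalization in `similarity` and the helper `host_similarity`, which compares the hostname of the page URL with the hostname of a URL found in the page source, are hypothetical choices made here for illustration only.

```python
from urllib.parse import urlparse


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings.

    Dividing by the longer length is an assumption, not taken from the thesis.
    """
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def host_similarity(page_url: str, link_url: str) -> float:
    """Hypothetical feature: similarity of the two URLs' hostnames."""
    page_host = urlparse(page_url).hostname or ""
    link_host = urlparse(link_url).hostname or ""
    return similarity(page_host, link_host)
```

A score such as `host_similarity(page_url, link_url)` could then serve as one numeric feature among others fed to a classifier. How the thesis actually derives its features from these URLs is not specified in the abstract.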