
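Graduate student: Wu, Che-Yu (吳哲宇)
Thesis title: The Study on an Automatic Phishing Webpage Detection System based on Machine Learning (基於機器學習之自動化釣魚網頁偵測系統研究)
Advisor: Yang, Chu-Sing (楊竹星)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of publication: 2020
Academic year of graduation: 108 (ROC calendar, i.e. 2019-2020)
Language: Chinese
Number of pages: 63
Keywords (Chinese, translated): Phishing webpage, Machine learning, Fuzzy logic, Cloud computing
Keywords (English): Phishing webpage, URL analysis, Fuzzy logic, Machine learning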
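  • As a long-standing form of network attack, phishing has always posed a certain degree of threat to society, and now that everyone relies heavily on computers and the Internet, that threat is only growing. Earlier phishing webpages typically imitated well-known web services in order to steal personal data. In recent years, spear-phishing attacks aimed at specific targets have emerged, in which a series of phishing webpages is tailored to deceive a particular victim. Moreover, the vast majority of APT (advanced persistent threat) attacks, which have caused heavy financial losses for many enterprises, use phishing as the opening move of the attack chain. Defense against phishing webpages mostly relies on browsers maintaining and updating their blacklists to protect users. However, as phishing attacks increase year by year, blacklist updates inevitably cannot keep pace with new attacks, so this protection is effectively useless against 0-day phishing webpages that are not yet on any blacklist. Establishing a real-time mechanism that helps Internet users analyze webpages is therefore an important part of countering phishing attacks.
    This study collects features of phishing webpages based on webpage characteristics that phishing attackers cannot evade, and uses machine learning techniques to classify webpages as legitimate or phishing. In the accuracy experiments, the F1-score reached 0.96. Beyond the analysis mechanism, this study also designs a complete front-end and back-end architecture: while the user browses, the required data is sent to a cloud analysis system, and if the result indicates a phishing webpage, the user is warned in time that they may have walked into a trap set by a phishing attacker. The goal is to protect users from the threat of phishing attacks.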
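    For reference, the F1-score mentioned above is conventionally defined as the harmonic mean of precision and recall. The symbols TP, FP, and FN (true positives, false positives, false negatives) are the usual notation and are assumed here; this abstract does not state which class the thesis treats as positive or how the score was averaged.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```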

    As the Internet has become an essential part of people's lives, a growing number of people enjoy the convenience it brings, while more attacks are coming from the dark side of the Internet. Exploiting weaknesses of human nature, attackers design deceptive phishing pages that entice web users to voluntarily disclose private, sensitive information.
    In this study, we propose a URL-based detection system that combines the URL of the web page and the URLs found in the web page's source code as features, uses Levenshtein distance as the algorithm for calculating string similarity, and is supplemented by a machine learning architecture. The system is designed to detect unknown phishing pages with high accuracy and a low false positive rate.
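    The string-similarity idea described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names (`levenshtein`, `similarity`, `host_similarities`), the choice to compare hostnames only, and the normalization by the longer string's length are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def host_similarities(page_url: str, linked_urls: list[str]) -> list[float]:
    """Compare the page's hostname with the hostname of each URL found in its source."""
    page_host = urlparse(page_url).hostname or ""
    return [similarity(page_host, urlparse(u).hostname or "") for u in linked_urls]
```

    In a sketch like this, a page whose hostname is close to, but not the same as, the hostnames its source code links to (for example `paypa1.example` against `paypal.example`, both made-up names) would produce a high but imperfect similarity, which could serve as one input feature for a classifier.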

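    Abstract
    Extended English Abstract
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1. Introduction
        1.1 Research Background
        1.2 Research Motivation
        1.3 Research Objectives
        1.4 Thesis Organization
    Chapter 2. Related Work
        2.1 Phishing Kit
        2.2 Existing Detection Mechanisms
            2.2.1 Black list
            2.2.2 Content-based Detection
            2.2.3 URL-based Detection
        2.3 Machine Learning
            2.3.1 Data Preprocessing
            2.3.2 Logistic Regression
            2.3.3 Support Vector Machine
        2.4 Fuzzy Logic Learning Process
        2.5 Edit distance
    Chapter 3. System Design
        3.1 System Architecture
        3.2 Data Collection Module
        3.3 Feature Extraction Module
            3.3.1 Webpage Lifetime
            3.3.2 Webpage Relevance
            3.3.3 URL Analysis
            3.3.4 Web Crawler
            3.3.5 Similarity Comparison
            3.3.6 Pattern match
            3.3.7 Summary
        3.4 Machine Learning Modeling Module
        3.5 Machine Learning Classification Module
        3.6 Cloud Server Module
        3.7 Cache Module
        3.8 Front-end Agent Module
    Chapter 4. Experimental Results and Analysis
        4.1 System Environment
        4.2 Datasets and Evaluation Criteria
        4.3 Machine Learning Model Testing
        4.4 Comparison with Other Phishing Detection Mechanisms
            4.4.1 Comparison with Automated Detection Mechanisms
            4.4.2 Comparison with Chrome Extension Phishing Detection Mechanisms
            4.4.3 Summary
        4.5 Performance Evaluation
    Chapter 5. Conclusion
        5.1 Research Contributions
        5.2 Future Research Directions
    References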


    Full-text availability: on campus, public from 2025-02-13; off campus, public from 2025-02-13