Graduate student: Chen, Qun-Kai (陳群凱)
Thesis title: Predicting Fake News with Text Mining: A Case Study of the Russian-Ukrainian War (以文字探勘方法預測假新聞-以烏俄戰爭為例)
Advisor: Chen, Mu-Yen (陳牧言)
Degree: Master
Department: College of Engineering - Department of Engineering Science (on the job class)
Year of publication: 2023
Academic year of graduation: 111 (ROC calendar)
Language: Chinese
Number of pages: 51
Chinese keywords: 文字探勘, 假新聞偵測, 深度學習, 機器學習, 烏俄戰爭
English keywords: Text Mining, Fake News Detection, Deep Learning, Machine Learning, Ukrainian-Russian War
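  • In this information-rich era, the share of people who get their news from television or newspapers is gradually declining, replaced by news read on major news websites or social media platforms. In an age when anyone can act as their own media outlet, the barrier to producing fake news keeps falling, and a great deal of false news spreads online; detecting fake news has therefore become an unavoidable problem. The Russian-Ukrainian War, which broke out in February 2022, drew intense attention worldwide and in Taiwan, and during this period fake news and rumors circulated online more widely than ever before. This not only confuses the public but may also threaten national security. This study therefore aims to develop a model that detects fake news effectively and classifies fake news about the Russian-Ukrainian War.
    This study collects news related to the Russian-Ukrainian War from Cofacts, MyGoPen, the Taiwan FactCheck Center (台灣事實查核中心), and UDN (聯合新聞網) as the sample dataset. Text features are extracted with TF-IDF (Term Frequency-Inverse Document Frequency), and models are built with the machine learning algorithms SVM (Support Vector Machine), Random Forest (RF), and XGBoost (eXtreme Gradient Boosting). For deep learning, T-BERT (Taiwan-Bidirectional Encoder Representations from Transformers) is selected and fine-tuned. The experimental results show that SVM performs best overall among the machine learning models, with accuracy reaching 96%, while the deep learning model T-BERT reaches 98.75% accuracy and still maintains accuracy above 94% with a small number of samples. The trained model is also used to build a prediction web page for fake news about the Russian-Ukrainian War, so that suspected fake news can be checked in the future.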
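
    The TF-IDF feature extraction step named in the abstract can be illustrated with a minimal sketch. The record does not state which TF-IDF variant the thesis uses, so the weighting below (term count divided by document length, times log(N / df)) is an assumption, and the function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-segmented toy documents (not from the thesis dataset).
docs = [
    ["war", "rumor", "video", "rumor"],
    ["war", "report", "official"],
    ["war", "rumor", "photo"],
]
weights = tfidf(docs)
```

    A term that appears in every document ("war" here) receives weight zero, while terms concentrated in few documents score higher. Vectors of this kind are what a classifier such as SVM, Random Forest, or XGBoost would then be trained on.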

    In this age of information, the proportion of people who follow the news through television or newspapers is gradually decreasing, replaced by browsing news on news websites or social media platforms. With the proliferation of self-media, the threshold for creating fake news keeps falling, and much false news spreads on the internet. How to detect fake news has therefore become an unavoidable problem. The Russian-Ukrainian War, which broke out in February 2022, attracted close attention worldwide and in Taiwanese society, and during this period fake news and rumors circulated on the internet more widely than ever before. This situation not only causes public confusion but may also threaten national security. This study therefore aims to develop an effective model for detecting fake news and to classify fake news about the Russian-Ukrainian War. The dataset consists of news related to the Russian-Ukrainian War collected from Cofacts, MyGoPen, the Taiwan FactCheck Center, and UDN. Text features are extracted with TF-IDF, and models are built with the machine learning algorithms SVM, Random Forest, and XGBoost. The deep learning algorithm is T-BERT, which is fine-tuned for the task. The experimental results show that SVM has the best overall performance among the machine learning models, with an accuracy of 96%, while the deep learning model T-BERT reaches an accuracy of 98.75%. Even with a small number of samples, its accuracy remains above 94%. The trained model is also used to build a web page for predicting fake news about the Russian-Ukrainian War, which can be used to check suspected fake news in the future.
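
    The accuracy figures reported above are the share of test items labelled correctly. The sketch below shows how accuracy can be derived from the counts of a binary confusion matrix. The abstract reports only accuracy; precision, recall, and F1 are included as commonly paired metrics and are an assumption here, as are the function name `binary_metrics`, the label strings, and the toy labels.

```python
def binary_metrics(y_true, y_pred, positive="fake"):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts, treating `positive` as the target class.
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for illustration only (not results from the thesis).
y_true = ["fake", "fake", "fake", "real", "real", "real", "real", "fake"]
y_pred = ["fake", "fake", "real", "real", "real", "fake", "real", "fake"]
metrics = binary_metrics(y_true, y_pred)
```

    On a balanced test set accuracy alone is informative; when fake and real items are unevenly represented, precision and recall for the "fake" class give a clearer picture.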

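    Abstract
    Extended Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
        1.1 Research Background
        1.2 Research Motivation and Objectives
        1.3 Research Structure
    Chapter 2 Literature Review
        2.1 Fake News
        2.2 Text Mining
        2.3 Web Crawlers
        2.4 Feature Extraction
        2.5 Machine Learning
            2.5.1 SVM
            2.5.2 Random Forests
            2.5.3 XGBoost
        2.6 Deep Learning
    Chapter 3 Research Method
        3.1 Dataset
        3.2 Data Preprocessing
            3.2.1 Data Cleaning
            3.2.2 Word Segmentation
            3.2.3 Stop-Word Removal
        3.3 Feature Construction
        3.4 Model Training
        3.5 Model Evaluation
    Chapter 4 Experimental Results
        4.1 Experimental Environment
        4.2 Word Cloud
        4.3 Keyword Difference Analysis
        4.4 Model Training and Evaluation Metrics
            4.4.1 Machine Learning with a 7:3 Train/Test Split
            4.4.2 Machine Learning with an 8:2 Train/Test Split
            4.4.3 Machine Learning with 10-Fold Cross-Validation
            4.4.4 Validation of the Deep Learning Model BERT
            4.4.5 Real-Time Verification Platform for the Russian-Ukrainian War
        4.5 Discussion of Experimental Results
    Chapter 5 Conclusions and Future Outlook
        5.1 Conclusions
        5.2 Research Limitations
        5.3 Future Research
    References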

    Full-text availability: open immediately on campus; open immediately off campus