| Graduate Student: | 林冠名 Lin, Guan-Ming |
|---|---|
| Thesis Title: | 運用LDA建置新冠肺炎假訊息檢測模型 Using Latent Dirichlet Allocation to Construct a COVID-19 Fake News Detection Model |
| Advisor: | 陳牧言 Chen, Mu-Yen |
| Degree: | 碩士 Master |
| Department: | 工學院 - 工程科學系碩士在職專班 College of Engineering - Department of Engineering Science (On-the-Job Master's Program) |
| Year of Publication: | 2022 |
| Academic Year of Graduation: | 111 |
| Language: | 中文 Chinese |
| Number of Pages: | 49 |
| Keywords (Chinese): | 隱含狄利克雷分配、詞頻-逆向文件頻率、CKIP、T-BERT、假訊息 |
| Keywords (English): | Latent Dirichlet Allocation, TF-IDF, CKIP, T-BERT, Fake News |
In today's information-saturated era, information and knowledge spread rapidly through social media platforms, and the public often believes incorrect or maliciously fabricated fake news and passes it on, affecting society to varying degrees. This process is referred to as an "infodemic." The term first appeared during the 2003 SARS epidemic, when large volumes of false information and rumors not only made the public health crisis harder to control but also spread quickly around the world through various channels, with consequences for national security, the economy, and politics. The problem therefore has to be curbed at its source.

This study applies Latent Dirichlet Allocation (LDA), a topic-modeling method, and experimentally compares it in combination with TF and TF-IDF features. Combining LDA with TF-IDF features improves the F1-score of both the SVM and Random Forest models, with the SVM showing the most significant gain: under ten-fold cross-validation, its average F1-score increases by 1.13%, reaching 98.10%, with an accuracy of 98.04% and an accuracy standard deviation of 0.95%. T-BERT, which is based on the pre-trained BERT model, achieves the highest accuracy of all models at 98.36%.
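The following is a minimal sketch, not the thesis's actual pipeline, of the feature setup the abstract describes: sparse TF-IDF term features are concatenated with dense LDA topic proportions and fed to an SVM evaluated by ten-fold cross-validation. It assumes a scikit-learn implementation; the toy English corpus, the number of topics, and the linear kernel are illustrative placeholders (the thesis works on Chinese texts segmented with tools such as CKIP, with its own hyperparameters).

```python
# Sketch: TF-IDF + LDA topic features -> SVM, ten-fold cross-validation.
# All data and hyperparameters below are illustrative, not the thesis's settings.
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score

# Toy corpus; Chinese texts would first be word-segmented (e.g., CKIP or jieba)
# into space-separated tokens before vectorization.
texts = ["vaccine causes illness rumor", "official vaccine guidance released",
         "miracle cure claim spreads", "health authority issues statement"] * 25
labels = np.array([1, 0, 1, 0] * 25)  # 1 = fake, 0 = real (illustrative)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
accs, f1s = [], []
for train_idx, test_idx in skf.split(texts, labels):
    train_docs = [texts[i] for i in train_idx]
    test_docs = [texts[i] for i in test_idx]

    # TF-IDF term features (sparse).
    tfidf = TfidfVectorizer()
    X_train_tfidf = tfidf.fit_transform(train_docs)
    X_test_tfidf = tfidf.transform(test_docs)

    # LDA topic proportions learned from raw term counts (dense, low-dimensional).
    counts = CountVectorizer()
    lda = LatentDirichletAllocation(n_components=5, random_state=42)
    X_train_lda = lda.fit_transform(counts.fit_transform(train_docs))
    X_test_lda = lda.transform(counts.transform(test_docs))

    # Concatenate both feature views, then classify with an SVM.
    X_train = hstack([X_train_tfidf, X_train_lda])
    X_test = hstack([X_test_tfidf, X_test_lda])
    clf = SVC(kernel="linear")
    clf.fit(X_train, labels[train_idx])
    preds = clf.predict(X_test)

    accs.append(accuracy_score(labels[test_idx], preds))
    f1s.append(f1_score(labels[test_idx], preds))

print(f"Accuracy: {np.mean(accs):.4f} ± {np.std(accs):.4f}")
print(f"F1-score: {np.mean(f1s):.4f}")
```

The design idea this illustrates is that LDA compresses each document into a small vector of topic proportions, which complements the high-dimensional sparse TF-IDF representation when both are concatenated as classifier input.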
On-campus access: publicly available from 2028-01-11.