| Graduate Student: | 張瓊云 Zhang, Qiong-Yun |
|---|---|
| Thesis Title: | 整合TF-IDF和BERT特徵的圖神經網路方法用於處理長文本的單標籤分類 Integrating TF-IDF and BERT Features in Graph Neural Networks for Long Text Single-Label Classification |
| Advisor: | 李韶曼 Lee, Shao-Man |
| Degree: | Master |
| School / Program: | 敏求智慧運算學院 (Miin Wu School of Computing) - MS Degree Program on Intelligent Technology Systems |
| Year of Publication: | 2023 |
| Graduation Academic Year: | 111 (2022-2023) |
| Language: | English |
| Pages: | 65 |
| Keywords: | BERT, graph neural network, long text classification, TF-IDF, single-label |
For text classification, extracting informative features is crucial, yet the length and complexity of long texts make accurate feature extraction difficult. Traditional approaches such as Term Frequency-Inverse Document Frequency (TF-IDF) ignore semantic relationships and contextual information, a significant drawback when applied to long texts. BERT, in contrast, captures contextual word representations by considering the entire sequence, but its fixed input size still limits it when handling long-text classification.
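As a concrete illustration of this contrast (not code from the thesis), the short Python sketch below builds TF-IDF vectors with scikit-learn and shows BERT's 512-token input limit via the Hugging Face tokenizer; the two sample documents are placeholders.

```python
# A minimal sketch contrasting the two feature extractors.
# Assumes scikit-learn and Hugging Face transformers are installed.
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoTokenizer

docs = [
    "Graph neural networks capture relationships between documents.",
    "BERT encodes each token in the context of the whole sequence.",
]

# TF-IDF: one sparse bag-of-words vector per document; word order and
# context are discarded, which is the drawback noted above.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(X.shape)  # (2, vocabulary_size)

# BERT: contextual, but the encoder accepts at most 512 subword tokens,
# so a long article would be silently truncated here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(docs[0], truncation=True, max_length=512)
print(len(enc["input_ids"]))
```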
This study investigates integrating TF-IDF and BERT features to address the challenges of long-text single-label classification, leveraging the complementary advantages of the two methods. Graph Neural Networks (GNNs) can capture relationships between documents and have demonstrated excellent performance across various applications, so a GNN is introduced to combine the strengths of TF-IDF and BERT. The approach transforms the corpus into a graph in which each article is a node, and the study analyzes how to balance the benefits of BERT and TF-IDF between node attributes and edge weights. To accommodate BERT's input size restriction, text segmentation is applied, and the GNN is then trained on the resulting graph.
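The following Python sketch shows one plausible instantiation of this pipeline, assuming PyTorch, PyTorch Geometric, transformers, and scikit-learn: BERT chunk embeddings (mean-pooled [CLS] vectors over segments that fit the 512-token limit) serve as node attributes, while TF-IDF cosine similarity supplies edge weights. The chunk length, pooling scheme, similarity threshold, and the two-layer GCN are illustrative choices, not the thesis's exact configuration.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

def node_feature(text: str, chunk_len: int = 510) -> torch.Tensor:
    """Segment into chunks that fit BERT, encode each, mean-pool [CLS]."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [ids[i:i + chunk_len] for i in range(0, len(ids), chunk_len)] or [[]]
    vecs = []
    with torch.no_grad():
        for c in chunks:
            # Re-add the special tokens around each 510-token chunk.
            inp = torch.tensor([[tokenizer.cls_token_id] + c + [tokenizer.sep_token_id]])
            vecs.append(bert(input_ids=inp).last_hidden_state[0, 0])
    return torch.stack(vecs).mean(dim=0)

def build_graph(docs, labels, threshold=0.1):
    # Node attributes from segmented BERT; edge weights from TF-IDF similarity.
    x = torch.stack([node_feature(d) for d in docs])
    sim = cosine_similarity(TfidfVectorizer().fit_transform(docs))
    src, dst, w = [], [], []
    for i in range(len(docs)):
        for j in range(len(docs)):
            if i != j and sim[i, j] > threshold:  # threshold is an assumption
                src.append(i); dst.append(j); w.append(float(sim[i, j]))
    return Data(x=x,
                edge_index=torch.tensor([src, dst], dtype=torch.long),
                edge_weight=torch.tensor(w, dtype=torch.float),
                y=torch.tensor(labels))

class DocGCN(torch.nn.Module):
    """Two-layer GCN over the document graph (illustrative depth)."""
    def __init__(self, in_dim, hidden, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, num_classes)

    def forward(self, data):
        h = F.relu(self.conv1(data.x, data.edge_index, data.edge_weight))
        return self.conv2(h, data.edge_index, data.edge_weight)
```

In such a transductive setup, the GCN would be trained with cross-entropy over the labeled document nodes so that representations propagate along the TF-IDF-weighted edges; for the multi-label experiments the output layer would use per-class sigmoids instead of a softmax.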
The experimental analysis examines the impact of text segmentation on long-text classification and compares the proposed approach with other models on long-text single-label classification. To gauge its capability on more complex tasks, the method is also evaluated on long-text multi-label datasets, and its robustness is tested on datasets with shorter texts. The results show that text segmentation overcomes BERT's input size limitation and that the proposed method outperforms the other models on long-text single-label classification. The study also highlights the importance of choosing appropriate node attributes and edge weights within the graph structure, offering an effective solution to long-text single-label classification that lets the model cover information throughout an article more broadly.