| Author: | 唐郁秀 Tang, Yu-Siou |
|---|---|
| Thesis Title: | KE-BERT:預訓練知識增強詞表示模型於語言理解 KE-BERT: Pre-training of Knowledge-Enhanced Word Representation For Language Understanding |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2022 |
| Graduation Academic Year: | 110 (2021–2022) |
| Language: | English |
| Pages: | 87 |
| Keywords (Chinese): | language model, word representation, text corpus, knowledge engineering, deep learning, natural language processing |
| Keywords (English): | Word representation, language model, text corpus, knowledge engineering, natural language understanding |
Word representations are widely used in natural language processing (NLP) tasks. Contextual Word Representations (CWRs) are word representations obtained by attending to the surrounding context of a text; they have been applied to a wide range of NLP tasks and form part of today's state-of-the-art natural language techniques. Contextual word representations are trained on large amounts of text in an unsupervised manner. Many studies have revealed that this training method lets models memorize a large number of grammatical and sentence-level relations, but it does not lead to a genuine understanding of human knowledge or high-level semantics. To make contextual word representations carry high-level semantic information, this thesis proposes a new word representation training framework named Knowledge-Enhanced BERT (KE-BERT). Factual knowledge is introduced as the means of enhancing the semantics of contextual word representations, and it is injected through supervised learning on the entities and relations in sentences. The word representations produced by KE-BERT are knowledge-enhanced contextual word representations, which effectively store factual knowledge inside the representations. Experimental results show that the proposed method improves on conventional contextual word representations and achieves good results on many natural language understanding tasks and knowledge-related tasks. On the CoNLL 2003 English dataset, the proposed method achieved an F1 of 92.8, 0.4 higher than RoBERTa; on the TACRED dataset, it achieved an F1 of 71.8, 0.5 higher than RoBERTa; and on the GLUE benchmark, it reached an average score of 90.0, 0.9 higher than RoBERTa's average. In addition, the experiments show that the proposed method (90.0) achieved a better average result on many natural language understanding tasks than previous approaches (89.6) that inject knowledge information into word representations. However, those previous approaches have the advantage of storing more knowledge information, which lets them perform well on knowledge-related tasks: they reached F1 scores of 94.3 and 72.7 on the CoNLL 2003 English dataset and the TACRED dataset, respectively.
To train a knowledge-enhanced word representation model through self-training and supervised learning, this thesis creates A Large Corpus for Relation Extraction and Entity Recognition (CREER). To obtain multiple semantic annotations of plain text, the author uses the Wikipedia dataset as the source of plain text and annotates it with Stanford's open-source semantic annotation tools. The corpus can serve as a training set or benchmark for semantics-related tasks and will benefit future natural language system development.
Distributional word representations are widely used in natural language processing systems, and Contextual Word Representations (CWRs) are word representations obtained by attending to the context of a text. Recently, contextual word representations have become an indispensable part of state-of-the-art natural language techniques. Many studies have shown that the methods used to learn contextual word representations enable models to memorize a large number of grammatical and sentence-level relations, but not to truly understand human knowledge or high-level semantics. To make contextual word representations carry high-level semantic information, this thesis proposes a new word representation training framework, KE-BERT, which injects factual knowledge into contextual word representations through supervised learning as a means of enhancing their semantics. Experimental results show that the proposed method improves on conventional contextual word representations and achieves good results on many natural language understanding tasks and knowledge-related tasks. On the CoNLL 2003 English dataset, the proposed method achieved an F1 of 92.8, which is 0.4 higher than that of RoBERTa; on the TACRED dataset, it achieved an F1 of 71.8, which is 0.5 higher than that of RoBERTa; and on the GLUE benchmark, it achieved an average score of 90.0, an improvement of 0.9 over RoBERTa's average. In addition, the experimental results show that the proposed method (90.0) achieved a better average result on many natural language understanding tasks than previous works (89.6) that injected knowledge information into word representations. However, previous works that introduced knowledge-specific parameters and mechanisms have the advantage of storing more knowledge information, which enables them to achieve good results on knowledge-related tasks.
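The abstract describes supervised entity- and relation-level objectives as the way factual knowledge is injected into a BERT-style encoder. As a rough illustration of that general idea only, and not the thesis' actual KE-BERT architecture, the sketch below attaches a token-level entity-recognition head and a sentence-level relation-classification head to a RoBERTa encoder; the model name, label counts, and pooling strategy are illustrative assumptions.

```python
# Minimal sketch (not the thesis' exact design): supervised entity and relation
# heads on top of a pre-trained encoder, so factual knowledge enters the word
# representations through multi-task training. Label sizes are placeholders.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class KnowledgeEnhancedEncoder(nn.Module):
    def __init__(self, model_name="roberta-base", num_entity_tags=9, num_relations=42):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Token-level head for entity recognition (e.g. BIO-style tags).
        self.entity_head = nn.Linear(hidden, num_entity_tags)
        # Sentence-level head for classifying the relation between marked entities.
        self.relation_head = nn.Linear(hidden, num_relations)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        token_states = out.last_hidden_state            # (batch, seq_len, hidden)
        entity_logits = self.entity_head(token_states)  # per-token entity tags
        # Use the first token's state as a simple sentence representation.
        relation_logits = self.relation_head(token_states[:, 0])
        return entity_logits, relation_logits

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = KnowledgeEnhancedEncoder()
batch = tokenizer(["Barack Obama was born in Hawaii."], return_tensors="pt")
entity_logits, relation_logits = model(batch["input_ids"], batch["attention_mask"])
# In a setup like this, the token-level entity loss and the sentence-level
# relation loss would be combined with the usual masked-language-model loss.
```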
To train a knowledge-enhanced word representation model through self-training and supervised learning, the author created the CREER corpus (A Large Corpus for Relation Extraction and Entity Recognition). To obtain multiple semantic annotations of plain text, the author used the Wikipedia dataset as the source of plain text and annotated it with the Stanford CoreNLP annotators. The CREER corpus can serve as a training set or benchmark for semantics-related tasks and will benefit future natural language system development.
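The abstract states that the Wikipedia plain text was annotated with Stanford CoreNLP. As a hedged sketch of how such annotation can be driven from Python, assuming a local CoreNLP installation (with CORENLP_HOME set) and the stanza client package, the snippet below requests token-, NER-, dependency-, and OpenIE-level annotations for one sentence; the exact annotator list and output schema used for CREER may differ.

```python
# Sketch only: annotating plain text with Stanford CoreNLP via the stanza client.
# Assumes the CoreNLP Java distribution is installed and CORENLP_HOME is set.
from stanza.server import CoreNLPClient

text = "Alan Turing was an English mathematician born in London."

with CoreNLPClient(
    annotators=["tokenize", "ssplit", "pos", "lemma", "ner", "depparse", "openie"],
    timeout=60000,
    memory="6G",
    be_quiet=True,
) as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        # Token-level annotations: surface form, POS tag, and NER label.
        print([(t.word, t.pos, t.ner) for t in sentence.token])
        # OpenIE triples give (subject, relation, object) candidates that can
        # serve as supervision for relation extraction.
        for triple in sentence.openieTriple:
            print(triple.subject, "|", triple.relation, "|", triple.object)
```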