Graduate student: 黎修文
Li, Hsiu-Wen
Thesis title: 使用預訓練知識和偽標籤遷移技術改進無監督式中文分詞表現
Improved Unsupervised Chinese Word Segmentation Using Pre-trained Knowledge and Pseudo-labeling Transfer
Advisor: 高宏宇
Kao, Hung-Yu
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2023
Academic year of graduation: 111 (ROC calendar)
Language: English
Number of pages: 43
Keywords (Chinese): 自然語言處理, 非監督式, 中文分詞, 預訓練語言模型
Keywords (English): Natural Language Processing, Unsupervised, Chinese word segmentation, Pre-trained language model
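  • In recent years, large pre-trained language models have been widely applied to many downstream natural language processing tasks, such as text classification, story generation, named entity recognition, and question answering. Drawing on the rich semantic knowledge these models contain, performance on these tasks has improved markedly, and Chinese word segmentation has benefited as well.
    The mainstream supervised approach to Chinese word segmentation today usually casts the problem as a sequence labeling task: each character is classified, the classification results identify word boundaries, and words are assembled accordingly. The rich sentence-level semantic and structural information provided by large pre-trained language models improves classification quality.
    In unsupervised Chinese word segmentation (UCWS), the recent work that achieved state-of-the-art (SOTA) scores likewise benefits from the pre-trained knowledge of a language model. That work exploits the properties of the masked language model to probe the internal knowledge of the pre-trained model and obtain the relationships between characters in a sentence; combined with a self-training mechanism, it surpasses the scores of earlier UCWS research, but the method requires considerable training time.
    This thesis proposes speed and performance improvements over the SOTA UCWS method. Under a pseudo-labeling framework, the proposed method combines the implicit segmentation signal from an unsupervised segmentation model with a BERT-based classifier. On eight Chinese word segmentation tasks, the proposed method shows a significant performance improvement and a significant reduction in training time compared with the previous SOTA method.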

    In recent years, large pre-trained language models have been extensively applied to various natural language processing (NLP) tasks, including text classification, story generation, named entity recognition, question answering, and Chinese word segmentation (CWS). Their rich semantic knowledge has significantly enhanced the performance of these tasks.
    The current mainstream supervised approach treats CWS as a sequence tagging task, classifying each character and then composing word units from the classification results. The pre-trained language model, which supplies sentence-level semantic and structural information, improves the accuracy of this classifier.
    In the field of unsupervised Chinese word segmentation (UCWS), the recent research that achieved state-of-the-art (SOTA) scores also takes advantage of the pre-trained knowledge of the language model.
    The SOTA work in UCWS uses the characteristics of the masked language model to probe the internal knowledge of the pre-trained model and obtain the relationships between characters in a sentence, and combines this with a self-training mechanism to surpass the scores of previous UCWS research.
    However, this method consumes a considerable amount of training time. This work proposes improvements in both speed and performance for the SOTA approach in UCWS. The proposed method combines the implicit segmentation signal from the unsupervised segmentation model with the BERT-based classifier under a pseudo-labeling framework. On eight Chinese word segmentation tasks, our proposed method demonstrates substantial performance improvement and a significant reduction in training time compared to the previous SOTA approach.
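    The character-tagging view of segmentation and the pseudo-labeling step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the BMES tag set, the function names `words_to_tags` and `tags_to_words`, and the example segmentation are all assumptions chosen for illustration, since the abstract does not state which tagging schema is used. The sketch shows how a segmentation produced by an unsupervised model could be turned into per-character labels for a classifier, and how predicted labels could be turned back into words.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Convert a segmented sentence into per-character BMES tags.

    B = begin, M = middle, E = end of a multi-character word,
    S = single-character word. (BMES is an assumed schema.)
    """
    tags: List[str] = []
    for word in words:
        if not word:
            continue  # ignore empty segments
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their predicted BMES tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word boundary follows this character
            words.append(current)
            current = ""
    if current:  # flush a trailing word left open by imperfect predictions
        words.append(current)
    return words


# Hypothetical output of an unsupervised segmenter, used as a pseudo-label source.
pseudo_segmentation = ["自然", "語言", "處理"]
pseudo_labels = words_to_tags(pseudo_segmentation)

# A classifier trained on such labels would predict one tag per character;
# here the pseudo-labels themselves are decoded to show the round trip.
decoded = tags_to_words("".join(pseudo_segmentation), pseudo_labels)
```

    In a setup like this, `pseudo_labels` would serve as training targets for the character classifier, and `tags_to_words` would be applied to its predictions at inference time.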

    Abstract (Chinese)
    Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1. Introduction
    1.1 Chinese Word Segmentation (CWS)
    1.2 Supervised Manner
    1.3 Unsupervised Manner
    1.4 Large Pre-trained Language Model
    1.5 Motivation
    1.6 Our Work
    Chapter 2. Related Work
    2.1 Dictionary-based Method
    2.2 Statistical Model
    2.3 Neural Network Model
    2.4 Pre-trained Language Model
    2.4.1 Language Modeling Objective
    2.4.2 Perturbed Masking: Probing Masked Language Model
    2.5 Unsupervised Neural Model for Word Segmentation
    2.5.1 Segment Language Model
    2.5.2 The State-Of-The-Art Unsupervised Chinese Word Segmentation Method
    Chapter 3. Methodology
    3.1 Segment Model
    3.2 Classifier
    3.2.1 Tagging Schema
    3.3 Two-stage Training Framework
    Chapter 4. Experiment
    4.1 Datasets
    4.2 Implementation Details
    4.2.1 Preprocess
    4.2.2 Pre-trained Character Embedding
    4.2.3 Hyperparameter and Training Detail
    4.3 Result
    4.3.1 Re-implementation Comparison
    4.4 Training Time Comparison
    4.5 Comparison of Model Performance on Different Segmentation Lengths
    Chapter 5. Analysis
    5.1 Does Pre-trained Knowledge Really Help?
    5.2 Apply Self-training Scenario to Our Framework
    5.3 Apply Different Tagging Schema
    5.4 Apply Different Pre-trained Model to Our Framework
    5.5 Case Study
    5.6 Apply the Two-Stage Training to Existing Word Segmentation Tools
    Chapter 6. Discussion and Conclusion
    References

    Full-text access: on campus, open from 2024-05-26; off campus, open from 2024-05-26