
Graduate Student: Lin, Chun-Yi (林峻毅)
Thesis Title: Improving Multi-Criteria Chinese Word Segmentation through Learning Sentence Representation (利用學習語句表達改進中文斷詞對於不同標準的適應性)
Advisor: Kao, Hung-Yu (高宏宇)
Degree: Master
Department: Graduate Program of Artificial Intelligence, College of Electrical Engineering and Computer Science (電機資訊學院 人工智慧科技碩士學位學程)
Publication Year: 2023
Graduation Academic Year: 111 (ROC calendar)
Language: English
Pages: 44
Chinese Keywords: 自然語言處理、中文斷詞
Keywords: Natural Language Processing, Chinese Word Segmentation

    Words are the fundamental units that carry complete meaning in a language. Segmented words can be used in many downstream tasks or simply make text easier to read. But unlike Western languages, which use spaces to separate words, Chinese has no explicit boundaries between words. Chinese word segmentation (CWS) is the task of finding the boundaries between Chinese words. In recent years, supervised CWS has achieved competitive performance by using pre-trained language models. However, a sentence can be segmented in more than one valid way: the correct segmentation depends on how the annotator comprehends the sentence. Since every criterion is correct, we need a method that lets the model distinguish the criterion of each dataset. We define this task as multi-criteria Chinese word segmentation (MCCWS). To deal with this problem, we prepend a criterion token to each input. The criterion token represents the standard of its dataset, and can also be viewed as a control token that selects the criterion used to segment a given sentence. Another common problem in CWS is the out-of-vocabulary (OOV) problem: new Chinese words are coined every day, and no dataset can include them all, so correctly segmenting sentences that contain new words is a critical problem. We propose a novel method that trains the CWS model with an auxiliary sentence representation task, letting the model perform better when a sentence contains OOV words. Our approach reaches the state-of-the-art average F1 score and OOV recall on multi-criteria Chinese word segmentation.
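The criterion-token mechanism described above can be sketched in a few lines. This is a minimal illustration, not the thesis's exact implementation: the token names (e.g. `[PKU]`, `[MSR]`) are hypothetical, and the BMES character-tagging scheme is assumed as the decoder's label set.

```python
# Sketch: prepend a criterion token to select a segmentation standard,
# and label each character with a BMES boundary tag.
# Token names and the tag scheme are illustrative assumptions.

BMES = {"B": 0, "M": 1, "E": 2, "S": 3}  # Begin / Middle / End / Single

def build_example(words, criterion):
    """Build model input tokens and per-character BMES labels.

    words:     gold segmentation, e.g. ["我們", "愛", "自然語言"]
    criterion: dataset identifier, e.g. "PKU" or "MSR"
    """
    chars, tags = [], []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
        chars.extend(w)
    # The criterion token acts as a control token: at inference time,
    # swapping it changes which standard the model segments by.
    input_tokens = [f"[{criterion}]"] + chars
    labels = [BMES[t] for t in tags]
    return input_tokens, labels

tokens, labels = build_example(["我們", "愛", "自然語言"], "PKU")
# tokens: ['[PKU]', '我', '們', '愛', '自', '然', '語', '言']
# labels: B E S B M M E -> [0, 2, 3, 0, 1, 1, 2]
```

At inference time the same sentence prefixed with a different criterion token would be decoded under that dataset's standard, which is what makes the token a control signal rather than ordinary input.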

    摘要 (Chinese Abstract) i
    Abstract ii
    Table of Contents iii
    List of Tables vi
    List of Figures ix
    Chapter 1. Introduction 1
      1.1 Introduction to Chinese Word Segmentation 1
      1.2 Application of Chinese Word Segmentation 1
      1.3 Difficulty of Chinese Word Segmentation 2
        1.3.1 SCCWS 3
        1.3.2 MCCWS 3
      1.4 Motivation 4
        1.4.1 Simplify the existing method 4
        1.4.2 Sentence representation 5
      1.5 Our Work 6
    Chapter 2. Related Work 8
      2.1 Transformer Encoder 8
      2.2 BERT 9
        2.2.1 Masked language model (MLM) 10
        2.2.2 Next sentence prediction (NSP) 10
      2.3 Private-structure-based MCCWS model 11
      2.4 Input-hint-based MCCWS model 12
      2.5 Sentence Representation 13
    Chapter 3. Methodology 15
      3.1 Problem definition 15
      3.2 CWS model definition 16
        3.2.1 Input Sequence 16
        3.2.2 Encoder 17
        3.2.3 Decoder 17
        3.2.4 Criterion Classification 18
        3.2.5 Sentence Representation 18
        3.2.6 Total loss 19
      3.3 Training overview 20
        3.3.1 Encoder overview 20
        3.3.2 Decoder overview 21
        3.3.3 Criterion Classifier 22
        3.3.4 Sentence Representation 23
    Chapter 4. Experiment 25
      4.1 Datasets 25
      4.2 Preprocess 25
      4.3 Hyperparameters 26
      4.4 Evaluation Method 26
        4.4.1 F1 score 27
        4.4.2 OOV recall 27
      4.5 Main Results 28
        4.5.1 SOTA F1 score 28
        4.5.2 SOTA OOV recall 28
        4.5.3 Real OOV recall 29
    Chapter 5. Analysis 32
      5.1 Ablation Study 32
        5.1.1 Remove criteria classification 32
        5.1.2 Remove Sentence Representation task 33
        5.1.3 Remove both additional losses 33
      5.2 Different Sentence representation training methods 33
        5.2.1 MSE loss 34
        5.2.2 Cosine Embedding loss 34
        5.2.3 ArcCSE 35
        5.2.4 Conclusion 35
      5.3 Criteria controlling 35
      5.4 Dataset mislabeling 36
      5.5 Segmenting sentences by referring to the context 38
      5.6 Analysis of the influence of the α in our loss function 38
    Chapter 6. Discussion and Conclusion 40
    References 41
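The evaluation methods listed in Chapter 4, word-level F1 score and OOV recall, can be sketched as follows. This is the standard formulation of these CWS metrics (precision and recall over gold/predicted word spans; OOV recall restricted to gold words absent from the training vocabulary), not necessarily the thesis's exact scorer; the example words and vocabulary are made up.

```python
# Sketch of the standard CWS metrics: word-level F1 over character spans,
# and OOV recall (recall restricted to gold words unseen in training).

def spans(words):
    """Map a word list to (start, end, word) character spans."""
    out, pos = [], 0
    for w in words:
        out.append((pos, pos + len(w), w))
        pos += len(w)
    return out

def evaluate(gold, pred, train_vocab):
    """Return (f1, oov_recall) for one segmented sentence."""
    g, p = set(spans(gold)), set(spans(pred))
    correct = g & p                      # spans segmented exactly right
    prec = len(correct) / len(p)
    rec = len(correct) / len(g)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    # OOV recall: of the gold words not in the training vocabulary,
    # how many did the model recover exactly?
    oov_gold = {t for t in g if t[2] not in train_vocab}
    oov_recall = len(oov_gold & p) / len(oov_gold) if oov_gold else 0.0
    return f1, oov_recall

# "生物" is OOV here and the prediction misses it entirely.
f1, oov = evaluate(["研究", "生物", "學"], ["研究生", "物", "學"], {"研究", "學"})
# f1 ≈ 0.333, oov = 0.0
```

In the multi-criteria setting, Chapter 4's "average F1 score" aggregates this per-dataset F1 across all benchmark corpora.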

