Graduate student: Chou, Tzu-Hsuan (周子軒)
Thesis title: A Criterion-Choosing Auto-Mechanism for Multi-Criteria Chinese Word Segmentation (多重準則中文斷詞任務上的自動化準則抉擇機制)
Advisor: Kao, Hung-Yu (高宏宇)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2023
Academic year of graduation: 112 (ROC calendar)
Language: English
Number of pages: 32
Keywords (Chinese): 自然語言處理, 中文斷詞, 多重準則中文斷詞
Keywords (English): Natural Language Processing, Chinese Word Segmentation, Multi-Criteria Chinese Word Segmentation
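  • Recent Multi-Criteria Chinese Word Segmentation (MCCWS) models have nearly caught up with Single-Criteria Chinese Word Segmentation (SCCWS) models in performance. Compared with SCCWS, MCCWS is a solution closer to practical use. However, known MCCWS research still lacks one key element, which keeps MCCWS from being usable in practice: these studies provide no mechanism for choosing a criterion. In this study, we propose a new criterion-denoising objective; an MCCWS model trained with this objective gains a criterion-choosing auto-mechanism. Experimental results show that combining this auto-mechanism with an input-hint-based MCCWS model achieves state-of-the-art performance on multiple Chinese word segmentation datasets simultaneously, further advancing both the results and the practicality of MCCWS research.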

    Recent work on Multi-Criteria Chinese Word Segmentation (MCCWS) has achieved performance on par with Single-Criteria Chinese Word Segmentation (SCCWS). Although MCCWS models are a more practical solution than their single-criteria alternatives, existing ones still lack one ability essential to real-world use: they cannot suggest which criterion to apply to a given text input. In this work, we propose a novel criterion-denoising objective that makes our MCCWS model capable of choosing criteria automatically. Results show that adding a criterion-choosing auto-mechanism on top of an input-hint-based MCCWS model achieves state-of-the-art results on multiple datasets simultaneously, pushing MCCWS work in a more robust and practical direction.
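    The abstract does not spell out how the criterion-denoising objective is built, so the following is only a minimal sketch of one plausible reading: the criterion hint placed in front of the input is occasionally replaced by a placeholder, and the model is still asked to name the original criterion. The token name `[UNC]`, the bracketed hint format, and the helper names `build_input` and `denoise_example` are assumptions made for illustration and do not come from the thesis.

```python
import random
from typing import List, Tuple

# Hypothetical placeholder token; the record does not name the real one.
UNKNOWN_CRITERION = "[UNC]"


def build_input(chars: str, criterion: str) -> List[str]:
    """Prefix a character sequence with a criterion hint token (assumed format)."""
    return [f"[{criterion}]"] + list(chars)


def denoise_example(
    chars: str, criterion: str, rate: float, rng: random.Random
) -> Tuple[List[str], str]:
    """Return (model input tokens, criterion label to predict).

    With probability `rate` the hint is swapped for UNKNOWN_CRITERION, so a
    model trained on the pair has to infer the criterion from the text alone.
    """
    if rng.random() < rate:
        tokens = [UNKNOWN_CRITERION] + list(chars)
    else:
        tokens = build_input(chars, criterion)
    return tokens, criterion


if __name__ == "__main__":
    rng = random.Random(0)
    # "PKU" is used here only as an example criterion label.
    for _ in range(3):
        print(denoise_example("我愛北京", "PKU", rate=0.5, rng=rng))
```

    Under this reading, a model that sees the placeholder at inference time could fall back on its own criterion prediction, which is one way an automatic criterion choice might be obtained; the thesis itself should be consulted for the actual formulation.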

    Abstract (Chinese)
    Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1. Introduction
    Chapter 2. Related Work
    Chapter 3. Methodology
      3.1 Problem Definition
        3.1.1. SCCWS
        3.1.2. Input-hint-based MCCWS
      3.2 Model Definition
        3.2.1. Input Format
        3.2.2. Encoder
        3.2.3. Decoder
        3.2.4. Criterion Classification
        3.2.5. Total Loss
      3.3 Criterion Denoising
      3.4 Efficiency
    Chapter 4. Experiments
      4.1 Datasets
      4.2 Hyperparameters
      4.3 Main Results
        4.3.1. SoTA F1-score
        4.3.2. Noisy but near SoTA
        4.3.3. SoTA OOV Recall
        4.3.4. Auto Mechanism
      4.4 Ablation Study
        4.4.1. Increase Criterion Denoising Rate
        4.4.2. Reduce Maximum Sentence Length
        4.4.3. Criterion Classifier
        4.4.4. Case Study
    Chapter 5. Conclusion
    References

    Full-text availability: on campus, public immediately; off campus, public immediately.