
Student: 李熙堃 (Lee, She-Kun)
Title: 具前後處理校正模型的領域特定命名實體增強型神經網路機器翻譯
(Domain-Specific Named Entity Enhanced NMT with Pre-/Post-Processing Correction Model)
Advisor: 盧文祥 (Lu, Wen-Hsiang)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2024
Academic Year of Graduation: 112 (2023–2024)
Language: English
Pages: 55
Keywords: Named Entity Recognition, Neural Machine Translation, Hybrid Machine Translation, Paraphrase, Synonym, Syntax Correction
    Taiwan has entered an aging society, and at the same time a large number of Indonesian migrant workers have arrived, making communication between the elderly and these foreign workers a significant concern. Many neural network translation models exist today, but they generally perform relatively poorly when translating proper nouns. In this work, we used Meta's open-source training toolkit Fairseq to train deep learning models and incorporated Named Entities into the translation model's training process; we also identified the domain each sentence belongs to and used that domain to translate its named entities. Finally, drawing on both Chinese and Indonesian, we designed paraphrasing and simplification functions for Chinese to reduce the impact of real-world input on the model, studied Indonesian syntax and usage, and applied syntactic corrections to the machine translation output. We used Google Translate to construct parallel corpora and added preprocessing and postprocessing modules after training. Experimental results on 300 external test sentences showed only a 4.25% word choice error rate, a 3.5% over-translation error rate, and a 1.75% under-translation error rate. These results demonstrate that integrating Named Entity Recognition with domain detection can effectively improve a translation model's accuracy in handling proper nouns. In the future, we plan to further optimize these functions and extend them to more language pairs, enhancing the accuracy and efficiency of cross-language communication.
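    The preprocessing pipeline described in the abstract — detecting a sentence's domain from a domain dictionary, masking named entities with placeholders before NMT, and restoring their domain-specific translations afterwards — can be sketched as follows. This is a minimal illustrative sketch, not the thesis's actual code: the lexicon entries, function names, and `NE0`-style placeholder format are all assumptions.

    ```python
    # Hypothetical domain lexicons mapping Chinese named entities to their
    # Indonesian translations; the thesis builds these per domain.
    NE_LEXICON = {
        "medical": {"高血壓": "hipertensi", "止痛藥": "obat penghilang rasa sakit"},
        "daily":   {"夜市": "pasar malam"},
    }

    def detect_domain(sentence, lexicon):
        """Pick the domain whose lexicon matches the most entities in the sentence."""
        scores = {domain: sum(1 for ne in terms if ne in sentence)
                  for domain, terms in lexicon.items()}
        return max(scores, key=scores.get)

    def substitute_entities(sentence, lexicon):
        """Preprocess: replace known named entities with placeholders before NMT.

        Returns the masked sentence and a placeholder -> target-translation map.
        """
        domain = detect_domain(sentence, lexicon)
        mapping = {}
        for i, (ne, target) in enumerate(lexicon[domain].items()):
            if ne in sentence:
                placeholder = f"NE{i}"
                sentence = sentence.replace(ne, placeholder)
                mapping[placeholder] = target
        return sentence, mapping

    def restore_entities(translated, mapping):
        """Postprocess: swap each placeholder for its domain-specific translation."""
        for placeholder, target in mapping.items():
            translated = translated.replace(placeholder, target)
        return translated

    masked, mapping = substitute_entities("我有高血壓", NE_LEXICON)
    # masked sentence ("我有NE0") would go through the NMT model here;
    # afterwards the placeholder is replaced by the lexicon translation:
    print(restore_entities("Saya menderita NE0", mapping))  # → Saya menderita hipertensi
    ```

    In practice the placeholder must be a token the NMT model has learned to copy through unchanged (e.g., seen during training or protected from BPE segmentation); otherwise the model may mangle it and the restoration step will fail.
    
    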

    Abstract
    Acknowledgments
    Table of Contents
    List of Tables
    List of Figures
    Table of Formulas
    Chapter 1. Introduction
      1.1 Motivation
      1.2 Issues regarding the modern MT system
      1.3 Paper Contribution
    Chapter 2. Related Works
      2.1 Machine Translation
        2.1.1 Rule-Based Machine Translation (RBMT)
        2.1.2 Neural Machine Translation (NMT)
        2.1.3 Hybrid Machine Translation (HMT)
      2.2 Text Simplification in Machine Translation
        2.2.1 Text Simplification Using Synonym Substitution
        2.2.2 Text Simplification in Machine Translation Pipeline
      2.3 OOV in Machine Translation
    Chapter 3. System Architecture and Methodology
      3.1 Overview
      3.2 Sentence Segmentation and Tokenization
        3.2.1 Text Normalization and Tokenization
        3.2.2 Sentence Segmentation
      3.3 Term Substitution Models
        3.3.1 Preprocessing: Synonym Substitution Function
        3.3.2 Preprocessing: Named Entity Substitution Function
      3.4 Semantic-Based Sentence-Level Domain Detection
        3.4.1 Sentence Domains and Domain Dictionary
        3.4.2 Sentence Domain Detection Model
      3.5 Neural Machine Translation Training
        3.5.1 Byte Pair Encoding
        3.5.2 Base Model
      3.6 Post-Process Syntax Error Correction Model
        3.6.1 Question Sentence Patterns in Indonesian
        3.6.2 Conjunction Sentence Patterns in Indonesian
        3.6.3 Predicate Sentence Patterns in Indonesian
        3.6.4 Modifiers and Miscellaneous Sentence Patterns
    Chapter 4. Experiment
      4.1 Data Collection
        4.1.1 Data Cleaning
        4.1.2 Named Entity Data Augmentation
      4.2 Preprocess: Synonym Substitution
        4.2.1 Evaluation Method
        4.2.2 Result
        4.2.3 Error Analysis
      4.3 Preprocess: NER Substitution
        4.3.1 Evaluation Method
        4.3.2 Result
        4.3.3 Error Analysis
      4.4 Semantic-Based Sentence Domain Detection
        4.4.1 Evaluation Method
        4.4.2 Internal Test Result
        4.4.3 External Test Result
        4.4.4 Error Analysis
      4.5 Transformer NMT Model
        4.5.1 Evaluation Method
        4.5.2 Result
        4.5.3 Error Analysis
          4.5.3.1 Grammatical Mistakes
          4.5.3.2 Wrong Word Selection
          4.5.3.3 Over-Translation
          4.5.3.4 Under-Translation
      4.6 Post-Process Syntax Correction Model Experiment
        4.6.1 Result
    Chapter 5. Conclusion and Future Works
    References


    On-campus access: public from 2029-08-28
    Off-campus access: public from 2029-08-28
    The electronic thesis has not yet been authorized for public release; for the print copy, please consult the library catalog.