
Author: Li, Jia-Zhen (李佳臻)
Thesis Title: Enhancing Neural Machine Translation between Traditional Chinese and Korean Utilizing Korean Hanja and Bopomofo (利用韓文漢字與注音符號增強繁體中文與韓文之間的神經機器翻譯)
Advisor: Horton, Paul (賀保羅)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of Publication: 2024
Academic Year of Graduation: 112 (2023-2024)
Language: English
Number of Pages: 64
Keywords (Chinese): 神經機器翻譯
Keywords (English): Neural Machine Translation

Abstract:

    Neural Machine Translation (NMT) has significantly facilitated cross-linguistic communication, but translating between languages with different scripts and structures remains challenging. While NMT has made substantial progress in translating between English and other Western languages, it faces considerable challenges when translating between Traditional Chinese and Korean. These challenges arise from three main issues: structural differences, morphological complexity, and the lack of parallel data between Traditional Chinese and Korean.
    To address the lack of parallel data, this paper utilizes web crawlers to collect translated texts from TED Talks and employs the Sentence-BERT model to estimate sentence similarity, thereby creating a parallel corpus for Traditional Chinese and Korean.
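    The alignment step can be sketched as follows, assuming the sentence-transformers package and a multilingual checkpoint; the model name and the 0.7 similarity threshold here are illustrative assumptions, not the thesis's exact settings:

        # Sketch: align Traditional Chinese and Korean TED sentences by
        # Sentence-BERT cosine similarity. Model name and threshold are
        # illustrative assumptions, not the author's exact configuration.
        from sentence_transformers import SentenceTransformer, util

        model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

        def align_sentences(zh_sentences, ko_sentences, threshold=0.7):
            """Greedily pair each Chinese sentence with its most similar Korean sentence."""
            zh_emb = model.encode(zh_sentences, convert_to_tensor=True)
            ko_emb = model.encode(ko_sentences, convert_to_tensor=True)
            sims = util.cos_sim(zh_emb, ko_emb)  # (len(zh), len(ko)) similarity matrix
            pairs = []
            for i, row in enumerate(sims):
                j = int(row.argmax())
                if float(row[j]) >= threshold:  # keep only confident alignments
                    pairs.append((zh_sentences[i], ko_sentences[j], float(row[j])))
            return pairs

        print(align_sentences(["我喜歡學習語言。"], ["나는 언어 배우는 것을 좋아한다."]))

    Each Chinese sentence is matched to its highest-scoring Korean candidate, and low-similarity pairs are discarded rather than added to the parallel corpus.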
    Moreover, this paper proposes a novel approach to enhance NMT between Traditional Chinese and Korean by using Korean Hanja characters and Bopomofo. Hanja (Chinese characters used in written Korean) and Bopomofo can provide richer semantic context and improve translation accuracy. We primarily use a Chinese BERT model to obtain word embeddings for Chinese sentences, Hanja sentences, and Bopomofo sentences. To ensure that the embeddings of the Hanja and Bopomofo sentences are meaningful, we extend the tokenizer with Korean and Bopomofo tokens.
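    A minimal sketch of the Bopomofo conversion and tokenizer extension, assuming the Dragon Mapper package and the ckiplab/bert-base-chinese checkpoint cited in this thesis; the per-character conversion and the example sentence are illustrative, not the author's exact procedure:

        # Sketch: convert a Traditional Chinese sentence to Bopomofo with Dragon
        # Mapper, add the resulting syllables as tokens, and obtain BERT embeddings.
        from dragonmapper import hanzi
        from transformers import BertModel, BertTokenizerFast

        sentence = "我喜歡學習語言"
        # One Bopomofo syllable per character (most common reading).
        bopomofo = " ".join(hanzi.to_zhuyin(ch) for ch in sentence)

        tokenizer = BertTokenizerFast.from_pretrained("ckiplab/bert-base-chinese")
        model = BertModel.from_pretrained("ckiplab/bert-base-chinese")

        # Extend the vocabulary so Bopomofo syllables map to single tokens instead
        # of unknown pieces; the new embeddings start randomly initialized and are
        # learned during training.
        vocab = tokenizer.get_vocab()
        new_tokens = [t for t in bopomofo.split() if t not in vocab]
        tokenizer.add_tokens(new_tokens)
        model.resize_token_embeddings(len(tokenizer))

        inputs = tokenizer(bopomofo, return_tensors="pt")
        embeddings = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
        print(embeddings.shape)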
    We evaluate our translation results using the Bilingual Evaluation Understudy (BLEU) score and assess their effectiveness through empirical experiments, discussing their impact on translation quality. Our research findings indicate that the combination of Hanja and Bopomofo effectively enhances the accuracy and fluency of NMT systems between Traditional Chinese and Korean.
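    For the evaluation itself, a minimal sketch using the sacrebleu package (one possible implementation of the BLEU metric; the sample hypothesis and reference are illustrative):

        # Sketch: corpus-level BLEU with sacrebleu's Chinese tokenization.
        from sacrebleu.metrics import BLEU

        hypotheses = ["我喜歡學習語言"]   # system outputs (Traditional Chinese)
        references = [["我喜歡學語言"]]   # one reference stream, aligned with hypotheses

        bleu = BLEU(tokenize="zh")        # character-level tokenization for Chinese
        print(bleu.corpus_score(hypotheses, references))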

    Table of Contents:
    Chinese Abstract
    Abstract
    Acknowledgements
    Contents
    List of Tables
    List of Figures
    Nomenclature
    1 Introduction
      1.1 Background
        1.1.1 Neural Machine Translation
        1.1.2 Traditional Chinese-Korean Neural Machine Translation
        1.1.3 Korean Hanja History
    2 Related Work
      2.1 Naver Dictionary
      2.2 Sentence-BERT
      2.3 BERT Model
        2.3.1 Ckiplab/bert-base-chinese model
        2.3.2 Kim/bert-kor-base model
    3 Methods
      3.1 Data Collection
        3.1.1 Google Translation
        3.1.2 Web Crawling from TED Talks
        3.1.3 Align Traditional Chinese - Korean Sentence Pairs with Sentence-BERT
      3.2 Korean Hanja Data Collection
        3.2.1 KoNLPy
        3.2.2 Web Crawling from Naver Dictionary
        3.2.3 Transform Korean Sentence to Hanja Sentence with Sentence-BERT
      3.3 Bopomofo Data Extraction
        3.3.1 Dragon Mapper
      3.4 Tokenization
        3.4.1 HuggingFace Tokenizers
        3.4.2 Add Korean and Bopomofo Tokens
        3.4.3 Comparison of Tokenizers
      3.5 Embedding
        3.5.1 BERT
        3.5.2 Joint Embedding
      3.6 Model
        3.6.1 Framework
        3.6.2 Cross Entropy Loss
    4 Experiment
      4.1 Dataset
      4.2 BLEU Score
      4.3 Result
    5 Discussion
      5.1 Case Study
      5.2 Embedding Analysis
        5.2.1 Principal Component Analysis
        5.2.2 t-Distributed Stochastic Neighbor Embedding
    6 Conclusions and Future Work
      6.1 Conclusions
      6.2 Future Work
    Bibliography

    [1] Jacob Devlin et al. “Bert: Pre-training of deep bidirectional transformers for language understanding”. arXiv preprint arXiv:1810.04805 (2018).
    [2] Zhenliang Guo et al. “CNA: A Dataset for Parsing Discourse Structure on Chinese News Articles”. 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI). 2022, pp. 990–995. DOI: 10.1109/ICTAI56018.2022.00151.
    [3] Lee Hae-jin. Naver Dictionary. 1999. URL: https://dict.naver.com/.
    [4] SuHun Han. Googletrans 3.0.0. 2020. URL: https://pypi.org/project/googletrans/.
    [5] Douwe Kiela, Changhan Wang, and Kyunghyun Cho. “Dynamic Meta-Embeddings for Improved Sentence Representations”. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Ed. by Ellen Riloff et al. Brussels, Belgium: Association for Computational Linguistics, Oct. 2018, pp. 1466–1477. DOI: 10.18653/v1/D18-1176. URL: https://aclanthology.org/D18-1176.
    [6] Kiyoung Kim. Pretrained Language Models For Korean. https://github.com/kiyoungkim1/LMkor. 2020.
    [7] Zhenzhong Lan et al. “Albert: A lite bert for self-supervised learning of language representations”. arXiv preprint arXiv:1909.11942 (2019).
    [8] Mike Lewis et al. “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension”. arXiv preprint arXiv:1910.13461 (2019).
    [9] Yinhan Liu et al. “Roberta: A robustly optimized bert pretraining approach”. arXiv preprint arXiv:1907.11692 (2019).
    [10] Laurens van der Maaten and Geoffrey Hinton. “Visualizing Data using t-SNE”. Journal of Machine Learning Research 9.86 (2008), pp. 2579–2605. URL: http://jmlr.org/papers/v9/vandermaaten08a.html.
    [11] Andrzej Maćkiewicz and Waldemar Ratajczak. “Principal components analysis (PCA)”. Computers & Geosciences 19.3 (1993), pp. 303–342. ISSN: 0098-3004. DOI: 10.1016/0098-3004(93)90090-R. URL: https://www.sciencedirect.com/science/article/pii/009830049390090R.
    [12] Kishore Papineni et al. “Bleu: a Method for Automatic Evaluation of Machine Translation”. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Ed. by Pierre Isabelle, Eugene Charniak, and Dekang Lin. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, July 2002, pp. 311–318. DOI: 10.3115/1073083.1073135. URL: https://aclanthology.org/P02-1040.
    [13] Eunjeong L. Park and Sungzoon Cho. “KoNLPy: Korean natural language processing in Python”. Proceedings of the 26th Annual Conference on Human Cognitive Language Technology. Chuncheon, Korea, Oct. 2014.
    [14] Alec Radford et al. “Language models are unsupervised multitask learners”. OpenAI blog 1.8 (2019), p. 9.
    [15] Colin Raffel et al. “Exploring the limits of transfer learning with a unified text-to-text transformer”. Journal of machine learning research 21.140 (2020), pp. 1–67.
    [16] Nils Reimers and Iryna Gurevych. “Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation”. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Nov. 2020. URL: https://arxiv.org/abs/2004.09813.
    [17] Nils Reimers and Iryna Gurevych. “Sentence-bert: Sentence embeddings using Siamese bert-networks”. arXiv preprint arXiv:1908.10084 (2019).
    [18] Richard Saul Wurman and Harry Marks. TED Talks. 1984. URL: https://www.ted.com/.
    [19] Thomas Roten. Dragon Mapper. 2014. URL: https://github.com/tsroten/dragonmapper.
    [20] Zijun Sun et al. “ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information”. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Ed. by Chengqing Zong et al. Online: Association for Computational Linguistics, Aug. 2021, pp. 2065–2075. DOI: 10.18653/v1/2021.acl-long.161. URL: https://aclanthology.org/2021.acl-long.161.
    [21] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. “Sequence to sequence learning with neural networks”. Advances in neural information processing systems 27 (2014).
    [22] Ashish Vaswani et al. “Attention is all you need”. Advances in neural information processing systems 30 (2017).
    [23] Mu Yang. ckiplab/bert-base-chinese. 2020. URL: https://huggingface.co/ckiplab/bert-base-chinese.
    [24] Zhilu Zhang and Mert Sabuncu. “Generalized cross entropy loss for training deep neural networks with noisy labels”. Advances in neural information processing systems 31 (2018).
    [25] Jinhua Zhu et al. “Incorporating bert into neural machine translation”. arXiv preprint arXiv:2002.06823 (2020).
