
Graduate Student: Sung, Hsiao-Hsuan (宋筱萱)
Title: Expanding Japanese Translations in Moedict using Googletrans and Wikipedia Crawling
Advisor: Horton, Paul (賀保羅)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Academic Year of Graduation: 111 (ROC calendar)
Language: English
Pages: 49
Keywords: Web crawling, Automated translation, Text corpus, Text processing

Moedict is an open-source online Mandarin dictionary that provides word translations in several languages, helping foreign learners of Chinese understand the meanings of Chinese words and phrases. Currently, Moedict supports English, German, and French, but does not yet include Japanese translations.

Traditionally, multilingual dictionaries have been compiled through manual translation, which ensures accuracy but demands substantial manpower and time. The goal of this study is therefore to add Japanese translations to Moedict using data crawled from Wikipedia together with the Googletrans library. We also analyze the translation accuracy of these two methods to assess their potential for expanding Moedict's Japanese coverage.
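The thesis does not reproduce its collection code in this abstract; the following is a minimal sketch of the two sources in Python, assuming hypothetical helper names, the public googletrans API cited by the thesis, and a Wikipedia crawler that follows zh→ja interlanguage links through the MediaWiki API.

```python
import requests
from googletrans import Translator  # pip install googletrans

translator = Translator()

def googletrans_zh_to_ja(word: str) -> str:
    """Machine-translate a Moedict headword directly from Chinese to Japanese."""
    return translator.translate(word, src='zh-tw', dest='ja').text

def googletrans_en_to_ja(english_gloss: str) -> str:
    """Pivot through an English translation already recorded in Moedict."""
    return translator.translate(english_gloss, src='en', dest='ja').text

def wikipedia_zh_to_ja(zh_title: str):
    """Follow the zh->ja interlanguage link of a Chinese Wikipedia article, if any."""
    resp = requests.get(
        'https://zh.wikipedia.org/w/api.php',
        params={'action': 'query', 'titles': zh_title,
                'prop': 'langlinks', 'lllang': 'ja', 'format': 'json'},
        timeout=10,
    )
    for page in resp.json()['query']['pages'].values():
        for link in page.get('langlinks', []):
            return link['*']  # title of the linked Japanese article
    return None
```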

In this study, we primarily use Wiktionary, a reliable crowd-edited dictionary, as the standard for evaluating the accuracy of the automatic translations and the translations crawled from Wikipedia. For automatic translation, we use Googletrans in two ways: translating Moedict's Chinese words directly into Japanese, and extracting the English translations already recorded in Moedict and translating those into Japanese. We then evaluate all three translation sources against Wiktionary with three methods: the Sentence Inclusion Criteria, the Sentence Equivalence Assessment, and the Sentence Similarity Calculation Method. The similarity method computes sentence similarity in the Sentence-BERT style, using the rinna/japanese-roberta-base model as the encoder, and we analyze how translation accuracy changes at similarity thresholds of 0.7, 0.75, and 0.8. Additionally, we explore the effect of the mean token strategy versus the max token strategy when combining word vectors into a sentence vector.
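Below is a minimal sketch of the three criteria, with the similarity method mean- or max-pooling the encoder's last hidden states into sentence vectors and comparing them by cosine similarity. The string-matching rules for equivalence and inclusion are our reading of the method names, not code from the thesis; the [CLS] and position-id handling follows the rinna model card.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = 'rinna/japanese-roberta-base'
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=False)
tokenizer.do_lower_case = True  # recommended on the rinna model card
model = AutoModel.from_pretrained(MODEL)
model.eval()

def embed(sentence: str, strategy: str = 'mean') -> torch.Tensor:
    """Encode a sentence, then pool token vectors with the mean or max token strategy."""
    # rinna's tokenizer does not prepend [CLS] automatically, so add it by hand.
    inputs = tokenizer('[CLS]' + sentence, return_tensors='pt')
    # The model was pretrained with 0-based position ids; pass them explicitly.
    position_ids = torch.arange(inputs['input_ids'].size(1)).unsqueeze(0)
    with torch.no_grad():
        tokens = model(**inputs, position_ids=position_ids).last_hidden_state.squeeze(0)
    return tokens.mean(dim=0) if strategy == 'mean' else tokens.max(dim=0).values

def sentence_equivalent(candidate: str, reference: str) -> bool:
    """Sentence Equivalence Assessment: exact string match."""
    return candidate == reference

def sentence_included(candidate: str, reference: str) -> bool:
    """Sentence Inclusion Criteria: one translation contains the other."""
    return candidate in reference or reference in candidate

def sentence_similar(candidate: str, reference: str,
                     threshold: float = 0.7, strategy: str = 'mean') -> bool:
    """Sentence Similarity Calculation Method; thresholds studied: 0.7, 0.75, 0.8."""
    va, vb = embed(candidate, strategy), embed(reference, strategy)
    return torch.nn.functional.cosine_similarity(va, vb, dim=0).item() >= threshold
```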

Possibly because of differences between Chinese and Japanese, some Chinese words have no Japanese counterpart, and only 15% of the words could be matched to a Japanese translation in Wiktionary. Because Wiktionary, our evaluation standard, covers so few of the words, the overall proportion of translations judged correct is correspondingly low. We analyze translation accuracy separately for words (詞), idioms (成語), and common sayings (俗語); accuracy is highest for words, but even there it does not exceed 35%. A likely cause is that many machine translations are in fact correct but simply absent from Wiktionary, which depresses the measured accuracy. To evaluate the accuracy of translations for words not covered by Wiktionary, we also apply Cross-Merged Evaluation, which promotes trustworthy machine translations into a new standard for comparison, extending the evaluation to words Wiktionary does not include. Although the expanded standard covers only a modest number of additional words, it increases the word count of the original standard by 35%.
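Cross-Merged Evaluation is not defined in this abstract; one plausible reading, sketched below with hypothetical names, is that a machine translation is promoted into the expanded standard only when independent translation routes agree on it.

```python
def cross_merged_standard(zh_ja: dict, en_ja: dict, wiki_ja: dict) -> dict:
    """Build an expanded standard from translations that independent routes agree on.

    zh_ja:   Googletrans, Chinese -> Japanese
    en_ja:   Googletrans, English gloss -> Japanese
    wiki_ja: Wikipedia interlanguage links
    """
    standard = {}
    for word in set(zh_ja) & set(en_ja):
        if zh_ja[word] == en_ja[word]:
            # The two Googletrans routes agree; treat the result as reliable.
            standard[word] = zh_ja[word]
        elif wiki_ja.get(word) in (zh_ja[word], en_ja[word]):
            # Otherwise accept a translation confirmed by the Wikipedia crawl.
            standard[word] = wiki_ja[word]
    return standard
```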

In summary, because Wiktionary, used here as the standard, covers too few words, the accuracy of the machine translations could not be fully assessed. Nevertheless, these findings remain meaningful for building foundational multilingual dictionaries.

Table of Contents:

Chinese Abstract  i
Abstract  iii
Acknowledgements  v
Contents  vi
List of Tables  viii
List of Figures  x
Nomenclature  xii
1 Introduction  1
  1.1 Background  1
  1.2 Thesis Framework  2
2 Related Work  4
  2.1 Format of Wikipedia and Wiktionary  4
  2.2 Sentence-BERT  4
  2.3 rinna/japanese-roberta-base model  6
3 Methods  7
  3.1 Data Collection  7
    3.1.1 Automated Translation  7
    3.1.2 Web Crawling from Wikipedia  9
  3.2 Method of Establishing Standards  10
    3.2.1 Web Crawling from Wiktionary  11
  3.3 Method of Analysis  13
    3.3.1 Sentence Equivalence Assessment  14
    3.3.2 Sentence Inclusion Criteria  14
    3.3.3 Sentence Similarity Calculation Method  16
4 Results  18
  4.1 Analysis Using Wiktionary as a Benchmark  19
  4.2 Results of Merged Dictionary  24
  4.3 Expanding the Dictionary  27
5 Discussion  37
6 Conclusions and Future Work  45
  6.1 Conclusions  45
  6.2 Future Work  46
Bibliography  47

