
Author: 林浚弘
Lin, Jiun-Hung
Thesis title: 利用搜尋結果為基礎之多階段未知術語翻譯擷取方法
A Multi-Stage Translation Extraction Method for Unknown Terms Using Web Search Results
Advisor: 盧文祥
Lu, Wen-Shiang
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2006
Academic year of graduation: 94 (ROC calendar)
Language: Chinese
Number of pages: 76
Chinese keywords: 跨語資訊檢索, 術語翻譯, 未知術語
English keywords: cross-language information retrieval, unknown term, term translation
  • Previous work on unknown-term translation has proposed search-result-based methods that use Web mining to extract translations of unknown query terms from the rich multilingual resources of the World Wide Web. However, these methods often suffer from data sparseness and indirect association errors when extracting translations of infrequent unknown terms, so their translation performance on such terms is poor. This thesis therefore presents a multi-stage translation extraction method that exploits the linguistic characteristics of unknown terms, classifying them and applying a multi-stage approach to handle infrequent unknown terms more precisely. The main results are summarized as follows:
    We propose an Improved Web-based Term Translation Extraction Model that improves on the term translation extraction method of Cheng et al. (2004). In our experiments on unknown terms, the improved model raises the top-1 translation inclusion rate by about 15 percentage points (from 36% to 51%) for English-to-Chinese (E-C) translation and by about 14 percentage points (from 28% to 42%) for Chinese-to-English (C-E) translation.
    We first propose to classify unknown terms automatically according to their linguistic features and then to apply a multi-stage translation extraction method suited to each type. For example, for unknown transliterated terms and abbreviated terms, we present a Two-Stage Hybrid Transliteration Extraction Method and a Search-result-based Abbreviation Translation Extraction Method, respectively.
    To further mitigate the data sparseness and indirect association errors that frequently arise when extracting translations of infrequent unknown terms, we present an improved method for extracting second-round search results. These second-round results carry clearer information and more correct translation pairs, which improves the translation of infrequent unknown terms. Our experiments show that this method substantially improves the translation performance for unknown terms.
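    The top-n translation inclusion rate reported above can be illustrated with a minimal sketch. It assumes the usual reading of the metric (the fraction of test terms for which at least one correct translation appears among the first n extracted candidates); the record itself does not define it. The function name, the toy terms, and the candidate lists are all hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Set


def top_n_inclusion_rate(
    candidates: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    n: int = 1,
) -> float:
    """Fraction of terms with a correct translation among the top n candidates.

    Assumed definition: a term counts as a hit when any of its first n
    ranked candidates is in that term's set of accepted translations.
    """
    if not gold:
        return 0.0
    hits = 0
    for term, correct in gold.items():
        ranked = candidates.get(term, [])  # terms with no output count as misses
        if any(c in correct for c in ranked[:n]):
            hits += 1
    return hits / len(gold)


# Hypothetical toy data: ranked candidate translations per unknown term.
candidates = {
    "NBA": ["美國職籃", "籃球"],
    "Yahoo": ["奇摩", "雅虎"],
    "SARS": ["肺炎", "疫情"],
    "Linux": [],
}
# Hypothetical accepted translations per term.
gold = {
    "NBA": {"美國職籃"},
    "Yahoo": {"雅虎"},
    "SARS": {"嚴重急性呼吸道症候群"},
    "Linux": {"Linux 作業系統"},
}

print(top_n_inclusion_rate(candidates, gold, n=1))
print(top_n_inclusion_rate(candidates, gold, n=2))
```

    Under this reading, an increase from 36% to 51% means that 15 more terms out of every 100 test terms have a correct translation ranked first.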

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
    1.1 Research Motivation
    1.2 Research Issues
    1.2.1 The Unknown Term Translation Problem in Cross-Language Information Retrieval
    1.2.2 Data Sparseness and Indirect Association Errors
    1.2.3 The Transliterated Term Translation Problem
    1.3 Overview of the Approach
    1.3.1 Processing Flow of the Multi-Stage Unknown Term Translation Extraction Method
    1.3.2 Improved Term Translation Extraction Model
    1.3.3 Type-Specific Translation Extraction Methods
    1.3.4 Second-Round Search Result Extraction Method
    1.4 Thesis Organization
    Chapter 2 Literature Review and Related Work
    2.1 Related Work on Extracting Unknown Word Translations with Web Mining
    2.2 Related Work on Abbreviation Expansion
    2.3 Related Work on Transliterated Term Translation
    Chapter 3 Methods
    3.1 Multi-Stage Unknown Term Translation Processing Flow
    3.2 Unknown Term Classifier
    3.3 Improved Term Translation Extraction Model
    3.4 Transliterated Term Translation Method
    3.4.1 Two-Stage Hybrid Transliteration Extraction Method
    3.4.2 Hybrid Syllable-Mapping Transliteration Model
    3.5 Abbreviation Translation Method
    3.5.1 English Abbreviation Expansion Method
    3.5.2 Chinese Abbreviation Expansion Method
    3.6 Mixed-Term Translation Method
    3.7 Miscellaneous-Term Translation Method
    3.8 Extracting Second-Round Search Results
    3.8.1 The Second-Round Search Result Extraction Method of Huang et al.
    3.8.2 The Second-Round Search Result Extraction Method Proposed in This Thesis
    Chapter 4 Experimental Results and Analysis
    4.1 Evaluation of Transliterated Term Translation
    4.2 Evaluation of Abbreviation Translation
    4.3 Evaluation of Mixed-Term Translation
    4.4 Evaluation of Miscellaneous-Term Translation
    4.5 Evaluation of Overall Classification and Translation
    Chapter 5 Conclusions and Future Work
    5.1 Conclusions
    5.2 Future Work
    References

    P. F. Brown, S. A. Della Pietra, V. J. Della Pietra and R. L. Mercer. 1993.
    The Mathematics of Machine Translation. Computational Linguistics, 19(2):
    263-312.
    L. A. Ballesteros and W. B. Croft. 1998. Resolving Ambiguity for Cross-
    Language Retrieval, Proceedings of the 21st Annual International ACM
    SIGIR Conference, 64-71.
    Y. B. Cao and H. Li. 2002. Base noun phrase translation using Web data and
    the EM algorithm. In Proc. of COLING 2002: 127-133.
    J. S. Chang and Y. T. Lai. 2004. “A Preliminary Study on Probabilistic
    Models for Chinese Abbreviations.” Proceedings of the Third SIGHAN
    Workshop, ACL 2004.
    J. S. Chang and W. L. Teng. 2006. “Mining Atomic Chinese Abbreviation Pairs:
    A Probabilistic Model for Single Character Word Recovery.” Proceedings of
    the Fifth SIGHAN Workshop, COLING-ACL 2006.
    P. J. Cheng, J. W. Teng, R. C. Chen, J. H. Wang, W. H. Lu and L. F. Chien.
    2004. Translating unknown queries with web corpora for cross-language
    information retrieval. In Proc. of SIGIR 2004: 146-153.
    L. F. Chien. 1997. PAT-tree-based keyword extraction for Chinese information
    retrieval. In Proceedings of the ACM SIGIR’97 Conference (Philadelphia,
    PA), 50-58.
    R. Cooley, B. Mobasher and J. Srivastava. 1997. Web Mining: Information and
    Pattern Discovery on the World Wide Web. Proceedings of the 9th IEEE
    International Conference on Tools with Artificial Intelligence (ICTAI'97).
    M. W. Davis and W. C. Ogden. 1998. Free Resources and Advanced Alignment for
    Cross-Language Text Retrieval. In Proc. of the Sixth Text Retrieval
    Conference (TREC6): 385-394.
    P. Fung and L. Y. Yee. 1998. An IR approach for translating new words from
    nonparallel, comparable texts. In Proc. of ACL 1998: 414-420.
    W. Gao, K. F. Wong and W. Lam. 2004. Phoneme-based Transliteration of Foreign
    Name for OOV Problem. In Proc. of IJCNLP 2004: 274-381.
    F. Huang, Y. Zhang and S. Vogel. 2005. Mining Key Phrase Translations from Web
    Corpora. In Proc. of HLT-EMNLP 2005.
    A. Kilgarriff and G. Grefenstette. 2003. Introduction to the special issue on
    the web as corpus. Computational Linguistics, 29(3): 333-348.
    K. Knight and J. Graehl. 1998. Machine Transliteration, Computational
    Linguistics 24(4): 599-612.
    R. Kosala and H. Blockeel. 2000. Web Mining Research: A Survey. ACM SIGKDD
    Explorations, 2(1): 1-15.
    W. Lam, R. Huang, P. S. Cheung. 2004. Learning phonetic similarity for
    matching named entity translations and mining new translations. In Proc. of
    SIGIR 2004: 281-288.
    V. Lavrenko, M. Choquette and W. B. Croft. 2002. Cross-Lingual Relevance
    Models. Proceedings of the 25th Annual International ACM SIGIR Conference,
    175-182.
    L. S. Larkey, P. Ogilvie, M. A. Price and B. Tamilio. 2000. Acrophile: An
    Automated Acronym Extractor and Server. In Proc. of the ACM Digital
    Libraries Conference: 205-214.
    H. Li, M. Zhang and J. Su. 2004. A Joint Source-Channel Model for Machine
    Transliteration. In Proc. of ACL 2004: 160-167.
    J. H. Lin, M. S. Shia, K. H. Lin, S. J. Lin, S. Yu and W. H. Lu. 2005.
    Search-Result-Based Method for Unknown Term Translation in Cross-Language
    Information Retrieval. In Proceedings of the NTCIR5 Workshop.
    T. Lin, C. C. Wu and J. S. Chang. 2003. Word-Transliteration Alignment. In
    Proc. of ROCLING XV, 1-16.
    W. H. Lin and H. H. Chen. 2002. Backward machine transliteration by learning
    phonetic similarity. In Proc. of CONLL 2002: 139-145.
    W. H. Lu, L. F. Chien and H. J. Lee. 2002. Translation of Web Queries using
    Anchor Text Mining. ACM Transactions on Asian Language Information
    Processing, 1(2): 159-172.
    W. Y. Ma and K. J. Chen. 2003. A Bottom-up Merging Algorithm for Chinese
    Unknown Word Extraction, In Proc. of ACL workshop on Chinese Language
    Processing 2003: 31-38.
    I. D. Melamed. 2000. Models of translational equivalence among words.
    Computational Linguistics, 26(2):221-249.
    H. Meng, W. K. Lo, B. Chen and K. Tang. 2001. Generate Phonetic Cognates to
    Handle Name Entities in English-Chinese Cross-Language Spoken Document
    Retrieval, In Proc. of ASRU 2001.
    J. Y. Nie, P. Isabelle, M. Simard and R. Durand. 1999. Cross-language
    Information Retrieval Based on Parallel Texts and Automatic Mining of
    Parallel Texts from the Web. In Proc. of ACM-SIGIR’99, 74-81.
    Y. Park and R. J. Byrd. 2001. Hybrid text mining for finding abbreviations and
    their definitions. In Proc. of EMNLP2001.
    R. Rapp. 1999. Automatic identification of word translations from unrelated
    English and German corpora, In Proc. of ACL 1999: 519-526.
    P. Resnik. 1999. Mining the Web for Bilingual Text, In Proceedings of the 37th
    Annual Meeting of the Association for Computational Linguistics.
    M. S. Shia, J. H. Lin, S. Yu, W. H. Lu. 2005. A Web-based Unsupervised
    Algorithm for Learning Transliteration Model to Improve Translation of Low-
    Frequency Proper Names. In Proc. of IEEE Natural Language Processing and
    Knowledge Engineering.
    F. Smadja, K. McKeown and V. Hatzivassiloglou. 1996. Translating collocations
    for bilingual lexicons: a statistical approach. Computational Linguistics,
    22(1): 1-38.
    K. Taghva and J. Gilbreth. 1995. Recognizing Acronyms and their Definitions.
    Technical Report 95-03, ISRI (Information Science Research Institute),
    UNLV, June, 1995.
    S. Wan and C. M. Verspoor. 1998. Automatic English-Chinese name
    transliteration for development of multilingual resources. In Proc. of ACL
    1998: 1352-1357.
    J. Xiao, J. Liu and T. S. Chua. 2002. Extracting pronunciation-translated
    names from Chinese texts using bootstrapping approach, the 1st SIGHAN
    workshop on Chinese Language Processing, Taipei, Taiwan, Aug 2002.
    C. C. Yang and K. W. Li. 2003. Automatic Construction of English/Chinese
    Parallel Corpora. Journal of the American Society for Information Science
    and Technology, 54(8): 730-742.
    Y. Zhang and P. Vines. 2004. Using the web for automated translation
    extraction in cross-language information retrieval. In Proc. of SIGIR 2004.

    Full-text availability: on campus, immediately; off campus, from 2006-07-19.