Graduate student: | 林浚弘 Lin, Jiun-Hung |
---|---|
Thesis title: | 利用搜尋結果為基礎之多階段未知術語翻譯擷取方法 A Multi-Stage Translation Extraction Method for Unknown Terms Using Web Search Results |
Advisor: | 盧文祥 Lu, Wen-Shiang |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
Year of publication: | 2006 |
Academic year of graduation: | 94 (ROC calendar) |
Language: | Chinese |
Number of pages: | 76 |
Chinese keywords: | 跨語資訊檢索、術語翻譯、未知術語 |
English keywords: | cross-language information retrieval, unknown term, term translation |
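In earlier work on unknown term translation, some researchers proposed Web mining techniques that exploit the rich multilingual resources of the World Wide Web to solve the unknown term translation problem. However, when extracting translations of low-frequency unknown terms, these methods usually suffer from data sparseness and indirect association errors, which degrades translation performance for such terms. This thesis therefore proposes a multi-stage translation extraction method for unknown terms. It exploits the natural-language characteristics of unknown terms and adopts a classification-based, multi-stage approach to handle the translation of low-frequency unknown terms at a finer granularity. The additional results of this research are summarized below:
This thesis proposes an Improved Web-based Term Translation Extraction Model that successfully improves on the term translation extraction method of Cheng et al. (2004). In our experiments, the improved model raises the top-1 translation inclusion rate for unknown terms by about 15 percentage points (36% to 51%) for English-to-Chinese (E-C) translation and by about 14 percentage points (28% to 42%) for Chinese-to-English (C-E) translation.
We are the first to propose automatically classifying unknown terms by their natural-language characteristics and applying a multi-stage translation extraction method according to the class. For example, for unknown transliterated terms and abbreviations, this thesis proposes a Two-Stage Hybrid Transliteration Extraction Method and a Search-result-based Abbreviation Translation Extraction Method, respectively.
To further address the data sparseness and indirect association errors that frequently arise when extracting translations of low-frequency unknown terms, this thesis proposes an improved second-round search-result extraction method. It uses second-round search results, which contain more specific information and more correct translation pairs, to improve the translation of low-frequency unknown terms. Experiments show that this method improves translation performance for unknown terms quite effectively.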
Recently, a few researchers have proposed several effective search-result-based term translation extraction methods to mine translations of unknown query terms from Web search results. However, these methods often suffer from data sparseness and indirect association errors when extracting translations of infrequent unknown terms. Therefore, in this paper we present a multi-stage translation extraction method to mitigate the problems of extracting translations of infrequent unknown terms. Some valuable results of this paper are presented as follows:
In this paper, we propose an improved Web-based term translation extraction model that improves on the translation performance of the earlier Web-based term translation extraction method proposed by Cheng et al. (2004). Compared with that method, our experimental results show that the improved model raises the top-1 translation inclusion rate by about 15 percentage points (36% to 51%) for English-to-Chinese (E-C) translation of unknown terms and by about 14 percentage points (28% to 42%) for Chinese-to-English (C-E) translation of unknown terms.
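The top-n translation inclusion rate reported above can be illustrated with a minimal sketch. It assumes the usual reading of the metric: the fraction of source terms whose n highest-ranked candidates contain at least one correct translation. The function name `top_n_inclusion_rate` and the toy candidate lists are hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Set


def top_n_inclusion_rate(
    ranked: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    n: int = 1,
) -> float:
    """Fraction of terms whose top-n candidates include a correct translation.

    ranked: source term -> candidate translations, best first (hypothetical input)
    gold:   source term -> set of acceptable translations (hypothetical input)
    """
    if not ranked:
        return 0.0
    hits = sum(
        1
        for term, candidates in ranked.items()
        if set(candidates[:n]) & gold.get(term, set())
    )
    return hits / len(ranked)


# Toy data, invented for illustration only.
ranked = {
    "Yahoo": ["雅虎", "奇摩"],
    "NBA": ["籃球", "美國職籃"],
    "Linux": ["作業系統"],
}
gold = {
    "Yahoo": {"雅虎"},
    "NBA": {"美國職籃"},
    "Linux": {"Linux"},
}

print(f"{top_n_inclusion_rate(ranked, gold, n=1):.2f}")  # 0.33
print(f"{top_n_inclusion_rate(ranked, gold, n=2):.2f}")  # 0.67
```

Under this reading, moving from 36% to 51% at n = 1 means that 15 more terms out of every 100 have a correct translation ranked first.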
We are the first to propose a multi-stage translation extraction method to solve the translation problem of unknown terms. Unknown terms are classified according to their linguistic features, and a different extraction method is applied to each type of term. For example, we present a two-stage hybrid transliteration extraction method and a search-result-based abbreviation translation extraction method to translate transliterated terms and abbreviated terms, respectively.
To further address the problems of data sparseness and indirect association errors in extracting translations of infrequent unknown terms, we present an improved extraction method that uses second-round search results, which may contain clearer information and more correct translation pairs, to improve the translation performance for infrequent unknown terms. Our experimental results show that this method effectively improves the translation performance for unknown terms.
P. F. Brown, S. A. D. Pietra, V. D. J. Pietra and R. L. Mercer. 1993. The
Mathematics of Machine Translation. Computational Linguistics, 19(2): 263-312.
L. A. Ballesteros and W. B. Croft. 1998. Resolving Ambiguity for Cross-
Language Retrieval, Proceedings of the 21st Annual International ACM
SIGIR Conference, 64-71.
Y. B. Cao and H. Li. 2002. Base noun phrase translation using Web data and
the EM algorithm. In Proc. of COLING 2002: 127-133.
J. S. Chang and Y. T. Lai. 2004. “A Preliminary Study on Probabilistic
Models for Chinese Abbreviations.” Proceedings of the Third SIGHAN
Workshop, ACL 2004.
J. S. Chang and W. L. Teng. 2006. “Mining Atomic Chinese Abbreviation Pairs:
A Probabilistic Model for Single Character Word Recovery.” Proceedings of
the Fifth SIGHAN Workshop, COLING-ACL 2006.
P. J. Cheng, J. W. Teng, R. C. Chen, J. H. Wang, W. H. Lu, L. F. Chien. 2004.
Translating unknown queries with web corpora for cross-language information
retrieval. In Proc. of SIGIR 2004: 146-153.
L. F. Chien. 1997. PAT-tree-based keyword extraction for Chinese information
retrieval. In Proceedings of the ACM SIGIR’97 Conference (Philadelphia,
PA), 50-58.
R. Cooley, B. Mobasher, J. Srivastava. 1997. Web Mining: Information and
Pattern Discovery on the World Wide Web, Proceedings of the 9th IEEE
International Conference on Tools with Artificial Intelligence (ICTAI'97).
M. W. Davis and W. C. Ogden. 1998. Free Resources and Advanced Alignment for
Cross-Language Text Retrieval. In Proc. of the Sixth Text Retrieval
Conference (TREC6): 385-394.
P. Fung and L. Y. Yee. 1998. An IR approach for translating new words from
nonparallel, comparable texts. In Proc. of ACL 1998: 414-420.
W. Gao, K. F. Wong and W. Lam. 2004. Phoneme-based Transliteration of Foreign
Name for OOV Problem. In Proc. of IJCNLP 2004: 274-381.
F. Huang, Y. Zhang and S. Vogel. 2005. Mining Key Phrase Translations from Web
Corpora. In Proc. of HLT-EMNLP 2005.
A. Kilgarriff and G. Grefenstette. 2003. Introduction to the special issue on
the web as corpus. Computational Linguistics 29(3): 333-348.
K. Knight and J. Graehl. 1998. Machine Transliteration, Computational
Linguistics 24(4): 599-612.
R. Kosala and H. Blockeel. 2000. Web Mining Research: A Survey, ACM SIGKDD
Explorations, 2(1), 1-15.
W. Lam, R. Huang, P. S. Cheung. 2004. Learning phonetic similarity for
matching named entity translations and mining new translations. In Proc. of
SIGIR 2004: 281-288.
V. Lavrenko, M. Choquette, W. B. Croft. 2002. Cross-Lingual Relevance
Models, Proceedings of the 25th Annual International ACM SIGIR Conference,
175-182.
L. S. Larkey, P. Ogilvie, A. Price, and B. Tamilio. 2000. Acrophile: An Automated
Acronym Extractor and Server. In Proc. of the ACM Digital Libraries
Conference: 205-214.
H. Li, M. Zhang and J. Su. 2004. A Joint Source-Channel Model for Machine
Transliteration. In Proc. of ACL 2004: 160-167.
J. H. Lin, M. S. Shia, K. H. Lin, S. J. Lin, S. Yu, W. H. Lu. 2005.
Search-Result-Based Method for Unknown Term Translation in Cross-Language
Information Retrieval. In Proceedings of the NTCIR5 Workshop.
T. Lin, C. C. Wu, J. S. Chang. 2003. Word-Transliteration Alignment, In Proc. of
ROCLING XV, 1-16.
W. H. Lin and H. H. Chen. 2002. Backward machine transliteration by learning
phonetic similarity. In Proc. of CONLL 2002: 139-145.
W. H. Lu, L. F. Chien, H. J. Lee. 2002. Translation of Web Queries using
Anchor Text Mining, ACM Transactions on Asian Language Information
Processing, 1(2), 159-172.
W. Y. Ma and K. J. Chen. 2003. A Bottom-up Merging Algorithm for Chinese
Unknown Word Extraction, In Proc. of ACL workshop on Chinese Language
Processing 2003: 31-38.
I. D. Melamed. 2000. Models of translational equivalence among words.
Computational Linguistics, 26(2):221-249.
H. Meng, W. K. Lo, B. Chen and K. Tang. 2001. Generate Phonetic Cognates to
Handle Name Entities in English-Chinese Cross-Language Spoken Document
Retrieval, In Proc. of ASRU 2001.
J. Y. Nie, P. Isabelle, M. Simard, and R. Durand. 1999. Cross-language
Information Retrieval Based on Parallel Texts and Automatic Mining of
Parallel Texts from the Web, In Proc. of ACM-SIGIR’99, 74-81.
Y. Park and R. J. Byrd. 2001. Hybrid text mining for finding abbreviations and
their definitions. In Proc. of EMNLP 2001.
R. Rapp. 1999. Automatic identification of word translations from unrelated
English and German corpora, In Proc. of ACL 1999: 519-526.
P. Resnik. 1999. Mining the Web for Bilingual Text, In Proceedings of the 37th
Annual Meeting of the Association for Computational Linguistics.
M. S. Shia, J. H. Lin, S. Yu, W. H. Lu. 2005. A Web-based Unsupervised
Algorithm for Learning Transliteration Model to Improve Translation of Low-
Frequency Proper Names. In Proc. of IEEE Natural Language Processing and
Knowledge Engineering.
F. Smadja, K. McKeown, and V. Hatzivassiloglou. 1996. Translating collocations
for bilingual lexicons: a statistical approach. Computational Linguistics,
22(1):1-38.
K. Taghva and J. Gilbreth. 1995. Recognizing Acronyms and their Definitions.
Technical Report 95-03, ISRI (Information Science Research Institute),
UNLV, June, 1995.
S. Wan and C. M. Verspoor. 1998. Automatic English-Chinese name
transliteration for development of multilingual resources. In Proc. of ACL
1998: 1352-1357.
J. Xiao, J. Liu and T. S. Chua. 2002. Extracting pronunciation-translated
names from Chinese texts using bootstrapping approach, the 1st SIGHAN
workshop on Chinese Language Processing, Taipei, Taiwan, Aug 2002.
C. C. Yang and K. W. Li. 2003. Automatic Construction of English/Chinese
Parallel Corpora, Journal of the American society for Information Science
and Technology, 54(8), 730-742.
Y. Zhang and P. Vines. 2004. Using the web for automated translation
extraction in cross-language information retrieval. In Proc. of SIGIR 2004.