| Graduate student: | 丁鈺純 Ting, Yu-Chun |
|---|---|
| Thesis title: | 英中跨語言資訊檢索系統的建立 The Establishment of English-Chinese Cross-language Information Retrieval System |
| Advisor: | 王惠嘉 Wang, Hei-Chia |
| Degree: | 碩士 Master |
| Department: | 管理學院 - 資訊管理研究所 Institute of Information Management |
| Year of publication: | 2013 |
| Academic year of graduation: | 101 |
| Language: | Chinese |
| Pages: | 52 |
| Chinese keywords: | 跨語言資訊檢索 、查詢翻譯 、查詢擴張 、相關回饋 |
| English keywords: | Cross-language information retrieval (CLIR), Query translation, Query expansion, Relevance feedback (RF) |
In recent years, fast information flow and convenient information sharing have led to information overload on the Web, so obtaining the knowledge users need from large amounts of data has become important. Information retrieval (IR) systems help users find documents written in the same language as the query terms, but they perform poorly when searching cross-language documents. In today's globalized environment, users who want to understand terms or documents in a non-native language often need to search across languages for related information in their native language to overcome the reading barrier. It is therefore necessary to build cross-language information retrieval (CLIR) systems that help users find relevant cross-language documents.
Existing research indicates that query translation and query expansion can improve the retrieval accuracy of CLIR systems. However, the ambiguity of query terms and the growing number of out-of-vocabulary (OOV) terms easily lead to translation errors. Cheng et al. (2004) applied Web resources to translate query terms; the approach performed well on OOV terms but less well on general terms. This study therefore uses a bilingual corpus together with Google search results and Wikipedia to extract correct translations of query terms, reducing term ambiguity and obtaining translations of OOV terms. In addition, it uses Google search results and Wikipedia to obtain terms related to the source query as expansion terms, in order to improve CLIR retrieval performance. The method was tested on the NTCIR-8 document collection, and the results show that it effectively improves retrieval accuracy.
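
As a rough illustration of the query expansion idea mentioned in the abstract, the following minimal sketch picks expansion terms by counting how often a candidate word co-occurs with a query term in a set of text snippets. It is not the thesis's actual method: the function names `tokenize` and `expand_query`, the stopword list, and the hard-coded snippets (standing in for search results or Wikipedia text) are all assumptions made for the example.

```python
from collections import Counter
import re

# Tiny illustrative stopword list (an assumption, not from the thesis).
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_query(query, snippets, k=3):
    """Return the query terms followed by up to k expansion terms.

    A candidate's score is the number of snippets that contain both the
    candidate and at least one query term (a simple co-occurrence count).
    """
    query_terms = tokenize(query)
    query_set = set(query_terms)
    scores = Counter()
    for snippet in snippets:
        tokens = set(tokenize(snippet))
        if not tokens & query_set:
            continue  # snippet shares no term with the query, so skip it
        for tok in tokens - query_set - STOPWORDS:
            scores[tok] += 1
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return query_terms + [term for term, _ in ranked[:k]]


# Hypothetical snippets standing in for retrieved text.
snippets = [
    "Cross-language retrieval needs a bilingual dictionary for query terms",
    "Query translation with a bilingual dictionary",
    "Weather forecast for Tainan",
    "Bilingual corpus for query translation",
]
expanded = expand_query("query translation", snippets, k=2)
print(expanded)
```

With these sample snippets, the two terms that co-occur most often with the query are appended to it. A real system would replace the raw count with a weighting scheme and draw snippets from an actual retrieval step.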
Baeza-Yates, Ricardo A., & Ribeiro-Neto, Berthier. (1999). Modern Information Retrieval. Addison-Wesley Longman Publishing Company, Inc.
Ballesteros Lisa, & Croft W. Bruce. (1997). Phrasal translation and query expansion techniques for cross-language information retrieval. Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval, 31(SI), 84-91.
Ballesteros Lisa, & Croft W. Bruce. (1998). Resolving ambiguity for cross-language retrieval. Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, 64-71.
Buckley Chris, Salton Gerard, & Allan James. (1994). The effect of adding relevance information in a relevance feedback environment. Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, 292-300.
Cao Guihong, Nie Jian-Yun, Gao Jianfeng, & Robertson Stephen. (2008). Selecting good expansion terms for pseudo-relevance feedback. Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 243-250.
Chen Kuang-hua, & Chen Hsin-Hsi. (1994). A part-of-speech-based alignment algorithm. Proceedings of the 15th conference on Computational linguistics, 166-171.
Chen Aitao, & Gey Fredric C. (2003). Experiments on cross-language and patent retrieval at NTCIR-3 workshop. Proceedings of the third NTCIR workshop on research in information retrieval, automatic text summarization and question answering.
Cheng Chen-Hsin, Shue Reuy-Jye, Lee Hung-Lin, Hsieh Shu-Yu, Yeh Guann-Cyun, & Bian Guo-Wei. (2007). ANLIP at NTCIR-6: Evaluations for multilingual and cross-lingual information retrieval. Proceedings of NTCIR-6 workshop meeting, 60-65.
Cheng Pu-Jen, Teng Jei-Wen, Chen Ruei-Cheng, Wang Jenq-Haur, Lu Wen-Hsiang, & Chien Lee-Feng. (2004). Translating unknown queries with web corpora for cross-language information retrieval. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, 146-153.
Chirita Paul-Alexandru, Firan Claudiu S., & Nejdl Wolfgang. (2007). Personalized query expansion for the web. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 7-14.
Crouch Carolyn J., & Yang Bokyung. (1992). Experiments in automatic statistical thesaurus construction. Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, 77-88.
Cui Hang, Wen Ji-Rong, Nie Jian-Yun, & Ma Wei-Ying. (2003). Query Expansion by Mining User Logs. IEEE Transactions on Knowledge and Data Engineering, 15(4), 829-839.
Davis M. (1997). New experiments in cross-language text retrieval at NMSU's computing research lab. Proceedings of the 5th Text Retrieval Conference, 447-454.
Fan Weiguo, Luo Ming, Wang Li, Xi Wensi, & Fox Edward A. (2004). Tuning before feedback: combining ranking discovery and blind feedback for robust retrieval. Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, 138-145.
Fonseca Bruno M., Golgher Paulo, Pôssas Bruno, Ribeiro-Neto, Berthier, & Ziviani Nivio. (2005). Concept-based interactive query expansion. Proceedings of the 14th ACM international conference on Information and knowledge management, 696-703.
Gonzalo Julio, Verdejo Felisa, Peters Carol, & Calzolari Nicoletta. (1998). Applying EuroWordNet to cross-language text retrieval. Computers and the Humanities, 185-207.
Harman Donna. (1992). Relevance feedback revisited. Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, 1-10.
He Ben, & Ounis Iadh. (2007). Combining fields for query expansion and adaptive query expansion. Information Processing and Management, 43(5), 1294-1307.
He Daqing, & Wu Dan. (2011). Enhancing query translation with relevance feedback in translingual information retrieval. Information Processing and Management, 47(1), 1-17.
Hull David A., & Grefenstette Gregory. (1996). Querying across languages: a dictionary-based approach to multilingual information retrieval. Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, 49-57.
Jimeno-Yepes Antonio, Berlanga-Llavori Rafael, & Rebholz-Schuhmann Dietrich. (2010). Ontology refinement for improved information retrieval. Information Processing and Management, 46(4), 426-435.
Jones Gareth, Sakai Tetsuya, Collier Nigel, Kumano Akira, & Sumita Kazuo. (1999). A comparison of query translation methods for English-Japanese cross-language information retrieval (poster abstract). Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, 269-270.
Kishida Kazuaki. (2005). Technical issues of cross-language information retrieval: a review. Information Processing and Management, 41(3), 433-455.
Kwok, K. L., & Chan, M. (1998). Improving two-stage ad-hoc retrieval for short queries. Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, 250-256.
Levow Gina-Anne, Oard Douglas W., & Resnik Philip. (2005). Dictionary-based techniques for cross-language information retrieval. Information Processing and Management, 41(3), 523-547.
Li Qing, Chen Yuanzhu Peter, Myaeng Sung-Hyon, Jin Yun, & Kang Bo-Yeong. (2009). Concept unification of terms in different languages via web mining for Information Retrieval. Information Processing and Management, 45(2), 246-262.
Lu Wen-Hsiang, Lin Ray S., Chan Yi-Che, & Chen Kuan-Hsi. (2008). Using Web resources to construct multilingual medical thesaurus for cross-language medical information retrieval. Decision Support Systems, 45(3), 585-595.
Mihalcea Rada. (2007). Using Wikipedia for Automatic Word Sense Disambiguation. Proceedings of North American Chapter of the Association for Computational Linguistics (NAACL 2007), 196-203.
Efthimiadis Efthimis N., & Biron Paul V. (1994). UCLA-Okapi at TREC-2: Query Expansion Experiments. Proceedings of the Second Text Retrieval Conference, 500-215.
Na Seung-Hoon, & Ng Hwee Tou. (2011). Enriching document representation via translation for improved monolingual information retrieval. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, 853-862.
Nie Jian-yun, & Jin Fuman. (2002). Integrating Logical Operators in Query Expansion in Vector Space Model. Proceedings of ACM SIGIR-2002 Workshop on Mathematical and Formal Methods in Information Retrieval.
Robertson Stephen E., & Jones Karen Sparck. (1988). Relevance weighting of search terms. Document retrieval systems, 143-160.
Robertson Stephen E., Walker Steve, Jones Susan, Hancock-Beaulieu Micheline, & Gatford Mike. (1995). Okapi at TREC-3. Proceedings of the Third Text REtrieval Conference (TREC 1994).
Salton G., Wong A., & Yang C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613-620.
Salton Gerard. (1969). Automatic processing of foreign language documents. Proceedings of the 1969 conference on Computational linguistics, 1-28.
Salton Gerard, & Buckley Chris. (1997). Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41(4), 288-297.
Seo Hee-Cheol, Kim Sang-Bum, Rim Hae-Chang, & Myaeng Sung-Hyon. (2005). Improving query translation in English-Korean cross-language information retrieval. Information Processing and Management, 41(3), 507-522.
Sheridan Páraic, & Ballerini Jean Paul. (1996). Experiments in multilingual information retrieval using the SPIDER system. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 58-65.
Sheridan Paraic, Braschler Martin, & Schäuble Peter. (1997). Cross-Language Information Retrieval in a Multilingual Legal Domain. Proceedings of the First European Conference on Research and Advanced Technology for Digital Libraries, 253-268.
Wei Jie, Bressan Stéphane, & Ooi Beng Chin. (2000). Mining Term Association Rules for Automatic Global Query Expansion: Methodology and Preliminary Results. Proceedings of First International Conference on Web Information Systems Engineering, Hong Kong, China, 366-373.
Xu Jinxi, & Weischedel Ralph. (2001). TREC-9 cross-lingual retrieval at BBN. Proceedings of the Ninth Text Retrieval Conference(TREC-9), 106-116.
Zhang Ying, Vines Phil, & Zobel Justin. (2005). Chinese OOV translation and post-translation query expansion in Chinese-English cross-lingual information retrieval. ACM Transactions on Asian Language Information Processing (TALIP), 4(2), 57-77.
杜浩. (2005). 基于雙語對齊語料--英漢詞典的自動生成.
林典鍵. (2008). 利用維基百科連結作資訊檢索查詢擴展.
韓詠, 孔蕾蕾, & 齊浩亮. (2009). 科技論文原創性檢查系統的研究. 第五屆全國信息檢索學術會議論文集.
孫萌, 梁穎紅, 葛運東, 顏振祥, & 姚建民. (2010). 基於平行語料庫和網絡的未登錄詞譯文挖掘. [Study on OOV Translation Mining from Parallel Corpora and the Web]. 江南大學學報(自然科學版), 66-70.
孫瑛澤, 陳建良, 劉峻杰, 劉昭麟, & 蘇豐文. (2010). 中文短句之情緒分類. Proceedings of 22nd Conference on Computational Linguistics and Speech Processing, 184-198.
陳信希. (2002). 跨語言資訊檢索:理論、技術與應用. Journal of Library and Information Science, 28(1), 19-32.