| Graduate student: | 詹易哲 Chan, Yi-che |
|---|---|
| Thesis title: | 利用Google的二階段斷詞方法以及從維基百科學習辨別問題分類之多語問答系統 Google-Based Two-Stage Text Segmentation and Learning Question Type Identification from Wikipedia for a Multilingual QA System |
| Advisor: | 盧文祥 Lu, Wen-Hsiang |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering |
| Year of publication: | 2007 |
| Academic year of graduation: | 95 |
| Language: | English |
| Number of pages: | 52 |
| Keywords (Chinese): | 問題分類、維基百科、多語問答系統、多語斷詞方法 |
| Keywords (English): | Multilingual Question Answering System, Multilingual Text Segmentation, Wikipedia, Question Classification |
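Following the rise of search engines, question answering (QA) systems are regarded as the next important application of Web resources. A QA system accepts a natural language question from the user and extracts the most appropriate answer from a target document collection. As Web documents in various languages continue to grow, how to build an effective multilingual question answering system has become an important research topic.
This thesis uses Google search results, combined with several language-independent segmentation methods, to implement a two-stage multilingual text segmentation tool, and learns question classification semi-automatically from Wikipedia category resources. These are then combined with three further components (keyterm extraction, document retrieval, and answer extraction) to implement a multilingual QA system.
The first stage of multilingual segmentation uses Google search results to split the question into minimal chunks; the second stage combines these chunks into meaningful words or phrases. The second-stage algorithm combines four segmentation features: probability of n-gram sequence, branching entropy, significance estimation, and compound scores.
For question classification, this thesis uses automatically extracted question words to generate basic classification rules, and then learns finer-grained question types semi-automatically from Wikipedia category resources. These question types are of great help to answer extraction. In addition, keyterm extraction uses features drawn from Google search results and Wikipedia article titles. In answer extraction, candidate answers are ranked by their distance to the keyterms, candidates of the wrong type are filtered out using the question classification result, and the best remaining answer is returned to the user.
The experimental results show that the proposed segmentation method effectively extracts meaningful words, enabling the system to extract correct keyterms and, in turn, to find documents likely to contain the correct answer. The question classification results help improve the efficiency of answer extraction and contribute substantially to the accuracy of the QA system.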
After search engines became popular, question answering (QA) systems came to be regarded as an important application of Web resources. A QA system accepts natural language queries, searches a target document set, and returns the exact answer to the user. With the increasing amount of Web resources in various languages, how to implement a multilingual question answering system has become an important research issue.
We present two methods: Google-based two-stage segmentation and learning question types from Wikipedia. Building on these methods, we complete a multilingual QA system by implementing the remaining components: keyterm extraction, document retrieval, and answer extraction.
In the first stage of Google-based two-stage segmentation, we use Google search results to segment natural language questions into chunks. In the second stage, we introduce an algorithm that combines the chunks into significant words or phrases. The algorithm combines four segmentation features: probability of n-gram sequence, branching entropy, significance estimation, and compound scores.
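To make one of the four features concrete, the following is a minimal sketch of branching entropy, assuming a plain string stands in for the text that would otherwise come from search results. The function name `branching_entropy` and the toy corpus are hypothetical and are not taken from the thesis; the sketch only illustrates the general idea that a context followed by many different characters (high entropy) is a likely word boundary.

```python
import math
from collections import Counter


def branching_entropy(corpus: str, context: str) -> float:
    """Entropy (in bits) of the character that follows `context` in `corpus`.

    Hypothetical helper for illustration: a real system would gather
    follower counts from a large corpus or from search-result snippets.
    """
    followers = Counter()
    start = corpus.find(context)
    while start != -1:
        nxt = start + len(context)
        if nxt < len(corpus):
            followers[corpus[nxt]] += 1
        # Advance by one so overlapping occurrences are also counted.
        start = corpus.find(context, start + 1)

    total = sum(followers.values())
    if total == 0:
        return 0.0  # context never seen with a following character
    return -sum((n / total) * math.log2(n / total) for n in followers.values())


toy_corpus = "abcabdabe"  # assumption: toy data only
print(branching_entropy(toy_corpus, "a"))   # "a" is always followed by "b"
print(branching_entropy(toy_corpus, "ab"))  # "ab" is followed by c, d, or e
```

In this toy corpus the entropy after "a" is zero, while the entropy after "ab" is higher because three different characters follow it, which is the kind of rise a segmenter could treat as evidence of a boundary.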
In question classification, we automatically extract the question words from the questions and construct some basic classification rules. In addition, we use Wikipedia categories to learn finer question types semi-automatically. These question types assist answer extraction. We then propose general methods for extracting keyterms, using Google search results and Wikipedia resources as scoring features. From the retrieved snippets, we extract answer candidates and use a distance model to rank them. Answer candidates whose types do not match the question type are filtered out, and the top-ranked remaining candidate is returned as the final answer.
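The ranking and filtering step described above can be sketched as follows. This is an illustration under stated assumptions, not the thesis's actual distance model: the score used here is simply the average token distance between a candidate and the keyterm occurrences in one snippet, and the function name `rank_candidates`, the type labels, and the sample snippet are all hypothetical.

```python
def rank_candidates(tokens, keyterms, candidates, expected_type):
    """Rank answer candidates by closeness to keyterms, then filter by type.

    tokens:        list of snippet tokens
    keyterms:      set of keyterm strings extracted from the question
    candidates:    list of (token_index, text, answer_type) tuples
    expected_type: answer type predicted by question classification
    """
    key_positions = [i for i, tok in enumerate(tokens) if tok in keyterms]

    scored = []
    for pos, text, cand_type in candidates:
        if key_positions:
            # Assumed score: mean absolute token distance to every keyterm.
            score = sum(abs(pos - k) for k in key_positions) / len(key_positions)
        else:
            score = float("inf")  # no keyterm in the snippet to measure against
        scored.append((score, text, cand_type))

    scored.sort(key=lambda item: item[0])  # smaller distance ranks higher
    # Drop candidates whose type does not match the question type.
    return [text for _, text, cand_type in scored if cand_type == expected_type]


# Toy snippet (invented for illustration only).
snippet = "the bridge opened in 1932 and was repainted in 1990 by Smith".split()
found = [(4, "1932", "DATE"), (9, "1990", "DATE"), (11, "Smith", "PERSON")]
print(rank_candidates(snippet, {"bridge", "opened"}, found, "DATE"))
# → ['1932', '1990']
```

The first element of the returned list plays the role of the final answer; the candidate of type PERSON is removed because the hypothetical question asks for a date.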
The experimental results show that the proposed segmentation method extracts significant words and phrases effectively. It also helps the system extract accurate keyterms and find related documents that contain the correct answer. The results of question classification improve the efficiency of answer extraction and raise the overall accuracy.