
Graduate student: Su, Po-Huai (蘇柏淮)
Thesis title: Fast QA Pair Matching Algorithm Based on Sentence Similarity for Spoken Dialogue System (基於快速句子相似度匹配演算法之QA文本對話系統)
Advisor: Wang, Jhing-Fa (王駿發)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2016
Academic year of graduation: 104 (ROC calendar)
Language: English
Number of pages: 54
Keywords: Information retrieval, Sentence similarity, Question answering system, Information extraction, Natural language processing
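  • Chinese abstract (English translation): This study proposes a dialogue system that combines fast text matching with natural language generation based on big data. The system performs symbolic sentence analysis on the dialogue utterance and retrieves the corresponding answer; when the database is insufficient, it generates a natural-language answer from Google Big Data. After ASR converts the input speech to text, the text is segmented by the CKIP word segmentation system and then vectorized with a pre-trained bag of words. Because the texts in the database have already been vectorized and placed in a vector space model, the system only needs to compute the cosine similarity between the vectorized input sentence and the sentence vectors in the model, and it selects as the reply the sentence whose similarity exceeds a threshold and is the highest. To improve retrieval accuracy, this thesis modifies the TF-IDF weighting method to account for the fact that most term frequencies in short sentences equal one. Finally, the retrieved reply is output to the user through text-to-speech. When the system database cannot answer the user's input, Google Big Data is used as the source of replies instead. The input sentence is parsed by the CKIP Parser; once the sentence category is determined, the head word of the input sentence is identified according to that category and used as the keyword for web crawling on Google Big Data. A reply is then obtained from the keyword and output through speech synthesis. Retrieval experiments show an average accuracy of 84.66% on the outside test. Experiments on natural-language reply generation show that the correct-answer rate for open questions reaches 39.77%.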

    This thesis presents a spoken-dialogue question answering system. An input sentence is analyzed with a symbolic similarity measure and the corresponding answer is retrieved; when the dataset is insufficient, a natural-language answer is generated from Google Big Data. The ASR transcription is segmented by the Chinese Knowledge and Information Processing (CKIP) Chinese word segmentation system, and the resulting word set is vectorized with a bag-of-words model. Because the sentences in the corpus have already been vectorized and placed in a vector space model, the system only needs to compute the cosine similarity between the vectorized query and the vectorized sentences, and it selects the sentence with the highest similarity above a threshold as the response. To improve retrieval accuracy, this thesis modifies the TF-IDF weighting to account for the fact that most terms in a short sentence occur only once. The retrieved response is output to the user through text-to-speech. When the structured database is not sufficient to answer the user's query, the system turns to Google Big Data as the answer source. A query that the structured database cannot handle is syntactically analyzed by the CKIP Parser. Once the sentence category is determined, the head word is taken as the keyword according to that category. The reply is crawled from Google Big Data and output through speech synthesis. Experimental results show that the average retrieval accuracy on the outside test is 84.66%, and that the average correct-answer rate for unstructured queries in the open-domain test reaches 39.77%.
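
    The closed-domain retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes the sentences are already segmented into tokens (the CKIP segmenter is not called here) and uses a standard smoothed TF-IDF weighting, not the modified weighting proposed in the thesis, whose details are not given in this record. The function names, the toy corpus, and the threshold value of 0.5 are all hypothetical.

```python
import math
from collections import Counter


def build_idf(tokenized_sentences):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(tokenized_sentences)
    df = Counter()
    for tokens in tokenized_sentences:
        df.update(set(tokens))  # count each term once per sentence
    return {term: math.log((1 + n) / (1 + count)) + 1.0
            for term, count in df.items()}


def vectorize(tokens, idf):
    """Sparse TF-IDF vector (dict); terms unseen in the corpus are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return {term: freq * idf[term] for term, freq in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_tokens, corpus_vectors, answers, idf, threshold=0.5):
    """Return the answer paired with the most similar stored question,
    or None when the best similarity does not exceed the threshold
    (the case in which the system would fall back to open-domain search)."""
    query = vectorize(query_tokens, idf)
    best_score, best_index = max(
        (cosine(query, vec), i) for i, vec in enumerate(corpus_vectors)
    )
    if best_score <= threshold:
        return None, best_score
    return answers[best_index], best_score


# Hypothetical toy corpus of pre-segmented question/answer pairs.
questions = [
    ["what", "time", "is", "it"],
    ["where", "is", "the", "library"],
    ["how", "are", "you"],
]
answers = [
    "It is three o'clock.",
    "The library is on the second floor.",
    "I am fine, thank you.",
]

idf = build_idf(questions)
corpus_vectors = [vectorize(q, idf) for q in questions]  # vectorized once, up front

reply, score = retrieve(["where", "library"], corpus_vectors, answers, idf)
```

    The corpus vectors are computed once and reused for every query, which mirrors the abstract's point that only the query needs to be vectorized at run time. A `None` reply marks the case that the thesis hands over to the open-domain path.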

    Chinese Abstract
    Abstract
    Acknowledgements
    Contents
    Table List
    Figure List
    Chapter 1 Introduction
        1.1 Background
        1.2 Motivation
        1.3 Objectives
        1.4 Organization
    Chapter 2 Related Works
        2.1 Overview of Question Answering System
        2.2 Overview of Sentence Similarity Measures
            2.2.1 Symbolic Sentence Similarity based on Word Set
            2.2.2 Symbolic Sentence Similarity based on Edit Distance
            2.2.3 Structural Sentence Similarity based on Word Order
            2.2.4 Semantic Sentence Similarity based on WordNet
        2.3 Overview of Answer Validation
    Chapter 3 Fast Answer Retrieval Matching System
        3.1 System Overview
        3.2 Pre-processing
            3.2.1 Framework Overview
            3.2.2 Tokenization
            3.2.3 Vectorization
        3.3 Closed-domain Information Retrieval
            3.3.1 Framework Overview
            3.3.2 Vector Space Model
            3.3.3 Sentence Similarity
        3.4 Open-domain Answer Retrieval
            3.4.1 Framework Overview
            3.4.2 Document / Passage Retrieval
            3.4.3 Knowledge Extraction from Google
            3.4.4 Knowledge Extraction from Unstructured Texts
            3.4.5 Answer Validation
    Chapter 4 Experimental Results
        4.1 Experiment for Closed-domain Information Retrieval
            4.1.1 Corpus
            4.1.2 Evaluation Methods
            4.1.3 Experimental Results
        4.2 Experiment for Open Domain Answer Retrieval
            4.2.1 Corpus
            4.2.2 Evaluation Methods
            4.2.3 Experimental Results
        4.3 Evaluation
            4.3.1 Evaluation Methods
            4.3.2 Experimental Results
    Chapter 5 Conclusions and Future Works
        5.1 Conclusions
        5.2 Future Works
    References
    Appendix-A


    Full text available on campus from 2021-08-01
    Full text available off campus from 2021-08-01