
Graduate student: 洪福生 (Hung, Fu-Sheng)
Thesis title: Fast and High Precision Question Answering Dialogue System Based on Enhanced Trie and TF-IDF - A Case Study of Tainan Delicacy Corpus
Original title: 基於改良式字典樹與TF-IDF之快速高準確度之QA問答系統-以台南美食文本為例
Advisor: 王駿發 (Wang, Jhing-Fa)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2019
Academic year of graduation: 107 (ROC calendar, i.e. 2018-2019)
Language: English
Pages: 49
Keywords: Dialogue System, Question Answering, Trie, TF-IDF, Sentence Similarity
  • This study proposes a fast, high-precision question answering dialogue system based on an enhanced Trie and TF-IDF. The system works in two stages. In the first stage, the enhanced Trie filters the sentences in the corpus, which reduces the number of similarity comparisons and improves the response speed. In the second stage, the TF-IDF algorithm computes the similarity scores. Rather than using a neural network for similarity matching, we use rule-based components (the enhanced Trie and TF-IDF), which avoids the long training time and large dataset that a neural network requires. In addition, the corpus is easier to extend and the results are more predictable than with a neural network. The dialogue system is divided into five steps: 1. Word Segmentation 2. Keyword Extraction 3. Sentence Filtering 4. TF-IDF 5. Similarity Score Calculation. First, the sentences in the corpus go through word segmentation and keyword extraction. The keywords are then sorted using the DF (document frequency) concept from TF-IDF, and a sentence filter is built with the enhanced Trie. The sentence filter selects a subset of candidate questions from the corpus, and a similarity score is computed between the input sentence and each candidate. Finally, the answer corresponding to the question with the highest similarity score is converted to speech by speech synthesis and returned to the user. In the experiments, the accuracy was 95%, the system response time was 2.692 s, and the matching time was 0.076 s, so the system gives the correct response in real time.
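    The two-stage idea in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: whitespace tokenization stands in for Chinese word segmentation, the stop-word list, the DF ordering direction (most frequent keyword nearest the root), the "longest matching path" filtering rule, the smoothed IDF formula, and all names (`TwoStageQA`, `candidates`, `answer`) are hypothetical, and the answers are placeholders.

```python
import math
from collections import Counter

# Assumption: whitespace tokenization stands in for Chinese word segmentation.
STOP_WORDS = {"where", "can", "i", "how", "much", "is", "when", "does", "the", "to", "for"}

def keywords(sentence):
    """Segment a sentence and drop stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

class Node:
    def __init__(self):
        self.children = {}  # keyword -> Node
        self.ids = set()    # corpus sentences whose keyword path passes through this node

class TwoStageQA:
    def __init__(self, qa_pairs):
        self.answers = [a for _, a in qa_pairs]
        self.docs = [keywords(q) for q, _ in qa_pairs]
        # DF: number of corpus questions that contain each keyword.
        self.df = Counter(w for doc in self.docs for w in set(doc))
        self.root = Node()
        for i, doc in enumerate(self.docs):
            node = self.root
            # Assumption: higher-DF keywords sit nearer the root (ties alphabetical).
            for w in sorted(set(doc), key=lambda w: (-self.df[w], w)):
                node = node.children.setdefault(w, Node())
                node.ids.add(i)

    def candidates(self, query):
        """Stage 1: ids of sentences sharing the longest keyword path with the query."""
        wanted = set(keywords(query))
        best_depth, best_ids = 0, set()
        stack = [(self.root, 0)]
        while stack:
            node, depth = stack.pop()
            if depth > best_depth:
                best_depth, best_ids = depth, set(node.ids)
            elif depth == best_depth:
                best_ids |= node.ids
            for w, child in node.children.items():
                if w in wanted:
                    stack.append((child, depth + 1))
        # Fall back to the whole corpus when no keyword matches at all.
        return sorted(best_ids) or list(range(len(self.docs)))

    def vector(self, words):
        """TF-IDF weights; the IDF is smoothed so unseen words get a finite weight."""
        n = len(self.docs)
        return {w: c * (math.log((1 + n) / (1 + self.df[w])) + 1)
                for w, c in Counter(words).items()}

    def answer(self, query):
        """Stage 2: answer of the most similar candidate, or None if nothing overlaps."""
        q = self.vector(keywords(query))
        best_id, best_score = None, 0.0
        for i in self.candidates(query):
            score = cosine(q, self.vector(self.docs[i]))
            if score > best_score:
                best_id, best_score = i, score
        return None if best_id is None else self.answers[best_id]

# Placeholder corpus; the answers are dummy strings, not real data.
qa = TwoStageQA([
    ("where can i eat beef soup", "answer-0"),
    ("how much is beef soup", "answer-1"),
    ("where can i eat shrimp rolls", "answer-2"),
    ("when does the night market open", "answer-3"),
])
print(qa.candidates("where can i eat beef soup tonight"))
print(qa.answer("how much for beef soup"))
```

    In this sketch only the sentences returned by `candidates` are scored with TF-IDF, which is where the reduction in similarity comparisons comes from. The simplified filter trades recall for speed and can miss relevant sentences whose keywords sit on a different branch.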

    Chinese Abstract
    Abstract
    Acknowledgements
    Contents
    Table List
    Figure List
    Chapter 1 Introduction
      1.1 Background
      1.2 Motivation
      1.3 Thesis Objective
      1.4 Thesis Organization
    Chapter 2 Related Works
      2.1 Overview of Question Answering
      2.2 Rule-based System vs Machine Learning
      2.3 Trie
      2.4 TF-IDF
        2.4.1 TF
        2.4.2 DF
        2.4.3 IDF
      2.5 The Survey of Sentences Similarity
    Chapter 3 Fast and High Precision Question Answering Dialogue System
      3.1 System Overview
        3.1.1 Pre-Processing
        3.1.2 Sentence Filter
        3.1.3 Sentence Similarity
      3.2 Pre-Processing
        3.2.1 Automatic Speech Recognition (ASR)
        3.2.2 Word Replace Template
        3.2.3 Word Segmentation
          3.2.3.1 Word Expansion of Jieba
          3.2.3.2 Dataset of Word Expansion of Jieba
        3.2.4 Stop Word
      3.3 Sentence Filter
        3.3.1 Keyword Extraction
        3.3.2 Trie Filter
          3.3.2.1 Building Trie
          3.3.2.2 Filtering Sentences with Trie
      3.4 Sentence Similarity
        3.4.1 TF-IDF
        3.4.2 Distance Similarity
        3.4.3 Text to Speech (TTS)
    Chapter 4 Experimental Results
      4.1 Experiment for QA System in Tainan Delicacy
        4.1.1 Corpus
        4.1.2 Evaluation methods
        4.1.3 Experimental Results
      4.2 Evaluation for the Proposed System
    Chapter 5 Conclusions and Future Works
    References

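    Full-text access: on campus, open from 2022-08-01; off campus, not available.
    The electronic thesis has not been authorized for public release; for the print copy, consult the library catalog.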