| Graduate student: | Hung, Fu-Sheng (洪福生) |
|---|---|
| Thesis title: | Fast and High Precision Question Answering Dialogue System Based on Enhanced Trie and TF-IDF - A Case Study of Tainan Delicacy Corpus (基於改良式字典樹與TF-IDF之快速高準確度之QA問答系統-以台南美食文本為例) |
| Advisor: | Wang, Jhing-Fa (王駿發) |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of publication: | 2019 |
| Academic year of graduation: | 107 (ROC calendar) |
| Language: | English |
| Number of pages: | 49 |
| Keywords: | Dialogue System, Question Answering, Trie, TF-IDF, Sentence Similarity |
This study proposes a fast, high-precision dialogue system based on an enhanced Trie and TF-IDF. The system works in two stages. In the first stage, the enhanced Trie filters the sentences in the corpus, which reduces the number of similarity comparisons and improves the system's response speed. In the second stage, the TF-IDF algorithm computes the similarity scores. Rather than using neural networks for similarity matching, we use rule-based methods, namely the enhanced Trie and TF-IDF, which avoids the long training time and large datasets that neural networks require. The approach also makes the corpus easier to extend and its results more predictable than those of neural networks. The dialogue system consists of five steps: 1. Word Segmentation 2. Keyword Extraction 3. Sentence Filtering 4. TF-IDF 5. Similarity Score Calculation. First, the sentences in the corpus go through word segmentation and keyword extraction, and the keywords are sorted using the DF (document frequency) concept from TF-IDF. A sentence filter is then built with the enhanced Trie. The sentence filter selects a subset of candidate questions from the corpus, and a similarity score is computed between the input sentence and each candidate. Finally, the answer corresponding to the question with the highest similarity score is converted to speech by speech synthesis and returned to the user. In the experiments, the accuracy was 95%, the system response time was 2.692 s, and the matching time was 0.076 s, so the system gives the correct response in real time.
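The two-stage idea can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not say how the enhanced Trie is built or queried, so everything here is an assumption made for illustration. That includes the ascending-DF keyword order, the rule "a sentence is a candidate if its rarest keyword occurs in the query", the smoothed IDF formula, cosine similarity as the score, and the toy corpus and answers. Sentences are assumed to be already segmented, and every token is treated as a keyword.

```python
import math
from collections import Counter


def build_df(docs):
    """Document frequency: the number of sentences each token appears in."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


class TrieNode:
    def __init__(self):
        self.children = {}       # keyword -> TrieNode
        self.sentence_ids = []   # sentences whose keyword path ends here


def build_filter(docs, df):
    """Insert each sentence's keywords as one trie path, rarest keyword first
    (ascending DF, ties broken alphabetically; this order is an assumption)."""
    root = TrieNode()
    for i, tokens in enumerate(docs):
        node = root
        for kw in sorted(set(tokens), key=lambda t: (df[t], t)):
            node = node.children.setdefault(kw, TrieNode())
        node.sentence_ids.append(i)
    return root


def collect(node):
    """All sentence ids stored in the subtree rooted at node."""
    ids = list(node.sentence_ids)
    for child in node.children.values():
        ids.extend(collect(child))
    return ids


def filter_candidates(root, query_tokens):
    """Stage 1 (assumed rule): keep sentences whose rarest keyword is in the query."""
    query = set(query_tokens)
    ids = []
    for kw, child in root.children.items():
        if kw in query:
            ids.extend(collect(child))
    return sorted(ids)


def tfidf(tokens, df, n_docs):
    """Sparse TF-IDF vector with a smoothed IDF (an assumed variant)."""
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, c in tf.items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def best_match(query_tokens, docs, df, root):
    """Stage 2: score only the filtered candidates and return the best index."""
    candidates = filter_candidates(root, query_tokens)
    if not candidates:
        return None
    q = tfidf(query_tokens, df, len(docs))
    return max(candidates,
               key=lambda i: cosine(q, tfidf(docs[i], df, len(docs))))


# Hypothetical toy corpus: pre-segmented questions and their answers.
docs = [
    ["where", "eat", "beef", "soup"],
    ["where", "eat", "shrimp", "rolls"],
    ["when", "open", "night", "market"],
]
answers = ["answer about beef soup", "answer about shrimp rolls",
           "answer about the night market"]

df = build_df(docs)
root = build_filter(docs, df)
idx = best_match(["recommend", "beef", "soup"], docs, df, root)
print(answers[idx] if idx is not None else "no match")
```

With this filtering rule, only the sentences under a matching first-level branch are scored, which is where the reduction in similarity comparisons comes from. The trade-off is that a query lacking a sentence's rarest keyword never reaches that sentence. The abstract does not describe how the enhanced Trie handles such cases.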
On campus: publicly available from 2022-08-01