| Graduate student: | 吳克松 Wu, Ko-Sung |
|---|---|
| Thesis title: | 以主題偵測與追蹤建置階層式知識檢索方法 Hierarchical Knowledge Retrieval Based On Topic Detection and Tracking |
| Advisor: | 王惠嘉 Wang, Hei-Chia |
| Degree: | Master |
| Department: | College of Management - Department of Industrial and Information Management (on the job class) |
| Year of publication: | 2015 |
| Academic year of graduation: | 103 |
| Language: | Chinese |
| Number of pages: | 61 |
| Chinese keywords: | 文字探勘, 主題偵測與追蹤, 知識檢索, 特徵選取, 文件分群 |
| English keywords: | Text Mining, Topic Detection and Tracking, Knowledge Retrieval, Feature Selection, Document Clustering |
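Knowledge is an important corporate asset. With the rapid development of the Internet and information hardware, the unstructured knowledge stored in systems keeps growing in volume and complexity. As a result, users who run traditional keyword queries do find matching records, but there are often so many of them that the information actually needed cannot be located quickly. To address this information overload and ineffective retrieval, concepts of topic relevance have been proposed and applied, where relevance refers to the degree of match between the query terms and the text of a document. Although examining relevance from a topic perspective better satisfies users' retrieval needs, most approaches analyze the full text and overlook the importance of specific parts of a document, and the topics they produce are mostly single words with no relationships among them.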
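To help users retrieve the information they need more easily and understand related topics, this study proposes a method for building hierarchical knowledge retrieval. The data set is the document repository of a company's Notes case-handling system (資料彙辦系統). Natural language processing is applied to the text of each case's subject field, description field, and attached files, and feature selection techniques such as term weighting are used to form document vectors. Using topic detection and tracking techniques, a two-stage clustering method builds hierarchical topic relationships from the similarity between documents, while newly arriving documents are handled with a classification method. Retrieval results are ranked by feature-term weight and presented by topic, so that users facing large volumes of information can quickly retrieve what they need and understand related topics.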
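The field-based term weighting and document similarity described above can be illustrated with a minimal sketch. The field names, the weight values in `FIELD_WEIGHTS`, and the use of cosine similarity are assumptions made for illustration; the abstract does not state the actual weights or the similarity measure used in the thesis.

```python
import math
from collections import Counter

# Hypothetical per-field weights (illustrative assumption, not the thesis values):
# terms in the subject count more than terms in the description or attachments.
FIELD_WEIGHTS = {"subject": 3.0, "description": 2.0, "attachment": 1.0}


def weighted_vector(doc):
    """Build a term -> weight map from a dict of field name -> list of tokens."""
    vec = Counter()
    for field, tokens in doc.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)  # unknown fields get weight 1.0
        for token in tokens:
            vec[token] += weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A document is passed in as already-tokenized fields, for example `weighted_vector({"subject": ["budget"], "description": ["budget", "report"]})`, so word segmentation and part-of-speech filtering are outside the scope of this sketch.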
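The experimental results show that the proposed method achieves a precision of 53.9%, compared with 33.2% for the full-text search of the current system, an improvement of about 20 percentage points. In overall performance, the F-measure of the proposed method is 63%, which is 18.1 percentage points higher than the 44.9% of the current system, indicating that the proposed method improves retrieval effectiveness.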
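For readers who want to relate the two reported measures, the sketch below shows how precision and the F-measure are connected. It assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the abstract does not say which variant was used and does not report recall, so `implied_recall` is only a back-calculation under that assumption.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_recall(precision, f):
    """Solve F = 2PR / (P + R) for R. Only meaningful when 2P > F."""
    return f * precision / (2 * precision - f)


# Values reported in the abstract, as (precision, F-measure).
REPORTED = {
    "proposed method": (0.539, 0.63),
    "current full-text search": (0.332, 0.449),
}

# Back-calculated recall per system, valid only under the F1 assumption.
IMPLIED = {name: implied_recall(p, f) for name, (p, f) in REPORTED.items()}
```

Under the F1 assumption, the reported figures would correspond to a recall of roughly 0.76 for the proposed method and roughly 0.69 for the current system. These are derived estimates, not results stated in the thesis record.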
As digitized documents grow in number and complexity, finding the desired information and related topics accurately becomes both more critical and more difficult. This study proposes a novel approach to hierarchical knowledge retrieval based on topic detection and tracking, which retrieves relevant information from large volumes of documents and presents the main topics to users. In data preprocessing, part-of-speech tagging is combined with bigrams to obtain meaningful compound terms. Unlike other feature selection practices, this method weights terms according to the field in which they appear. The similarity between documents is then calculated, and hierarchically related topics are generated by Single-Pass clustering followed by agglomerative hierarchical clustering (AHC). Results from the system are evaluated against the full-text search system on the intranet, and they indicate that this approach improves both precision and the F-measure. The approach is therefore advantageous for improving the efficiency of knowledge retrieval.
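The two-stage clustering named above can be sketched as follows. This is a simplified illustration, not the thesis implementation: the cosine similarity, the centroid-based comparison, and both threshold parameters are assumptions, since the abstract does not specify the linkage criterion or the thresholds.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def single_pass(vectors, threshold):
    """Stage 1: put each document into the most similar existing cluster,
    or open a new cluster when no similarity reaches the threshold."""
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, centroid([vectors[j] for j in cluster]))
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


def agglomerate(vectors, clusters, threshold):
    """Stage 2: repeatedly merge the two most similar groups of stage-1
    clusters (compared by centroid) until no pair reaches the threshold.
    Returns groups of stage-1 cluster indices, i.e. the upper topic level."""
    groups = [[k] for k in range(len(clusters))]

    def members(group):
        return [vectors[j] for k in group for j in clusters[k]]

    while len(groups) > 1:
        best_pair, best_sim = None, threshold
        for x in range(len(groups)):
            for y in range(x + 1, len(groups)):
                sim = cosine(centroid(members(groups[x])),
                             centroid(members(groups[y])))
                if sim >= best_sim:
                    best_pair, best_sim = (x, y), sim
        if best_pair is None:
            break
        x, y = best_pair
        groups[x] = groups[x] + groups[y]
        del groups[y]
    return groups
```

A stricter threshold in the first stage yields fine-grained topics, and a looser threshold in the second stage groups them into broader parent topics, which gives a two-level hierarchy. Recomputing centroids on every comparison keeps the sketch short but would be too slow for a large document repository.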
On-campus access: available from 2020-08-31