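Graduate student: Li, Shu-Chuan (李淑娟)
Thesis title: Document Topic Detection Based On Semantic Feature (以語意基礎之期刊文獻主題分群方法)
Advisor: Wang, Hei-Chia (王惠嘉)
Degree: Master
Department: College of Management - Institute of Information Management
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: Chinese
Number of pages: 60
Chinese keywords: lexical chain (語彙鏈), document clustering (文件分群), semantic similarity (語意相似度)
Foreign-language keywords: document cluster, lexical chain, semantic similarity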
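    Journals are increasingly published electronically, and with the spread of the Internet, use of online electronic journals rises daily. New electronic journal issues appear every month, so the volume of journal information returned by a query already far exceeds what people can process. To address this, many large electronic journal databases now offer keyword search to reduce the time and effort users spend searching for journal papers. Even so, users still face enormous result sets, and there is no better form of presentation to help them quickly filter relevant literature. How to retrieve and extract the information users need from electronic journal databases accurately and quickly, and how to present search results completely and in a user-friendly way so as to reduce the time spent collecting literature, has become an important research topic.
    Today's electronic journal databases take the keywords a user enters, search for and retrieve every electronic journal article in the database that contains those keywords, and present the results as a plain list. If the topics of the literature could be identified and the search results partitioned by topic, the results could be presented to users in a "topic - document" hierarchy, improving on the traditional list presentation. Although researchers have used topic detection techniques to achieve this goal, their methods are full-text based and applied to general web documents. Such methods are not suited to journal literature, because journal articles have structural features (such as title, keywords, and abstract); moreover, current methods compare documents purely at the level of words and ignore the importance of semantics. This study therefore proposes using the structural features of the literature together with semantics to extract the set of important words of each article and to perform clustering, and then extracting the topic of each cluster to present the search results. The hope is that users can shorten literature collection time through the "topic - cluster" hierarchical presentation and correctly find the information they need.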

    With the development of the Internet, electronic journals have become a trend, and more people use them than before. The number of electronic journal articles is also growing faster than before, so the information generated exceeds what people can handle. To address this problem, many electronic journal databases offer keyword search to reduce the time and effort users spend searching for journal papers. However, users still face huge result sets, and there is currently no better way of presenting results that helps users quickly filter relevant documents. How to provide an efficient search, i.e. one that presents the search results in categories, has become an important research topic.
    Today's electronic journal databases apply keyword search: the user inputs terms of interest, and the search engine finds the papers that contain those keywords and shows the results as a list. If these results can be generalized and classified by topic, they can be shown to users by topic, which should improve on the traditional display method.
    Scholars have employed topic detection methods to achieve this goal for full-text documents on the web. However, journal literature has structural properties, such as title, keywords, and abstract, not only full text. In addition, traditional topic detection methods use only word-frequency features and ignore the importance of semantics. Therefore, this research designs a method based on the structural and semantic properties of the literature to extract the important words of each article and to cluster the articles. It can then extract the topic of each cluster and display the search results by these topics. Users are expected to reduce literature collection time and find the correct information through the topic-cluster display.
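
    The pipeline outlined in the abstract (weight terms by document structure, compare documents on meaning rather than surface words, cluster, then label each cluster with a topic) can be illustrated with a minimal sketch. This is not the thesis's implementation. The field weights, the toy concept map standing in for a lexical resource, the cosine measure, the single-pass clustering, and the threshold value are all assumptions chosen only to make the idea concrete.

```python
from collections import Counter
from math import sqrt

# Hypothetical field weights: title and keyword terms count more than abstract terms.
FIELD_WEIGHTS = {"title": 3.0, "keywords": 2.0, "abstract": 1.0}

# Hypothetical toy thesaurus mapping surface words to a shared concept.
# A real system would consult a lexical resource instead of this table.
CONCEPTS = {
    "cluster": "clustering", "clustering": "clustering", "grouping": "clustering",
    "gene": "gene", "genes": "gene", "genome": "gene",
}
STOPWORDS = {"a", "an", "the", "of", "for", "and", "in", "on", "with"}


def represent(doc):
    """Build a weighted concept vector from a document's structured fields."""
    vector = Counter()
    for field, weight in FIELD_WEIGHTS.items():
        for word in doc.get(field, "").lower().split():
            word = word.strip(".,;:()")
            if word and word not in STOPWORDS:
                # Words sharing a concept are counted under the same key.
                vector[CONCEPTS.get(word, word)] += weight
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Single-pass clustering: join the most similar cluster or start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc index, ...]}
    for i, doc in enumerate(docs):
        vec = represent(doc)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["centroid"].update(vec)  # accumulate member weights
            best["members"].append(i)
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return clusters


def topic(c):
    """Label a cluster with its highest-weighted concept ("" if it has none)."""
    top = c["centroid"].most_common(1)
    return top[0][0] if top else ""
```

    Calling `cluster(docs)` on a list of dictionaries with `title`, `keywords`, and `abstract` strings returns the clusters in order of creation, and `topic(c)` gives a one-word label for each. Because "grouping" and "cluster" map to the same concept in the toy table, two documents that share no surface word in their titles can still land in the same cluster, which is the point of comparing on semantics rather than on words alone.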

    1. Introduction 1
        1.1 Research Background 1
        1.2 Research Motivation and Objectives 2
        1.3 Research Process 3
        1.4 Research Scope and Limitations 5
        1.5 Thesis Organization 5
    2. Literature Review 6
        2.1 Information Retrieval (IR) 6
        2.2 Topic Detection and Tracking (TDT) 8
            2.2.1 Basic Concepts 8
            2.2.2 Research Tasks 9
            2.2.3 Evaluation Methods 11
            2.2.4 Applications of Topic Detection and Tracking 12
        2.3 Topic Detection 12
            2.3.1 Lexical Chain 12
            2.3.2 Naïve Bayes Classifier 14
            2.3.3 Hierarchical Clustering Algorithms 14
            2.3.4 Formal Concept Analysis (FCA) 16
            2.3.5 Sentence Relationship Map 18
            2.3.6 Comparison of Topic Detection Methods 19
        2.4 Clustering Methods 20
            2.4.1 Partitioning Clustering Algorithm 20
            2.4.2 Hierarchical Clustering Algorithms 21
            2.4.3 Density-Based Partitioning 21
            2.4.4 Grid-based Clustering Algorithm 22
            2.4.5 Model-based Clustering Algorithm 22
            2.4.6 Summary 22
        2.5 Latent Semantic Analysis (latent semantic indexing) 23
    3. Research Method 24
        3.1 Research Framework 24
        3.2 Information Collection Module 25
        3.3 Document Representation Module 27
        3.4 Semantic Clustering Module 32
            3.4.1 Computing Semantic Similarity 33
            3.4.2 Semantic Clustering 35
    4. System Implementation and Validation 38
        4.1 System Implementation Design 38
            4.1.1 Pre-process 39
            4.1.2 Document Representative 39
            4.1.3 Semantic Cluster 40
        4.2 Experimental Method 41
            4.2.1 Data Source 42
            4.2.2 Comparison Baselines 42
            4.2.3 Selection of Evaluation Metrics 42
            4.2.4 Experimental Design 43
        4.3 Experimental Results and Analysis 44
        4.4 Sample System Screens 52
    5. Conclusions and Future Research Directions 53
        5.1 Research Results 53
        5.2 Future Research Directions 54
    References 56
    Appendix 1 59

    On-campus access: public from 2010-06-21
    Off-campus access: public from 2010-06-21