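Graduate student: Lo, Ming-Chin (羅明瑾)
Thesis title: Semi-structured Document Categorization with Hierarchical Semantic Relatedness (應用階層式語意關係於半結構化文件之分類)
Advisor: Li, Sheng-Tun (李昇暾)
Degree: Master
Department: College of Management - Institute of Information Management
Year of publication: 2010
Academic year of graduation: 98 (ROC calendar)
Language: English
Number of pages: 46
Keywords (translated from Chinese): knowledge management, heuristic tree-hierarchy, document classification, knowledge representation, tree structure
Keywords (English): knowledge management, information retrieval, semantic relatedness, tree-based hierarchical knowledge structure, text classification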
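  • With the information explosion and the arrival of the knowledge economy, and with the spread of computers and the Internet, people can easily obtain and share knowledge and information. As a result, the number of digital documents is growing exponentially, which makes searching and managing information by hand difficult and time-consuming, so automatically searching and classifying documents has become an important issue. In addition, given the importance of knowledge management (KM), extracting useful knowledge from large volumes of information and presenting it visually is equally important. Automatic document classification and knowledge representation reduce the cost users bear in managing large amounts of knowledge, and among the tools for knowledge representation the tree structure is the most commonly used.
    In this thesis, we adopt a heuristic tree-hierarchy to model the relations among key terms. A heuristic tree-hierarchy is a single-inheritance tree structure that captures both the semantic and the hierarchical relations between words or concepts. We use Reuters-21578 and 20 Newsgroups as the study corpora and select the most representative key terms for each category. Next, we define the original similarity between key terms from their relations in the WordNet database and use the original similarity matrix to construct the heuristic tree-hierarchy. Finally, we derive a hierarchical similarity matrix between key terms from the heuristic tree-hierarchy and classify documents with the original and the hierarchical similarity matrices separately. The experimental results show that, because it takes the hierarchical relations among key terms into account, classification with the hierarchical similarity matrix gives more accurate results. The study also shows that the heuristic tree-hierarchy explicitly provides the hierarchical relations between words or concepts.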
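The table of contents of this record names the JCn (Jiang and Conrath) and Lin measures as the WordNet-based similarities used in the experiments. As a rough illustration of how those two information-content measures are usually defined, the sketch below computes them over a tiny single-inheritance taxonomy. The taxonomy, the concept probabilities, and all function names are hypothetical and chosen only for illustration; the thesis itself works with WordNet, and its exact setup is not described in this record.

```python
import math

# Hypothetical toy taxonomy (child -> parent); NOT taken from the thesis or WordNet.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}

# Made-up probabilities of meeting each concept (or one of its descendants) in a corpus.
PROB = {
    "entity": 1.0,
    "animal": 0.5,
    "vehicle": 0.5,
    "dog": 0.25,
    "cat": 0.25,
    "car": 0.5,
}


def information_content(concept):
    """IC(c) = log(1 / p(c)); the root has IC 0."""
    return math.log(1.0 / PROB[concept])


def ancestors(concept):
    """Return the concept followed by its ancestors, nearest first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def lowest_common_subsumer(a, b):
    """Return the deepest concept that is an ancestor of both a and b."""
    ancestors_of_b = set(ancestors(b))
    for candidate in ancestors(a):
        if candidate in ancestors_of_b:
            return candidate
    raise ValueError("concepts do not share a root")


def lin_similarity(a, b):
    """Lin: 2 * IC(lcs) / (IC(a) + IC(b)), a value in [0, 1]."""
    ic_sum = information_content(a) + information_content(b)
    if ic_sum == 0.0:
        return 1.0  # both concepts are the root
    lcs = lowest_common_subsumer(a, b)
    return 2.0 * information_content(lcs) / ic_sum


def jcn_distance(a, b):
    """Jiang-Conrath distance: IC(a) + IC(b) - 2 * IC(lcs)."""
    lcs = lowest_common_subsumer(a, b)
    return (information_content(a) + information_content(b)
            - 2.0 * information_content(lcs))
```

In this toy setup, "dog" and "cat" share the subsumer "animal" and so receive a non-zero Lin similarity, while "dog" and "car" meet only at the root and receive a Lin similarity of zero. Jiang-Conrath is a distance, so it has to be converted (for example by taking a reciprocal) before it can fill a similarity matrix.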

    In the era of the knowledge economy, knowledge is regarded as an important asset of both individuals and organizations. With the spread of computers, knowledge in the form of digital documents can be disseminated easily over the Internet, and the quantity of knowledge in digital form grows exponentially. Knowledge management (KM) has therefore become an important issue, yet KM work on huge document corpora is tedious and often infeasible by hand. Automatic text classification is employed to facilitate document categorization, and the visualization of knowledge becomes a further concern. Of the many formats used for knowledge representation, the most common is the tree-based hierarchical knowledge structure.
    In this thesis, we apply a methodology for constructing a tree structure called the heuristic tree-hierarchy, which is based on the semantic relatedness between words or concepts. First, key terms are selected from the 20 Newsgroups and Reuters-21578 corpora using information retrieval techniques. Next, the initial semantic relatedness between key terms is computed from the WordNet ontology. Finally, after constructing the heuristic tree-hierarchy, we use it to define hierarchical semantic relatedness and carry out text classification. Compared with the original classifier, the hierarchical classifier classifies documents better and is more specific, meaning that it recognizes actual negatives well.
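The abstract describes the hierarchical classifier as "more specific" in the sense of recognizing actual negatives. A minimal sketch of that metric, specificity (the true negative rate), is given below. The function name and the sample labels are hypothetical; the thesis's own evaluation procedure is not described in this record.

```python
def specificity(y_true, y_pred, positive=1):
    """Specificity = TN / (TN + FP): the share of actual negatives predicted as negative."""
    true_negatives = 0
    false_positives = 0
    for truth, prediction in zip(y_true, y_pred):
        if truth == positive:
            continue  # only actual negatives count toward specificity
        if prediction == positive:
            false_positives += 1
        else:
            true_negatives += 1
    total_negatives = true_negatives + false_positives
    if total_negatives == 0:
        raise ValueError("specificity is undefined without actual negatives")
    return true_negatives / total_negatives


# Hypothetical labels for one category: 1 = document belongs to it, 0 = it does not.
y_true = [1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0]
result = specificity(y_true, y_pred)  # 3 of the 4 actual negatives are recognized
```

Specificity ignores how the positives are handled, so it is normally reported alongside recall or precision, not in place of them.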

    Abstract
    Abstract (in Chinese)
    Acknowledgements
    Table of contents
    List of tables
    List of figures
    Chapter 1 Introduction
      1.1 Research background
      1.2 Research motivation
      1.3 Objective
      1.4 Limitations
      1.5 Structure of thesis
    Chapter 2 Literature review
      2.1 Text classification
        2.1.1 Term weighting
        2.1.2 Term selection
      2.2 Knowledge representation
        2.2.1 Lattice-based knowledge structure
        2.2.2 Tree-based knowledge structure
      2.3 Semantic relatedness
        2.3.1 Edge counting-based approach
        2.3.2 Information theory-based approach
    Chapter 3 Methodology
      3.1 Concept codification
      3.2 WordNet similarity
      3.3 Heuristic tree-hierarchy
        3.3.1 Heuristic tree-hierarchy construction algorithm (HTCA)
        3.3.2 Hierarchiness assessment (HA)
      3.4 Semantic relatedness
      3.5 Document classification
    Chapter 4 Experiments and results
      4.1 Implementation
      4.2 Data collection
      4.3 Classification experiments
        4.3.1 Experiment 1 – JCn measure
        4.3.2 Experiment 2 – Lin measure
      4.4 Evaluation and results
    Chapter 5 Conclusion and future works
      5.1 Conclusion
      5.2 Future works
    References

    Full-text availability: on campus, public from 2020-06-30; off campus, public from 2020-06-30.