Author: Ye, You-Xin (葉祐欣)
Thesis title: Text Categorization using Conceptual Clustering Approach (概念集群法於自動化文件分類之研究)
Advisor: Li, Sheng-Tun (李昇暾)
Degree: Master
Department: College of Management - Institute of Information Management
Year of publication: 2011
Academic year of graduation: 99 (ROC calendar)
Language: Chinese
Pages: 53
Keywords: text categorization (文件分類), fuzzy formal concept analysis (模糊正規概念分析), information retrieval (資訊擷取)
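  • The world has entered an era of information explosion. With the remarkable growth of the Internet and major advances in hardware, the number of digital documents has increased sharply. An effective automatic document classification method can help users quickly obtain the information they want from large document collections and further manage those documents.
    This study proposes an automatic document classification method based on conceptual clustering. It uses the strong knowledge discovery capability of Fuzzy Formal Concept Analysis (FFCA) to uncover the conceptual knowledge implicit in documents, and conceptualizing each document makes clear which concepts it contains. Concept information not only speeds up manual classification but also allows documents to be classified automatically. Because no domain expert needs to take part, the method can be readily extended and applied to different domains.
    For performance evaluation, the Reuters 21578 news dataset and the 20 Newsgroups e-mail dataset, both widely used in the text categorization field, serve as validation targets to examine classification effectiveness. The experimental results show that classifying documents with concept information achieves better performance.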

    In today's world of information explosion, the booming Internet and dramatic improvements in hardware performance have led to a rapid accumulation of digital documents. An efficient classification method is therefore becoming an important technique for helping users easily acquire useful information from text repositories.
    This study proposes a method that classifies documents into a set of predefined categories by means of a conceptual clustering technique. Fuzzy Formal Concept Analysis (FFCA) is a powerful tool for performing conceptual clustering and extracting hidden knowledge from documents. Once the discovered knowledge has been clearly defined, applying the conceptualization procedure makes each document's conceptual knowledge available. This helps experts gain better insight into a document and tag it manually more efficiently. Conceptual knowledge is also suitable for automatic text categorization without expert involvement.
    In this study, the well-known Reuters 21578 news article collection and the 20 Newsgroups e-mail collection are used to evaluate the performance of the proposed method. The experiments show that classification using conceptual information achieves better results.
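
As a rough illustration of the idea described in the abstract, the minimal Python sketch below builds a tiny fuzzy formal context, applies an alpha-cut, enumerates the resulting formal concepts, and scores a document against a concept. It is not the thesis's implementation: the documents, terms, membership values, the threshold alpha = 0.5, and every function name are assumptions made up for illustration, and the min-based membership degree is one common choice in the fuzzy FCA literature rather than a confirmed detail of the proposed method.

```python
from itertools import combinations

# Hypothetical fuzzy formal context: document -> {term: membership in [0, 1]}.
# The weights are made-up illustration values, not data from the thesis.
FUZZY_CONTEXT = {
    "d1": {"oil": 0.9, "price": 0.7, "trade": 0.1},
    "d2": {"oil": 0.8, "price": 0.2, "trade": 0.6},
    "d3": {"oil": 0.1, "price": 0.8, "trade": 0.9},
}
TERMS = ["oil", "price", "trade"]


def alpha_cut(context, alpha):
    """Keep only (document, term) pairs whose membership is at least alpha."""
    return {
        doc: {t for t, mu in weights.items() if mu >= alpha}
        for doc, weights in context.items()
    }


def extent(crisp, terms):
    """Documents that have every term in `terms`."""
    return frozenset(d for d, ts in crisp.items() if set(terms) <= ts)


def intent(crisp, docs, all_terms):
    """Terms shared by every document in `docs` (all terms if docs is empty)."""
    shared = set(all_terms)
    for d in docs:
        shared &= crisp[d]
    return frozenset(shared)


def formal_concepts(crisp, all_terms):
    """Enumerate (extent, intent) pairs by closing every subset of terms.

    Exponential in the number of terms, so suitable for a toy example only.
    """
    concepts = set()
    for r in range(len(all_terms) + 1):
        for subset in combinations(all_terms, r):
            ext = extent(crisp, subset)
            concepts.add((ext, intent(crisp, ext, all_terms)))
    return concepts


def concept_membership(context, doc, concept_intent):
    """Assumed fuzzy degree of `doc` in a concept: minimum over intent terms."""
    if not concept_intent:
        return 1.0
    return min(context[doc].get(t, 0.0) for t in concept_intent)


crisp = alpha_cut(FUZZY_CONTEXT, alpha=0.5)
concepts = formal_concepts(crisp, TERMS)
for ext, itt in sorted(concepts, key=lambda c: (len(c[1]), sorted(c[1]))):
    print(sorted(ext), sorted(itt))
```

With these toy values the 0.5-cut yields eight concepts, for example the pair ({d1, d2}, {oil}). A vector of `concept_membership` scores per document could then be fed to any standard classifier as a concept-based representation, which is the general direction the abstract describes.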

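    Abstract (Chinese)
    Abstract
    Acknowledgments
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
      1.1 Research Background and Motivation
      1.2 Research Objectives
      1.3 Research Steps and Process
    Chapter 2 Literature Review
      2.1 Text Categorization
        2.1.1 Supervised and Unsupervised Learning
        2.1.2 Text Categorization Process
        2.1.2 Document Representation and Preprocessing
        2.1.3 Feature Selection
        2.1.4 Vector Space Model
        2.1.5 Term Frequency-Inverse Document Frequency
        2.1.6 Text Categorization Techniques
      2.2 Fuzzy Set Theory
        2.2.1 Fuzzy Sets
        2.2.2 Fuzzy Logic
        2.2.2 Fuzzy α-Cuts
        2.2.3 Basic Fuzzy Operators
        2.2.4 Fuzzy Set Similarity
      2.3 Formal Concept Analysis
        2.3.1 Formal Context
        2.3.2 Formal Concept
        2.3.3 Concept Lattice
        2.3.4 Fuzzy Formal Concept Analysis
    Chapter 3 Research Method
      3.1 Research Process
      3.2 Constructing Document Vectors
        3.2.1 Document Preprocessing
        3.2.2 Feature Selection
        3.2.3 Computing Feature Term Weights
      3.3 Constructing the Concept Lattice
      3.4 Document Conceptualization
        3.4.1 Computing Concept Membership Degrees
        3.4.2 Learning and Classification
    Chapter 4 Experiments and Analysis
      4.1 Experimental Method
        4.1.1 Datasets
        4.1.2 Classification Performance Metrics
      4.2 Experimental Results
        4.2.1 Sensitivity Analysis
        4.2.2 Comparison of Experimental Results
        4.2.3 Statistical Tests
    Chapter 5 Conclusions and Future Work
      5.1 Conclusions
      5.2 Practical Value
      5.3 Future Work
    References
    Appendix 1: Results of 30 experimental runs on Reuters 21578 for SVM without conceptualization and for the proposed method
    Appendix 2: Results of 30 experimental runs on 20 Newsgroups for SVM without conceptualization and for the proposed method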

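    On campus: available from 2023-12-18
    Off campus: not available
    The electronic thesis has not been authorized for public release; for the print copy, consult the library catalog.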