| Author: | 葉祐欣 Ye, You-Xin |
|---|---|
| Thesis title: | 概念集群法於自動化文件分類之研究 Text Categorization using Conceptual Clustering Approach |
| Advisor: | 李昇暾 Li, Sheng-Tun |
| Degree: | Master (碩士) |
| Department: | College of Management (管理學院) - Institute of Information Management (資訊管理研究所) |
| Year of publication: | 2011 |
| Academic year of graduation: | 99 |
| Language: | Chinese |
| Number of pages: | 53 |
| Keywords (Chinese): | 文件分類, 模糊正規概念分析, 資訊擷取 |
| Keywords (English): | text categorization, fuzzy formal concept analysis, information retrieval |
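The world has entered an era of information explosion. With the remarkable growth of the Internet and major advances in hardware, the number of digital documents has increased sharply. An effective automatic document classification method can help users quickly obtain the information they want from a large document collection and further manage those documents.
This study proposes an automatic document classification method based on conceptual clustering. Using the powerful knowledge discovery capability of Fuzzy Formal Concept Analysis (FFCA), it finds the conceptual knowledge implicit in documents, and conceptualizing the documents makes clear which concepts each document contains. This conceptual information not only helps speed up manual classification but also allows documents to be classified automatically. Because no domain expert needs to be involved, the method can be readily extended and applied to different domains.
For performance evaluation, the Reuters 21578 news dataset and the 20 Newsgroups e-mail dataset, both widely used in the text categorization field, serve as the validation targets for examining classification effectiveness. The experimental results show that classifying documents with conceptual information achieves better performance.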
In an age of information explosion, the rapid growth of the Internet and dramatic improvements in hardware performance have caused digital documents to accumulate at a rapid pace. An efficient classification method is therefore becoming an important technique for helping users easily acquire useful information from text repositories.
This study proposes a method that classifies documents into a set of predefined categories by means of a conceptual clustering technique. Fuzzy Formal Concept Analysis (FFCA) is a powerful tool for performing conceptual clustering and extracting hidden knowledge from documents. Once the discovered knowledge has been clearly defined, applying a conceptualization procedure makes each document's conceptual knowledge available. This knowledge not only gives experts better insight into a document and makes manual tagging more efficient, but is also suitable for automatic text categorization without expert involvement.
In this study, the well-known Reuters 21578 news article collection and the 20 Newsgroups e-mail collection are used to evaluate the performance of the proposed method. The experimental results show that classification using conceptual information achieves better performance.
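The abstract does not describe the implementation, so the following is only a minimal sketch of how conceptual clustering over a fuzzy formal context could look, not the method used in the thesis. It assumes a toy document-term context with invented membership values, an assumed alpha-cut threshold `ALPHA`, a brute-force enumeration of formal concepts, and one common convention in which a document's membership in a concept is the minimum of its memberships over the concept's terms. All names (`CONTEXT`, `alpha_cut`, `formal_concepts`, `fuzzy_extent`) are hypothetical.

```python
from itertools import combinations

# Hypothetical fuzzy formal context: document -> {term: membership in [0, 1]}.
# The documents, terms, and weights are invented for illustration.
CONTEXT = {
    "d1": {"oil": 0.9, "price": 0.7, "trade": 0.2},
    "d2": {"oil": 0.8, "price": 0.6},
    "d3": {"trade": 0.9, "price": 0.5},
}
ALPHA = 0.5  # assumed threshold for the alpha-cut


def alpha_cut(context, alpha):
    """Keep only the document-term pairs whose membership is at least alpha."""
    return {d: {t for t, mu in terms.items() if mu >= alpha}
            for d, terms in context.items()}


def common_terms(crisp, docs):
    """Terms shared by every document in docs (all terms if docs is empty)."""
    shared = set().union(*crisp.values())
    for d in docs:
        shared &= crisp[d]
    return shared


def docs_with(crisp, terms):
    """Documents that contain every term in terms."""
    return {d for d, ts in crisp.items() if terms <= ts}


def formal_concepts(crisp):
    """Enumerate (extent, intent) pairs by closing every subset of documents.

    Exponential in the number of documents, so only suitable for toy data.
    """
    concepts = set()
    docs = sorted(crisp)
    for r in range(len(docs) + 1):
        for subset in combinations(docs, r):
            intent = common_terms(crisp, subset)
            extent = docs_with(crisp, intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


def fuzzy_extent(context, extent, intent):
    """Membership of each document in a concept: min over the intent's terms."""
    return {d: min((context[d][t] for t in intent), default=1.0)
            for d in extent}


if __name__ == "__main__":
    crisp = alpha_cut(CONTEXT, ALPHA)
    for extent, intent in formal_concepts(crisp):
        print(sorted(intent), fuzzy_extent(CONTEXT, extent, intent))
```

In this toy context, documents `d1` and `d2` end up grouped under a concept described by the terms `oil` and `price`, which illustrates how a concept can act as a cluster label. A practical system would replace the brute-force enumeration with a dedicated lattice construction algorithm.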
On campus: publicly available from 2023-12-18