| Author: | Chiang, Huei-Min (江蕙民) |
|---|---|
| Thesis title: | Applying Support Vector Machine for Text Categorization Problems (支援向量機運用在文件分類上的研究) |
| Advisor: | Wang, Tai-yue (王泰裕) |
| Degree: | Doctoral |
| Department: | Department of Industrial and Information Management, College of Management |
| Year of publication: | 2009 |
| Academic year of graduation: | 97 |
| Language: | English |
| Number of pages: | 90 |
| Chinese keywords: | multi-class classification (多類別分類), fuzzy membership (模糊歸屬度), multi-label classification (多重分類) |
| Foreign-language keywords: | multi-class, multi-label, GMM, MOAO-SVM, OAO-FSVM, OAA-FSVM, fuzzy membership function |
As the Internet keeps advancing, people rely on it more and more to obtain and exchange information. To develop methods that can automatically browse online information and filter out the parts people are interested in, the text categorization problem must be solved first. Because information has become so abundant, text categorization can no longer be done by domain experts; it must be performed by an automatic text categorization system. Much of the past research on text categorization has focused on the multi-class classification of web documents. Although many of these studies improved the classifiers themselves, few examined how a classifier should be improved when the instances within a class carry unequal weights. In this study we therefore design a multi-class text categorization system consisting of two modules: a processing module and a classification module. The processing module extracts keywords that represent each web document and represents every article as a keyword vector; the classification module trains the classifier and then evaluates its effectiveness. In the classification module we use the one-against-all fuzzy support vector machine (OAA-FSVM) and the one-against-one fuzzy support vector machine (OAO-FSVM) as the multi-class classifiers; when the concept of fuzzy theory is incorporated into the multi-class classifiers, the influence of highly uncertain samples can be reduced. We then use macro-average indices to evaluate performance under different parameter settings, and apply McNemar's test for statistical significance. The study further designs a multi-label text categorization system for the Reuters data set, called the MOAO-SVM system, which assigns documents to their possible classes according to their content. The approach uses SVM output values to transform high-dimensional data into low-dimensional data for computation, which reduces computational complexity; the SVM output values can also be used to compute the probability of assignment to each possible class. Finally, we present empirical studies showing that the system's automatic classification ability approaches manual classification by domain experts and achieves good classification efficiency.
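As background for the OAA-FSVM and OAO-FSVM classifiers mentioned above, the fuzzy support vector machine formulation they build on (introduced by Lin and Wang) attaches a membership value to each training sample so that uncertain samples are penalized less. The sketch below states the standard primal problem and one common distance-based membership choice; it is illustrative only, since the dissertation's own membership functions and parameter settings may differ.

```latex
% Fuzzy SVM primal: each training sample (x_i, y_i) carries a membership
% s_i in (0, 1], so slack from uncertain samples costs less.
\begin{align}
  \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad
    & \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} s_{i}\, \xi_{i} \\
  \text{s.t.} \quad
    & y_{i}\bigl(\mathbf{w}^{\top}\phi(\mathbf{x}_{i}) + b\bigr) \ge 1 - \xi_{i},
      \qquad \xi_{i} \ge 0, \quad i = 1, \dots, n.
\end{align}
% One common membership choice: distance to the class centre
% \bar{\mathbf{x}}_{y_i}, with a small \delta > 0 keeping s_i positive.
\begin{equation}
  s_{i} = 1 - \frac{\lVert \mathbf{x}_{i} - \bar{\mathbf{x}}_{y_i} \rVert}
                   {\max_{j:\, y_j = y_i} \lVert \mathbf{x}_{j} - \bar{\mathbf{x}}_{y_i} \rVert + \delta}
\end{equation}
```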
As the Internet continues to grow, people rely on it more and more to obtain and exchange information. The text categorization problem must therefore be solved before an automatic web-browsing system can filter out the information a user actually needs.

Because documents are generated in overwhelming volumes every day, text categorization can no longer be carried out manually by domain experts; it has to be handled by an automatic classification system. In recent years many classification methods have been proposed for the multi-class online document classification problem. Most of them concentrated on the choice of classifier, but few examined how the classifier should be adapted when instances carry different weights within a class.

This dissertation presents a more sophisticated text categorization system for the multi-class classification problem. It consists of two modules: a processing module and a classifying module. The processing module extracts keywords from web documents, so that each article is represented by its keywords as a vector in the vector space model. The classifying module trains the classifier and evaluates its effectiveness. Specifically, we propose two multi-class methods, the one-against-all fuzzy support vector machine (OAA-FSVM) and the one-against-one fuzzy support vector machine (OAO-FSVM), as the multi-class classifiers. When fuzzy set theory is incorporated into the classifying module through fuzzy membership functions, the influence of samples with high uncertainty is reduced. We then use macro-average performance indices to evaluate performance under different parameter settings, and assess statistical significance with McNemar's test. Furthermore, a modified one-against-one support vector machine (MOAO-SVM) system is developed to classify multi-label documents, and a comparative study of multi-label approaches is carried out on the Reuters data sets. Data mapping transforms data from a high-dimensional space into a lower-dimensional space using paired SVM output values, which lowers the computational complexity; a pair-wise comparison approach then sets the membership function of each predicted class to judge all possible class assignments. Finally, empirical studies show that the automatic text categorization system approaches the accuracy of manual classification by domain experts and achieves comparable classification efficiency.
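To make the pipeline concrete, the sketch below outlines a one-against-one SVM ensemble in which per-sample fuzzy memberships are passed as sample weights, evaluated with a macro-averaged F1 score and compared against a baseline with McNemar's test. It assumes scikit-learn and NumPy, and uses random toy data in place of the keyword vectors built from the Reuters documents; scikit-learn's `sample_weight` mechanism stands in for the dissertation's fuzzy membership handling and is not its exact implementation.

```python
# Minimal sketch of a one-against-one fuzzy-weighted SVM ensemble,
# macro-average F1 evaluation, and McNemar's test.
# Assumes scikit-learn and NumPy; illustrative only.
from itertools import combinations

import numpy as np
from sklearn.metrics import f1_score
from sklearn.svm import SVC


def fuzzy_membership(X, y):
    """Distance-to-class-centre membership (one common choice in the fuzzy SVM literature)."""
    s = np.empty(len(y))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        centre = X[idx].mean(axis=0)
        d = np.linalg.norm(X[idx] - centre, axis=1)
        s[idx] = 1.0 - d / (d.max() + 1e-6)  # samples far from the centre get low weight
    return s


def train_oao_fsvm(X, y, C=1.0):
    """Train one binary SVM per class pair, weighting samples by their membership."""
    s = fuzzy_membership(X, y)
    models = {}
    for a, b in combinations(np.unique(y), 2):
        idx = np.where((y == a) | (y == b))[0]
        clf = SVC(kernel="linear", C=C)
        clf.fit(X[idx], y[idx], sample_weight=s[idx])
        models[(a, b)] = clf
    return models


def predict_oao(models, X):
    """Majority voting over all pairwise classifiers."""
    votes = np.array([clf.predict(X) for clf in models.values()])
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])


def mcnemar_chi2(y_true, pred_a, pred_b):
    """McNemar's statistic (with continuity correction) comparing two classifiers."""
    b = np.sum((pred_a == y_true) & (pred_b != y_true))
    c = np.sum((pred_a != y_true) & (pred_b == y_true))
    return (abs(b - c) - 1) ** 2 / max(b + c, 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 20))      # toy stand-in for keyword vectors
    y = rng.integers(0, 3, size=300)    # three toy classes
    models = train_oao_fsvm(X, y)
    pred = predict_oao(models, X)
    baseline = SVC(kernel="linear").fit(X, y).predict(X)
    print("macro-average F1:", f1_score(y, pred, average="macro"))
    print("McNemar chi-square vs. plain SVC:", mcnemar_chi2(y, pred, baseline))
```

The distance-based membership function and the voting scheme are standard choices used here only to keep the example self-contained; the dissertation's own systems (OAA-FSVM, OAO-FSVM, MOAO-SVM) define their memberships and decision rules in the main text.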