| Graduate student: | 郭冠忠 Kuo, Kuan-Chung |
|---|---|
| Thesis title: | 利用混合式中文特徵選取法於知識文件分類 A Hybrid Chinese Feature Selection Method for Knowledge Document Classification |
| Advisor: | 王惠嘉 Wang, Hei-Chia |
| Degree: | Master |
| Department: | College of Management - Department of Industrial and Information Management (on the job class) |
| Year of publication: | 2013 |
| Academic year of graduation: | 101 |
| Language: | Chinese |
| Number of pages: | 45 |
| Chinese keywords: | 中文分詞, 特徵選取, SVM, 混合式分類 |
| English keywords: | Chinese Segmentation, feature selection, SVM, hybrid classification |
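Most enterprises maintain their own knowledge management systems for training and for passing on experience, and collecting knowledge documents relevant to their industry is one of the main ways enterprises handle explicit knowledge. As information technology advances, the volume of information grows rapidly and information becomes easier to obtain, so classifying knowledge documents has become an important task in managing enterprise information and knowledge. Before knowledge documents can be classified, the text must be pre-processed so that features can be extracted, and the quality of the selected features affects classification accuracy. The methods and difficulty of text pre-processing vary by language. Chinese is harder to segment because there are no spaces between words. There are currently two main approaches: one relies on a lexicon, the other is purely statistical.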
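The two approaches can be contrasted with a minimal sketch. It is an illustration only: forward maximum matching stands in for lexicon-based segmentation (it is not the algorithm used by any particular segmenter), and the function names and the toy lexicon are hypothetical.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy lexicon-based segmentation (illustrative stand-in).

    At each position, take the longest lexicon word that matches;
    if nothing matches, emit a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


def char_ngrams(text, n=2):
    """Purely statistical view: every overlapping run of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


lexicon = {"知識", "文件", "分類"}  # toy lexicon, hypothetical
sentence = "知識文件分類"

lexicon_tokens = forward_max_match(sentence, lexicon)
bigram_tokens = char_ngrams(sentence, 2)

print(lexicon_tokens)  # ['知識', '文件', '分類']
print(bigram_tokens)   # ['知識', '識文', '文件', '件分', '分類']
```

The lexicon-based output is clean but depends entirely on what the lexicon contains, while the n-gram output needs no lexicon at the cost of producing fragments such as 識文 that are not words.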
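Lexicon-based Chinese word segmentation systems suffer from a word coverage problem: new words appear constantly, and the corpora used by different segmentation systems do not collect the same words. This study therefore proposes a hybrid Chinese feature selection method. The feature subsets obtained with the lexicon-based Stanford Word Segmenter and the CKIP (Chinese Knowledge Information Processing) Chinese word segmentation system are combined with the feature subset obtained with the purely statistical n-grams method to form the final feature set. Each feature subset is obtained through TF-ICF (Term Frequency-Inverse Category Frequency) weighting, and the result is validated with an SVM classifier.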
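The combination step might look roughly like the following sketch. Several points are assumptions rather than details from the abstract: the TF-ICF form used here (term frequency within a category times the log of the number of categories over the number of categories containing the term), the top-k selection rule per category, and all function and variable names. The tokenized documents are toy data, and the real segmenters are not called.

```python
import math
from collections import Counter


def tf_icf_scores(docs_by_category):
    """Map {category: [token list, ...]} to {category: {term: score}}.

    Assumed form: tf(t, c) * log(|C| / cf(t)), where cf(t) is the number
    of categories in which term t occurs.
    """
    n_categories = len(docs_by_category)
    tf = {cat: Counter(tok for doc in docs for tok in doc)
          for cat, docs in docs_by_category.items()}
    cf = Counter(term for counts in tf.values() for term in counts)
    return {cat: {term: freq * math.log(n_categories / cf[term])
                  for term, freq in counts.items()}
            for cat, counts in tf.items()}


def top_k_features(scores, k):
    """Union over categories of the k best terms with a positive score."""
    selected = set()
    for term_scores in scores.values():
        ranked = sorted(((t, s) for t, s in term_scores.items() if s > 0),
                        key=lambda pair: (-pair[1], pair[0]))
        selected.update(term for term, _ in ranked[:k])
    return selected


def hybrid_features(corpus_by_tokenizer, k):
    """Union of the feature subsets selected under each tokenization."""
    final = set()
    for docs_by_category in corpus_by_tokenizer.values():
        final |= top_k_features(tf_icf_scores(docs_by_category), k)
    return final


# Toy data: the same four documents under two hypothetical tokenizations.
corpus_by_tokenizer = {
    "dictionary_segmenter": {
        "finance": [["股票", "市場"], ["股票"]],
        "tech": [["雲端", "市場"], ["雲端"]],
    },
    "char_bigrams": {
        "finance": [["股票", "票市", "市場"], ["股票"]],
        "tech": [["雲端", "端市", "市場"], ["雲端"]],
    },
}

features = hybrid_features(corpus_by_tokenizer, k=2)
print(sorted(features))
```

In this toy run the term 市場 occurs in both categories, so its score is zero and it is never selected, while the bigram tokenization contributes terms that the dictionary tokenization does not produce.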
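The experiments show that the proposed method effectively improves document classification accuracy compared with using a single Chinese word segmentation method. The TF-ICF weighting used in this study takes differences between categories into account and also performs better than TF and TF-IDF (Term Frequency-Inverse Document Frequency). The proposed method can help enterprises classify Chinese knowledge documents more accurately.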
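The difference between the two weightings can be written out as follows. These are commonly used textbook forms, given here as an assumption; the abstract does not state the exact formulas used in the thesis. Here $N$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, $|C|$ the number of categories, and $\mathrm{cf}(t)$ the number of categories containing $t$.

```latex
\begin{align}
\text{TF-IDF}(t, d) &= \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)} \\
\text{TF-ICF}(t, c) &= \mathrm{tf}(t, c) \cdot \log \frac{|C|}{\mathrm{cf}(t)}
\end{align}
```

As a hypothetical illustration, take $N = 100$ documents in $|C| = 4$ categories and a term that occurs in 50 documents, all belonging to one category. Its inverse document frequency is $\log(100/50) = \log 2 \approx 0.69$, whereas its inverse category frequency is $\log(4/1) = \log 4 \approx 1.39$. The category-level factor rewards the term for being concentrated in a single category, which the document-level factor penalizes.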
Enterprises maintain knowledge management systems for training employees, and industry knowledge documents are an important source of explicit knowledge. Classifying knowledge documents has therefore become a significant task for enterprises. Before knowledge documents can be classified, they must go through text pre-processing so that features can be selected, and the quality of those features affects classification accuracy. Chinese sentences are hard to segment during pre-processing because there is no white space between Chinese terms. Two approaches to Chinese segmentation are currently common: one is dictionary-based, the other is statistical.
Unknown terms are a persistent problem for dictionary-based Chinese segmentation systems: no dictionary can cover all terms, because new terms are coined continually. To address this problem, this study combined two dictionary-based Chinese segmentation systems, the Stanford Chinese Word Segmenter and the CKIP segmentation system, with one statistical method, n-grams. The TF-ICF (Term Frequency-Inverse Category Frequency) score of each term was calculated to select the final features, which were then classified and validated with an SVM classifier. The study found that the hybrid Chinese feature selection method achieves higher classification accuracy than methods using a single Chinese segmentation system, and that TF-ICF performs better than TF and TF-IDF. The hybrid Chinese feature selection method can improve the accuracy of Chinese knowledge document classification.