
Graduate Student: Lien, Tzu-Chien (連子建)
Thesis Title: Feature selection methods with hybrid discretization for naive Bayesian classifiers (結合混合型離散化和挑選式簡易貝氏特徵選取來改善簡易貝氏分類器正確率之方法)
Advisor: Wong, Tzu-Tsung (翁慈宗)
Degree: Master
Department: Institute of Information Management, College of Management
Year of Publication: 2012
Graduation Academic Year: 100 (2011-2012)
Language: Chinese
Number of Pages: 45
Keywords: feature selection, hybrid discretization, naïve Bayesian classifier, selective naïve Bayes
    Among classification tools, the naïve Bayesian classifier is frequently used because it is simple and fast to run and delivers good classification accuracy. The classifier realizes its speed advantage only when attributes are discrete, so continuous attributes are generally discretized before being fed in. Hybrid discretization, which finds a better discrete form for each continuous attribute individually, improves classification accuracy more than applying a single discretization method to all attributes. Feature selection is equally important for classification: applied appropriately, it can greatly improve computational efficiency. Selective naïve Bayes feature selection is the mechanism most commonly adopted with naïve Bayesian classifiers, because its operating principle is simple and the selected attribute subset does improve classification accuracy. Past studies combining discretization with feature selection are few, and because hybrid discretization was proposed only recently, its combination with feature selection has not yet been investigated. The purpose of this study is to combine hybrid discretization with feature selection and to observe whether the combination significantly improves classification accuracy. Three methods are proposed. Method one performs hybrid discretization after feature selection and is fast to execute. Methods two and three, broadly speaking, perform hybrid discretization first and feature selection afterwards; owing to the trade-off between computational efficiency and classification accuracy, their designs differ. Method two does not take every possible attribute into account during hybrid discretization; method three, in contrast, accepts a higher computational cost in order to consider all possible attribute combinations. Experiments show that, compared with discretizing all attributes of a data set by a single method and then applying selective naïve Bayes feature selection, all three proposed methods achieve better classification accuracy, demonstrating that combining hybrid discretization with selective naïve Bayes feature selection is beneficial; among them, method three performs best.
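
    To make the per-attribute choice concrete, the sketch below illustrates the idea of hybrid discretization, not the thesis's exact procedure: it assumes scikit-learn's KBinsDiscretizer and CategoricalNB, tries only two of the four candidate methods listed in Chapter 3 (ten equal-width bins and ten equal-frequency bins), and scores each candidate for each continuous column by cross-validated naive Bayes accuracy on that column alone. The function name pick_discretizers is hypothetical.

        import numpy as np
        from sklearn.model_selection import cross_val_score
        from sklearn.naive_bayes import CategoricalNB
        from sklearn.preprocessing import KBinsDiscretizer

        # Two candidate discretizers: ten equal-width bins and ten
        # equal-frequency bins (two of the four methods in Chapter 3).
        CANDIDATES = {
            "ten_equal_width": KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform"),
            "equal_frequency": KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile"),
        }

        def pick_discretizers(X, y, cv=5):
            """Pick a discretization method for each continuous column of X."""
            chosen = {}
            for j in range(X.shape[1]):
                col = X[:, [j]]
                best_name, best_acc = None, -np.inf
                for name, disc in CANDIDATES.items():
                    binned = disc.fit_transform(col)  # ordinal bin codes 0..9
                    # min_categories keeps naive Bayes from failing on bin
                    # codes that happen to be absent from a training fold.
                    nb = CategoricalNB(min_categories=10)
                    acc = cross_val_score(nb, binned, y, cv=cv).mean()
                    if acc > best_acc:
                        best_name, best_acc = name, acc
                chosen[j] = best_name  # per-attribute choice = hybrid discretization
            return chosen

    Scoring each column in isolation keeps the sketch simple; the thesis's three methods differ precisely in how they account for the other attributes when making this choice.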

    The naïve Bayesian classifier is widely used for classification problems because of its computational efficiency and competitive accuracy. Discretization is one of the major approaches for processing continuous attributes for naïve Bayesian classifiers. Hybrid discretization chooses the discretization method for each continuous attribute individually, and a previous study found that it improves the performance of the naïve Bayesian classifier more than unified discretization does. Selective naïve Bayes, abbreviated as SNB, is an important feature selection method for naïve Bayesian classifiers; it improves both efficiency and accuracy by removing redundant and irrelevant attributes. The objective of this study is to develop methods that combine hybrid discretization with feature selection, and three such methods are proposed. Method one, the most efficient, performs hybrid discretization after feature selection. Methods two and three generally perform hybrid discretization first, followed by feature selection. Method two transforms continuous attributes without considering discrete attributes, while method three determines the best discretization method for each continuous attribute by searching all possibilities. The experimental results show that, in general, all three methods combining hybrid discretization with feature selection outperform the method with unified discretization and feature selection, and that method three performs best.
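
    SNB's greedy forward search can likewise be sketched briefly. The snippet below is a minimal illustration under stated assumptions, not a reproduction of the thesis's implementation: all attributes are assumed to be already discrete and ordinal-coded, scikit-learn's CategoricalNB and cross_val_score are used for scoring, the function name selective_naive_bayes is hypothetical, and the stopping rule (stop when no attribute raises cross-validated accuracy) is one common variant.

        import numpy as np
        from sklearn.model_selection import cross_val_score
        from sklearn.naive_bayes import CategoricalNB

        def selective_naive_bayes(X, y, cv=5):
            """Greedy forward selection of attributes for naive Bayes."""
            n_cats = int(X.max()) + 1          # tolerate codes unseen in a fold
            remaining = list(range(X.shape[1]))
            selected, best_acc = [], 0.0
            while remaining:
                # Score every not-yet-selected attribute added to the subset.
                trials = []
                for j in remaining:
                    nb = CategoricalNB(min_categories=n_cats)
                    acc = cross_val_score(nb, X[:, selected + [j]], y, cv=cv).mean()
                    trials.append((acc, j))
                acc, j = max(trials)
                if acc <= best_acc:            # no attribute improves accuracy
                    break                      # keep the current subset
                selected.append(j)
                remaining.remove(j)
                best_acc = acc
            return selected, best_acc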

    Abstract
    Acknowledgments
    Chapter 1  Introduction
      1.1 Research Background and Motivation
      1.2 Research Objectives
      1.3 Research Process and Structure
    Chapter 2  Literature Review
      2.1 The Naive Bayesian Classifier
      2.2 Discretization of Continuous Attributes
      2.3 Feature Selection
        2.3.1 Feature Selection for Discrete Attributes
        2.3.2 Feature Selection for Continuous Attributes
      2.4 Combining Discretization with Feature Selection
    Chapter 3  Research Methods
      3.1 Method One
      3.2 Method Two
      3.3 Method Three
      3.4 Discretization Methods
        3.4.1 Ten-Bin (Equal-Width) Discretization
        3.4.2 Equal-Frequency Discretization
        3.4.3 Minimal-Entropy Discretization
        3.4.4 Proportional Discretization
      3.5 Nonparametric Measurement of Continuous Attribute Importance
      3.6 Method Evaluation
    Chapter 4  Empirical Study
      4.1 Data Set Attributes
      4.2 Method Comparison
        4.2.1 Unified Discretization First
        4.2.2 Hybrid Discretization First
        4.2.3 Method One versus Method Three
      4.3 Summary
    Chapter 5  Conclusions and Future Work
      5.1 Conclusions
      5.2 Future Work
    References

    Asuncion, A. and Newman, D. J. (2007). UCI machine learning repository, http://www.ics.uci.edu/~mlearn/MLRepository.html. Irvine, CA: University of California, School of Information and Computer Science.

    Cestnik, B. (1990). Estimating probabilities: A crucial task in machine learning. Proceedings of the 9th European Conference on Artificial Intelligence, Stockholm, Sweden, 147-150.

    Clark, P. and Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3, 261-283.

    Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103-130.

    Dougherty, J., Kohavi, R., and Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. Proceedings of the 12th International Conference on Machine Learning, San Francisco, CA, 192-202.

    Fayyad, U. M. and Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambery, France, 1022-1027.

    Ferreira, A. and Figueiredo, M. (2011). Unsupervised joint feature discretization and selection. Pattern Recognition and Image Analysis, 6669, 200-207.

    John, G. H., Kohavi, R., and Pfleger, K. (1994). Irrelevant features and the subset selection problem. Proceedings of the 11th International Conference on Machine Learning, New Brunswick, NJ, 121-129.

    John, G. H. and Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers. Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, 338-345.

    Kerber, R. (1992). ChiMerge: Discretization of numeric attributes. Proceedings of the 10th National Conference on Artificial Intelligence, San Jose, CA, 123-128.

    Kwak, N. and Choi, C. H. (2002). Input feature selection by mutual information based on Parzen window. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 1667-1671.

    Langley, P., Iba, W., and Thompson, K. (1992). An analysis of Bayesian classifiers. Proceedings of the 10th National Conference on Artificial Intelligence, San Jose, CA, 223-228.

    Langley, P. and Sage, S. (1994). Induction of selective Bayesian classifiers. Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, Seattle, WA, 399-406.

    Li, Y., Hu, S. J., Yang, W. J., Sun, G. Z., Yao, F. W., and Yang, G. (2009). Similarity-based feature selection for learning from examples with continuous values. Advances in Knowledge Discovery and Data Mining, Springer, 5476, 957-964.

    Liu, H. and Setiono, R. (1995). Feature selection and discretization of numeric attributes. Proceedings of the 7th IEEE International Conference on Tools with Artificial Intelligence, 388-391.

    Mejia-Lavalle, M., Morales, E. F., and Rodriguez, G. (2006). Fast feature selection method for continuous attributes with nominal class. Proceedings of the 5th Mexican International Conference on Artificial Intelligence (MICAI'06), 142-150.

    Peng, H., Long, F., and Ding, C. (2005). Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 1226-1238.

    Pernkopf, F. (2005). Bayesian network classifiers versus selective k-NN classifier. Pattern Recognition, 38, 1-10.

    Pudil, P., Novovicova, J., and Kittler, J. (1994). Floating search methods in feature selection. Pattern Recognition Letters, 15, 1119-1125.

    Ribeiro, M. X., Traina, A. J. M., and Traina, C. Jr. (2008). A new algorithm for data discretization and feature selection. Proceedings of the 2008 ACM Symposium on Applied Computing, New York, USA, 953-954.

    Senthilkumar, J., Manjula, D., and Krishnamoorthy, R. (2009). NANO: A new supervised algorithm for feature selection with discretization. Proceedings of the IEEE International Advance Computing Conference (IACC 2009), Patiala, India, 1515-1520.

    Wong, T. T. (2012). A hybrid discretization method for naïve Bayesian classifiers. Pattern Recognition, 45, 2321-2325.

    Yang, Y. and Webb, G. I. (2009). Discretization for naive-Bayes learning: Managing discretization bias and variance. Machine Learning, 74, 39-74.

    Zhang, M. L., Peña, J. M., and Robles, V. (2009). Feature selection for multi-label naïve Bayes classification. Information Sciences, 179, 3218-3229.
