
Graduate student: 陳家駒 (Chen, Chia-Chu)
Thesis title: 簡易貝氏分類器之個別屬性離散化設定方法
(Individual Attribute Discretization Setting for Improving the Performance of Naïve Bayesian Classifiers)
Advisor: 翁慈宗 (Wong, Tzu-Tsung)
Degree: Master
Department: College of Management, Department of Industrial and Information Management
Year of publication: 2010
Academic year of graduation: 98 (2009-2010)
Language: Chinese
Pages: 49
Chinese keywords: 屬性間相關性, 離散化方法, 簡易貝氏分類器
English keywords: correlation, discretization method, naïve Bayesian classifiers
Abstract (Chinese):

Classification has long been an important part of data mining: the information that data provide helps people predict events that may happen in the future, and different classifiers have their own strengths for handling classification problems with different characteristics. Among them, the naïve Bayesian classifier is one of the most commonly used; it is computationally efficient and produces stable classification results, and it is therefore widely applied in practice. Previous studies on discretization mainly aimed to develop discretization methods better suited to naïve Bayesian classifiers and then applied a single such method to all attributes in a data set. In contrast, this study starts from the observation that different attributes have different characteristics, so processing all continuous attributes in a data set with the same discretization method is unreasonable; more precisely, to get the most out of the naïve Bayesian classifier, the discretization method should be matched to the characteristics of each continuous attribute. This study first selects, as a basis, several discretization techniques reported in the literature to work well with naïve Bayesian classifiers, and then develops an adjustment mechanism that discretizes each attribute individually, with the goal of improving the classifier's performance. In the empirical part, 15 data sets were drawn from the UCI repository for testing, and the results show a consistent trend: the classification accuracy obtained when attributes are discretized individually is indeed higher than when all attributes are processed by the same discretization method, which indicates that individual attribute discretization is helpful to naïve Bayesian classifiers. Finally, to explore why classification performance improves, this study also analyzes how individual discretization changes the average attribute correlation of a data set, and conjectures that the accuracy gain comes from the reduction of the correlations among attributes.
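The closing conjecture above concerns the average inter-attribute correlation of a data set before and after individual discretization. The record does not state which correlation measure the thesis adopts, so the sketch below is an illustration only: it averages Cramér's V, a chi-square-based association measure for categorical data, over all pairs of discretized attributes.

```python
# Illustrative stand-in only: the thesis's exact correlation measure is not
# stated in this record, so Cramér's V is used here for the association
# between two discretized (integer-coded) attributes.
import numpy as np
from itertools import combinations

def cramers_v(x, y):
    """Cramér's V between two integer-coded categorical vectors."""
    table = np.zeros((int(x.max()) + 1, int(y.max()) + 1))
    for a, b in zip(x, y):                   # build the contingency table
        table[a, b] += 1
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    with np.errstate(divide="ignore", invalid="ignore"):
        chi2 = np.nansum((table - expected) ** 2 / expected)
    r, c = table.shape
    denom = n * (min(r, c) - 1)
    return float(np.sqrt(chi2 / denom)) if denom > 0 else 0.0

def average_correlation(Xd):
    """Mean pairwise Cramér's V over all attribute pairs of a data set."""
    pairs = combinations(range(Xd.shape[1]), 2)
    return float(np.mean([cramers_v(Xd[:, i], Xd[:, j]) for i, j in pairs]))
```

Computing average_correlation on the same data discretized before and after the per-attribute adjustment would reproduce the kind of before-and-after comparison the abstract describes.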

Abstract (English):

Classification is one of the most important topics in data mining: a classifier uses the information carried by observed data to predict the class of a new instance. Naïve Bayesian classifiers are a widely used classification tool because of their computational efficiency and stable accuracy. However, the naïve Bayesian classifier is not designed to process numerical attributes directly, so researchers have focused on developing discretization methods that transform numerical attributes for naïve Bayesian classifiers. Since every attribute has its own characteristics, it is not entirely reasonable to apply the same discretization technique to all numerical attributes in a data set. In this study, the numerical attributes in a data set are allowed to be discretized by different methods to improve the performance of the naïve Bayesian classifier, and an attribute discretization setting algorithm is proposed to find an appropriate discretization method for each numerical attribute. The experimental results on 15 data sets from the UCI data repository demonstrate that the proposed attribute discretization setting algorithm can generally achieve higher prediction accuracy than the approach in which all numerical attributes are processed by the same discretization method. The major reason could be that the correlations among numerical attributes are reduced when they are processed by different discretization methods.
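To make the notion of an attribute discretization setting concrete, here is a minimal sketch of one way such a search could proceed: each numeric attribute is greedily assigned, one column at a time, the candidate discretization method that maximizes cross-validated naive Bayes accuracy while the remaining columns are held fixed. The two candidate methods, the single greedy pass, and the use of scikit-learn's CategoricalNB are assumptions for illustration; this is not the two-stage algorithm developed in the thesis.

```python
# Minimal sketch of a per-attribute discretization search (illustrative only,
# not the thesis's algorithm). Assumes scikit-learn is available.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB

def equal_width(col, k=10):
    """Ten-interval equal-width discretization (Section 3.1.1)."""
    edges = np.unique(np.linspace(col.min(), col.max(), k + 1)[1:-1])
    return np.digitize(col, edges)

def proportional(col):
    """Proportional discretization: both the number of intervals and the
    instances per interval grow as roughly sqrt(N) (Yang and Webb)."""
    k = max(1, int(round(np.sqrt(len(col)))))
    edges = np.unique(np.quantile(col, np.linspace(0, 1, k + 1))[1:-1])
    return np.digitize(col, edges)

METHODS = {"ew10": equal_width, "pkid": proportional}

def cv_accuracy(X, y, setting, folds=5):
    """Cross-validated naive Bayes accuracy under a given method assignment."""
    Xd = np.column_stack([METHODS[m](X[:, j]) for j, m in enumerate(setting)])
    clf = CategoricalNB(min_categories=Xd.max(axis=0) + 1)  # tolerate unseen bins
    return cross_val_score(clf, Xd, y, cv=folds).mean()

def individual_discretization(X, y):
    """Greedily pick one discretization method per numeric attribute."""
    setting = ["ew10"] * X.shape[1]          # common baseline for all columns
    best = cv_accuracy(X, y, setting)
    for j in range(X.shape[1]):              # one pass over the attributes
        for name in METHODS:
            trial = setting[:j] + [name] + setting[j + 1:]
            score = cv_accuracy(X, y, trial)
            if score > best:
                best, setting = score, trial
    return setting, best
```

On a data set loaded as a numeric matrix X with label vector y, individual_discretization(X, y) returns one method name per column together with the resulting accuracy estimate; the thesis additionally ranks attributes by importance before adjusting them, which this sketch omits.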

Table of Contents

Abstract
Chapter 1  Introduction
  1.1 Research motivation and background
  1.2 Research objectives
  1.3 Research procedure
Chapter 2  Literature Review
  2.1 Naïve Bayesian classifiers
  2.2 Discretization methods
  2.3 Attribute importance ranking
Chapter 3  Tests of Individual Discretization
  3.1 Discretization methods
    3.1.1 Ten-interval equal-width discretization
    3.1.2 Entropy-minimization discretization
    3.1.3 Proportional discretization
    3.1.4 Fixed-frequency discretization
  3.2 Classification evaluation
  3.3 Experimental results
  3.4 Summary
Chapter 4  Research Method
  4.1 Attribute importance ranking
  4.2 Adjustment and setting of individual discretization methods
  4.3 Measurement of inter-attribute correlation
  4.4 Summary
Chapter 5  Empirical Results
  5.1 Data set attributes
  5.2 Classification accuracy and attribute discretization methods
    5.2.1 Second-stage classification accuracy and the corresponding discretization methods
    5.2.2 Combined classification accuracy and discretization methods
  5.3 Analysis of inter-attribute correlation
    5.3.1 Average attribute correlation for continuous attributes
    5.3.2 Average attribute correlation for mixed attributes
  5.4 Summary
Chapter 6  Conclusions and Future Work
  6.1 Research results
  6.2 Research contributions
  6.3 Future work
References (Chinese and English)
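Sections 3.1.1-3.1.4 list four standard discretization methods; ten-interval equal-width and proportional discretization are sketched after the English abstract above, and minimal textbook-style sketches of the remaining two follow: fixed-frequency discretization, where each interval holds roughly m instances (m = 30 is the value suggested by Yang and Webb), and Fayyad and Irani's entropy-minimization method with the MDL stopping criterion. These are generic formulations, not code from the thesis.

```python
import numpy as np

def fixed_frequency(col, m=30):
    """Fixed-frequency discretization: one cut point every m sorted instances,
    so each interval holds about m training instances."""
    xs = np.sort(col)
    edges = np.unique(xs[m::m])
    return np.digitize(col, edges)

def entropy(y):
    """Class entropy in bits."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mdl_cut_points(col, y):
    """Fayyad-Irani entropy minimization: recursively choose the binary cut
    that minimizes class-information entropy, keeping it only when the gain
    passes the MDL test. O(n^2) as written; a production version would
    maintain running class counts instead of recomputing entropies."""
    order = np.argsort(col)
    x, y = np.asarray(col)[order], np.asarray(y)[order]
    cuts = []

    def recurse(lo, hi):
        n = hi - lo
        base = entropy(y[lo:hi])
        best = None
        for i in range(lo + 1, hi):
            if x[i] == x[i - 1]:
                continue                     # cut only between distinct values
            e = ((i - lo) * entropy(y[lo:i])
                 + (hi - i) * entropy(y[i:hi])) / n
            if best is None or e < best[0]:
                best = (e, i)
        if best is None:
            return
        e, i = best
        gain = base - e
        k = len(np.unique(y[lo:hi]))
        delta = (np.log2(3 ** k - 2)
                 - (k * base
                    - len(np.unique(y[lo:i])) * entropy(y[lo:i])
                    - len(np.unique(y[i:hi])) * entropy(y[i:hi])))
        if gain > (np.log2(n - 1) + delta) / n:   # MDL acceptance criterion
            cuts.append((x[i - 1] + x[i]) / 2.0)
            recurse(lo, i)
            recurse(i, hi)

    recurse(0, len(x))
    return sorted(cuts)
```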

Chinese

黃偉碩 (2005). Fault detection and diagnosis for semiconductor manufacturing processes using Bayesian classification and factor analysis, master's thesis, Institute of Technology Management, Chung Hua University.
張良豪 (2009). Improving the performance of naïve Bayesian classifiers with Bayesian attribute selection and prior distributions, master's thesis, Institute of Industrial Management, National Cheng Kung University.

English

Asuncion, A. and Newman, D. (2007). UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Bacardit, J. and Garrell, J. M. (2004). Analysis and improvements of the Adaptive Discretization Intervals knowledge representation, Lecture Notes in Computer Science, 3103, 726-738.
Bay, S. D. (2000). Multivariate discretization of continuous variables for set mining. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Irvine, CA, 315-319.
Biesiada, J., Duch, W., Kachel, A., Maczka, K., and Palucha, S. (2005). Feature ranking methods based on information entropy with Parzen windows, International Conference on Research in Electrotechnology and Applied Informatics, Katowice, Poland, 109-118.
Bouckaert, R. R. (2004). Naive Bayes classifiers that perform well with continuous variables, Lecture Notes in Computer Science, 3339, 1089-1094.
    Cestnik, B. (1990). Estimating probabilities: A crucial task in machine learning, Proceedings of the 9th European Conference on Artificial Intelligence, Stockholm, Sweden, 147-150.
    Chmielewski, M. R. and Grzymala-Busse, J. W. (1996). Global discretization of continuous attributes as preprocessing for machine learning. International Journal of Approximate Reasoning, 15, 319-331.
    Clark, P. and Niblett, T. (1989). The CN2 induction algorithm, Machine Learning, 3, 261-283.
Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss, Machine Learning, 29, 103-130.
    Dudoit, S., Fridlyand, J., and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association, 97, 77-87.
    Fayyad, U. M. and Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambéry, France, 1022-1027.
Hsu, C. N., Huang, H. J., and Wong, T. T. (2000). Why discretization works for naive Bayesian classifiers. Proceedings of the 17th International Conference on Machine Learning, Palo Alto, CA, 399-406.
Hsu, C. N., Huang, H. J., and Wong, T. T. (2003). Implications of the Dirichlet assumption for discretization of continuous attributes in naïve Bayesian classifiers. Machine Learning, 53, 235-263.
John, G. H., Kohavi, R., and Pfleger, K. (1994). Irrelevant features and the subset selection problem. Proceedings of the 11th International Conference on Machine Learning, New Brunswick, NJ, 121-129.
John, G. H. and Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, 338-345.
Kerber, R. (1992). ChiMerge: discretization of numeric attributes. Proceedings of the Tenth National Conference on Artificial Intelligence, San Jose, CA, 123-128.
Kotsiantis, S. and Kanellopoulos, D. (2006). Discretization techniques: a recent survey. GESTS International Transactions on Computer Science and Engineering, 32(1), 47-58.
    Langley, P., Iba, W., and Thompson, K. (1992). An analysis of Bayesian classifiers, Proceedings of the Tenth National Conference on Artificial Intelligence, San Jose, CA, 223-228.
Langley, P. and Sage, S. (1994). Induction of selective Bayesian classifiers. Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, Seattle, WA, 399-406.
Lopez de Mantaras, R. (1991). A distance-based attribute selection measure for decision tree induction. Machine Learning, 6, 81-92.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (1988). Numerical Recipes in C, Cambridge University Press, Cambridge.
    Quinlan, J. R. (1986). Induction of decision trees, Machine Learning, 1, 81-106.
    Schneider, K. M. (2005). Techniques for improving the performance of naïve Bayes for text classification, Lecture Notes in Computer Science, 3406, 682-693.
    Yang, Y. and Webb, G. I. (2001). Proportional k-interval discretization for naive-Bayes classifiers, Proceedings of the Twelfth European Conference on Machine Learning, Freiburg, Germany, 564–575.
Yang, Y. and Webb, G. I. (2002). A comparative study of discretization methods for naïve Bayes classifiers. Proceedings of the Pacific Rim Knowledge Acquisition Workshop, Tokyo, Japan, 159-173.
    Yang, Y. and Webb, G. I. (2002). Non-disjoint discretization for naive-Bayes classifiers. Proceedings of the Nineteenth International Conference on Machine Learning, Sydney, Australia, 666-673.
Yang, Y., Webb, G. I., and Wu, X. (2006). Discretization methods. Data Mining and Knowledge Discovery Handbook, 6, 113-130.
    Yang, Y. and Webb, G. I. (2009). Discretization for naive-Bayes learning: managing discretization bias and variance, Machine Learning, 74, 39-74.

Full-text availability: on campus from 2012-07-19; off campus from 2012-07-19.