| Student: | 陳志濬 Chen, Zhi-Jun |
|---|---|
| Thesis Title: | 探討屬性值排序方法對具有廣義狄氏先驗分配的簡易貝氏分類器之影響 (Investigating the Effect of Attribute Value Ranking Methods on Naive Bayes Classifier with Generalized Dirichlet Priors) |
| Advisor: | 翁慈宗 Wong, Tsz-Tzung |
| Degree: | Master |
| Department: | College of Management, Department of Industrial and Information Management |
| Year of Publication: | 2010 |
| Academic Year: | 98 |
| Language: | Chinese |
| Pages: | 41 |
| Chinese Keywords: | 廣義狄氏分配, 簡易貝氏分類器, 屬性值排序 |
| English Keywords: | generalized Dirichlet distribution, naïve Bayesian classifier, attribute value sorting |
| Hits / Downloads: | 101 / 3 |
Because of its high computational efficiency and strong classification performance, the naive Bayes classifier has been widely used in data mining. To increase the flexibility of the naive Bayes classifier and thereby improve classification accuracy, the possible values of an attribute are commonly assumed to follow a prior distribution, usually a Dirichlet or a generalized Dirichlet distribution. The generalized Dirichlet distribution relaxes the restrictions of the Dirichlet distribution, so although it requires more computation time as a prior, it usually attains higher classification accuracy. When the parameters of a generalized Dirichlet prior for an attribute are adjusted, the parameters corresponding to that attribute's values must be adjusted in sequence. At present, the values are not sorted in any way before this adjustment: they are processed in the order in which they appear in the data set's documentation. This study therefore examines whether ordering attribute values by importance affects classification performance. Three ordering methods that take attribute value importance into account are proposed and compared, in terms of classification accuracy, with the current practice of not deliberately choosing an adjustment order. Twenty data sets from the UCI repository were analyzed. The experimental results show that the highest accuracy is almost always obtained when attribute values are ordered; when a suitable ordering method is found, classification performance exceeds that of the unordered case, showing that attribute value ordering can improve classification performance. When a generalized Dirichlet prior is used and accuracy is the primary concern, the frequency ordering method is recommended, since it performs well on data sets with both continuous and discrete attributes and has the lowest computational complexity; when computational convenience matters more, attribute value ordering is unnecessary.
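For reference, the standard form of the generalized Dirichlet density (notation assumed here; the thesis may parameterize it differently) makes the order dependence explicit:

```latex
% Generalized Dirichlet density over (x_1, ..., x_k), x_i >= 0, x_1 + ... + x_k <= 1:
f(x_1,\dots,x_k)
  = \prod_{i=1}^{k} \frac{\Gamma(\alpha_i+\beta_i)}{\Gamma(\alpha_i)\,\Gamma(\beta_i)}\,
    x_i^{\alpha_i-1}\Bigl(1-\sum_{j=1}^{i} x_j\Bigr)^{\gamma_i},
\qquad
\gamma_i =
  \begin{cases}
    \beta_i-\alpha_{i+1}-\beta_{i+1}, & i<k,\\
    \beta_k-1, & i=k.
  \end{cases}
```

When $\beta_i = \alpha_{i+1} + \beta_{i+1}$ for every $i < k$, this collapses to the Dirichlet$(\alpha_1,\dots,\alpha_k,\beta_k)$ distribution; otherwise the density is not exchangeable in its components, so which attribute value is assigned to which component genuinely changes the prior, which is exactly why the adjustment order studied in this thesis can matter.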
Naïve Bayesian classifiers have been widely used for data classification because of their computational efficiency and competitive accuracy. In a naïve Bayesian classifier, the prior distributions of an attribute are generally assumed to be Dirichlet or generalized Dirichlet distributions. The generalized Dirichlet distribution relaxes the restrictions of the Dirichlet distribution, and usually results in higher classification accuracy. However, the order of the variables in a generalized Dirichlet random vector is generally not arbitrary. In this study, three methods for determining the order of attribute values are proposed to study their impact on the performance of naïve Bayesian classifiers with noninformative generalized Dirichlet priors. The experimental results on 20 data sets from the UCI data repository demonstrate that when attribute values are properly ordered, classification accuracy can be slightly improved relative to unordered attribute values. When computational efficiency is a major concern, ordering attribute values for employing noninformative generalized Dirichlet priors is not necessary.
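To make the frequency-ordering idea concrete, here is a minimal Python sketch under two stated assumptions: the "frequency ordering" method is taken to mean sorting an attribute's distinct values by descending frequency, and the noninformative prior is taken as $\alpha_i = \beta_i = 1$ for every component (the thesis's actual choices may differ). It orders one attribute's values within a class and computes the posterior-mean probability estimates a naïve Bayesian classifier would use, via the stick-breaking representation of the generalized Dirichlet distribution.

```python
from collections import Counter

def frequency_order(values):
    """Order the distinct values of one attribute by descending frequency.
    (Assumed form of the thesis's 'frequency ordering' method; ties arbitrary.)"""
    counts = Counter(values)
    return sorted(counts, key=lambda v: -counts[v]), counts

def gd_posterior_means(ordered_values, counts):
    """Posterior mean of P(value | class) under a generalized Dirichlet prior.
    Stick-breaking view: p_i = z_i * prod_{j<i}(1 - z_j) with independent
    z_i ~ Beta(alpha_i, beta_i), and the conjugate update is
    alpha_i' = alpha_i + n_i,  beta_i' = beta_i + n_{i+1} + ... + n_k,
    giving E[p_i] = (alpha_i'/(alpha_i'+beta_i')) * prod_{j<i} beta_j'/(alpha_j'+beta_j').
    Prior alpha_i = beta_i = 1 is an illustrative noninformative choice,
    not necessarily the parameterization used in the thesis."""
    k = len(ordered_values)
    n = [counts[v] for v in ordered_values]
    probs, remaining = {}, 1.0
    for i in range(k - 1):            # the last value takes the leftover stick
        a = 1 + n[i]                  # alpha_i + n_i
        b = 1 + sum(n[i + 1:])        # beta_i + counts of all later values
        probs[ordered_values[i]] = remaining * a / (a + b)
        remaining *= b / (a + b)
    probs[ordered_values[-1]] = remaining
    return probs

# Example: one attribute column restricted to the instances of one class.
column = ["red", "red", "blue", "red", "green"]
order, counts = frequency_order(column)   # ['red', 'blue', 'green']
print(gd_posterior_means(order, counts))  # estimates sum to 1
```

Under this illustrative prior, earlier components of the generalized Dirichlet vector receive more prior mass, so placing frequent values first aligns the prior with the data, which is one intuition for why a good ordering can help classification accuracy.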