| Student: | 姚靜姍 Yao, Ching-Shan |
|---|---|
| Thesis title: | 簡易貝氏分類器在不平衡資料集上效能改善之研究 (A Study on the Performance Improvement of Naive Bayesian Classifier on Imbalanced Data Sets) |
| Advisor: | 翁慈宗 Wong, Tzu-Tsung |
| Degree: | Master |
| Department: | College of Management, Institute of Information Management |
| Year of publication: | 2018 |
| Graduation academic year: | 106 |
| Language: | Chinese |
| Pages: | 56 |
| Keywords (Chinese): | naive Bayesian classifier, imbalanced data set, attribute ranking, feature selection, generalized Dirichlet distribution |
| Keywords (English): | feature selection, generalized Dirichlet distribution, imbalanced data set, naive Bayesian classifier, selective naive Bayes |
Among the many classification methods, the naive Bayesian classifier is widely applied because it is easy to use, computationally efficient, and highly accurate. Many classification methods, however, implicitly assume that the class distribution is not highly skewed, so they usually achieve satisfactory results on ordinary data. When most instances in a data set belong to one class while the class of interest is the minority, the data set is said to be imbalanced. Since the naive Bayesian classifier computes a classification score by multiplying the class prior probability by the conditional probabilities of all attribute values, the large gap between the two class priors in an imbalanced data set may cause minority-class instances to be misclassified as the majority class during learning. This study therefore first tests the impact of the class prior probabilities on the naive Bayesian classifier. In addition, attributes are ranked by importance with a Bayesian attribute selection method, feature selection is performed, and prior distributions are introduced to adjust the attribute parameters, so as to improve the classification performance of the naive Bayesian classifier. Ten data sets downloaded from the UCI repository were processed into imbalanced data sets for the experiments. The empirical results show that whether or not the class prior probabilities are considered has little effect, whereas introducing prior distributions significantly improves the performance of the naive Bayesian classifier on imbalanced data sets. With this improvement, the classifier becomes competitive with RIPPER, although it remains slightly inferior to Random Forest.
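The abstract above notes that prior distributions are introduced to adjust the attribute parameters, since with only a handful of minority-class instances a raw frequency estimate can be zero and wipe out the whole product of conditional probabilities. A minimal sketch of that smoothing idea, using a symmetric Dirichlet prior for simplicity (the thesis uses generalized Dirichlet priors; the `alpha` values and toy counts below are illustrative assumptions):

```python
from collections import Counter

def conditional_prob(values, target, alpha, n_outcomes):
    """Estimate P(attribute = target | class) from the observed attribute
    values of one class, smoothed with a symmetric Dirichlet prior of
    strength alpha. alpha = 0 gives the raw maximum-likelihood estimate;
    alpha = 1 gives Laplace smoothing."""
    counts = Counter(values)
    return (counts[target] + alpha) / (len(values) + alpha * n_outcomes)

# Minority class observed only 3 times; attribute value "a" never seen,
# out of 3 possible outcomes {"a", "b", "c"}.
minority_values = ["b", "b", "c"]
print(conditional_prob(minority_values, "a", 0, 3))  # 0.0 — zeroes out the product
print(conditional_prob(minority_values, "a", 1, 3))  # 1/6 ≈ 0.167 — stays positive
```

A generalized Dirichlet prior refines this by allowing a different prior strength per attribute outcome instead of a single shared `alpha`.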
The number of negative instances is generally far larger than the number of positive instances in an imbalanced data set. Since positive instances are few, the probability estimates for calculating the classification probability of this class value can be unreliable when the naïve Bayesian classifier is applied. This could be the main reason for the naïve Bayesian classifier's relatively poor performance on imbalanced data sets. This study first investigates whether the occurring probabilities of class values should be considered in calculating classification probabilities. Attributes are then ranked for introducing generalized Dirichlet priors to improve the performance of the naïve Bayesian classifier on imbalanced data sets. The experimental results obtained from 10 data sets show that removing the occurring probabilities of class values from the calculation of classification probabilities is not necessary, and that introducing priors for attributes can generally achieve a higher F-measure on imbalanced data sets. The naïve Bayesian classifier with priors is competitive with the RIPPER algorithm, while its F-measure is lower than that of Random Forest on most of the imbalanced data sets.
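The first research question above — whether the occurring probabilities of class values should enter the classification probability — can be sketched on a toy 9:1 data set. This is a plain-Python illustration with Laplace smoothing, not the thesis's actual procedure; the data, labels, and counts are made up for the example:

```python
def nb_score(x, data, labels, cls, use_prior=True):
    """Unnormalized naive Bayes score for class cls: the (optional)
    Laplace-smoothed class prior times the product of Laplace-smoothed
    attribute conditional probabilities."""
    idx = [i for i, y in enumerate(labels) if y == cls]
    score = (len(idx) + 1) / (len(labels) + 2) if use_prior else 1.0
    for j, v in enumerate(x):
        col = [data[i][j] for i in idx]           # attribute j within class cls
        domain = {row[j] for row in data}         # possible outcomes of attribute j
        score *= (col.count(v) + 1) / (len(col) + len(domain))
    return score

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall, the evaluation measure
    used for imbalanced data sets."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# 9 majority ("neg") instances vs. 1 minority ("pos") instance.
data = [["a"], ["a"], ["a"], ["b"], ["b"], ["b"], ["b"], ["b"], ["b"], ["a"]]
labels = ["neg"] * 9 + ["pos"]
x = ["a"]
# With the class prior, the skewed 9:1 ratio pulls the prediction to "neg";
# dropping the prior compares the attribute likelihoods alone, giving "pos".
for use_prior in (True, False):
    s_pos = nb_score(x, data, labels, "pos", use_prior)
    s_neg = nb_score(x, data, labels, "neg", use_prior)
    print(use_prior, "pos" if s_pos > s_neg else "neg")
```

The experiments reported above found that dropping the class prior in this way brings little benefit overall, while adjusting the attribute conditional probabilities through priors is what raises the F-measure.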