| Author: | Hsu, Shun-Fu (許順富) |
|---|---|
| Thesis Title: | A Study on the Classification Performance of Various Approaches for Processing Imbalanced Data (不同處理方式對不平衡資料分類效能影響之研究) |
| Advisor: | Wong, Tzu-Tsung (翁慈宗) |
| Degree: | Master |
| Department: | Department of Industrial and Information Management (in-service master's program), College of Management |
| Year of Publication: | 2023 |
| Graduation Academic Year: | 111 |
| Language: | Chinese |
| Number of Pages: | 44 |
| Keywords: | imbalanced data, SMOTE, RandomUnderSampling, F-measure, AUC |
In binary classification with imbalanced data, the number of majority-class instances far exceeds the number of minority-class instances, so classification models tend to favor the majority class and overlook the more important minority class. This leads to poor classification performance that fails to meet users' needs. In light of this, this study adjusts misclassification costs and the class distribution of the data to determine which approach is more effective for classifying imbalanced data. Three classification methods are used, and in addition to adjusting misclassification costs, two procedures are compared: grouping the data before adjusting it, and adjusting the data before grouping it. The goal is to identify the differences between these approaches and to examine whether those differences could lead users to misjudge the results.

The experiments use 20 data sets. The results show that adjusting the data set before grouping can lead users to misjudge the analysis results, so it must be applied with great caution. Two of the three classification methods did not perform well. Furthermore, when handling imbalanced data, given the options of adjusting misclassification costs and the two distribution-adjustment techniques SMOTE and RandomUnderSampling, SMOTE is the recommended approach.
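The comparison described in the abstract can be illustrated with a minimal sketch. This is not the author's code: it assumes that "grouping" refers to splitting the data into stratified cross-validation folds, and the imbalanced-learn library, the decision tree classifier, and the synthetic data set are illustrative choices, not the thesis's setup. It evaluates the three handling strategies named above (cost adjustment via class weights, SMOTE, and RandomUnderSampling) with F-measure and AUC, and contrasts resampling inside each training fold ("group first, then adjust") with resampling the whole data set before splitting ("adjust first, then group"), the ordering the thesis warns against.

```python
# Minimal sketch comparing three ways of handling imbalanced binary data,
# assuming scikit-learn and imbalanced-learn (not the author's original code).
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline            # pipeline that accepts samplers
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# A synthetic imbalanced data set stands in for the 20 data sets used in the thesis.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scoring = ["f1", "roc_auc"]                       # F-measure and AUC

candidates = {
    # 1. Adjust misclassification costs through class weights (cost-sensitive learning).
    "cost":  DecisionTreeClassifier(class_weight="balanced", random_state=0),
    # 2. Oversample the minority class with SMOTE inside each training fold.
    "smote": Pipeline([("res", SMOTE(random_state=0)),
                       ("clf", DecisionTreeClassifier(random_state=0))]),
    # 3. Randomly undersample the majority class inside each training fold.
    "under": Pipeline([("res", RandomUnderSampler(random_state=0)),
                       ("clf", DecisionTreeClassifier(random_state=0))]),
}

for name, model in candidates.items():
    # Resampling happens inside the pipeline, so it touches only the training
    # part of each fold ("group first, then adjust the data").
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    print(f"{name:6s} F1={scores['test_f1'].mean():.3f} "
          f"AUC={scores['test_roc_auc'].mean():.3f}")

# The ordering the thesis cautions against: resampling the whole data set first
# and splitting into folds afterwards ("adjust first, then group").
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
leaky = cross_validate(DecisionTreeClassifier(random_state=0), X_res, y_res,
                       cv=cv, scoring=scoring)
print(f"leaky  F1={leaky['test_f1'].mean():.3f} "
      f"AUC={leaky['test_roc_auc'].mean():.3f}")
```

In the leaky variant, synthetic minority points generated from the full data set can end up in the test folds, so the reported F1 and AUC no longer reflect performance on genuinely unseen data, which is one way the "adjust first, then group" order can mislead users.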
Airola, A., Pahikkala, T., Waegeman, W., Baets, B.D., and Salakoski, T. (2011). An experimental comparison of cross-validation techniques for estimating the area under the ROC curve. Computational Statistics & Data Analysis, 55, 1828-1844.
Altincay, H. and Ergun, C. (2004). Clustering based undersampling for improving speaker verification decisions using AdaBoost. Lecture Notes in Computer Science, 3138, 698-706.
Altman, N. S. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3), 175-185.
Bergstra, J. and Bengio, Y. (2012). Random search for hyperparameter optimization. Journal of Machine Learning Research, 13, 281-305.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
Dewancker, I., McCourt, M., and Clark, S. (2018). Bayesian optimization primer. SigOpt.
Freund, Y. and Schapire, R.E. (1996). Experiments with a new boosting algorithm. Proceedings of the Thirteenth International Conference on Machine Learning, 148-156.
Garcia, V., Mollineda, R. A., and Sanchez, J. S. (2014). A bias correction function for classification performance assessment in two-class imbalanced problems. Knowledge-Based Systems, 59, 66-74.
He, H., Bai, Y., Garcia, E. A., and Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. Proceedings of the 2008 IEEE International Joint Conference on Neural Networks, 1322-1328.
Kohavi, R., and Provost, F. (1998). Confusion matrix. Machine Learning, 30(2-3), 271-274.
Kubat, M. and Matwin, S. (1997). Addressing the curse of imbalanced training sets: one-sided selection. Proceedings of the Fourteenth International Conference on Machine Learning, 179-186.
Kubat, M., Holte, R. C., and Matwin, S. (1998). Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30, 195-215.
Lewis, D. D. and Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning. Proceedings of the 11th International Conference on Machine Learning, 144-156.
Lewis, D.D. (1995). Evaluating and optimizing autonomous text classification systems. Proceedings of the Eighteenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, 246-254.
Liu, X.Y., Wu, J., and Zhou, Z.H. (2009). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2), 539-550.
Lin, Y., Dong, H., Wang, H., and Zhang, T. (2022). Bayesian invariant risk minimization. Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16021-16030.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
Sáez, J. A., Luengo, J., Stefanowski, J., and Herrera, F. (2015). Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Information Sciences, 291, 184-203.
Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794.
Zhang, Y. P., Zhang, L. N., and Wang, Y. C. (2010). Cluster-based majority under-sampling approaches for class imbalance learning. Proceedings of the 2nd IEEE International Conference on Information and Financial Engineering, 400-404.