| Author: | 鄭宇秦 Cheng, Yu-Chin |
|---|---|
| Thesis Title: | 不平衡資料於不同層面調整方法下的效能評估之研究 (Evaluate Classification Efficiency in Different Adjustment of Imbalanced Data) |
| Advisor: | 翁慈宗 Wong, Tzu-Tsung |
| Degree: | Master |
| Department: | 管理學院 - 資訊管理研究所 (Institute of Information Management, College of Management) |
| Publication Year: | 2020 |
| Academic Year: | 108 |
| Language: | Chinese |
| Pages: | 59 |
| Keywords (Chinese): | 成本敏感分類方法、不平衡資料、超抽樣、分類器效能評估、欠抽樣 |
| Keywords (English): | Cost-sensitive classification, imbalanced data, oversampling, performance evaluation, undersampling |
The data collected in many real applications are imbalanced; that is, the class distribution of the data set is skewed. For example, credit-card holders with bad credit records are few, and only a small proportion of patients have malignant tumors. For such imbalanced data, how to identify instances of the minority class, also called positive instances, is a crucial issue, because most classification algorithms cannot achieve high prediction accuracy on positive instances. There are two main approaches to handling imbalanced data: the data-level approach and the algorithm-level approach. However, no objective procedure exists for comparing the classification performance of these two approaches, so this study proposes a procedure for this task. Decision tree, logistic regression, and linear support vector machine are adopted as the classification models. At the data level, oversampling and undersampling methods are applied to balance the class distribution of a data set; at the algorithm level, misclassification costs are set during model training so that the learned model does not overly favor the majority class. The evaluation measures are the F-measure and the area under the ROC curve (AUC). The experimental results on ten imbalanced data sets show that AUC is not a sensitive measure for performance evaluation, and whether the data-level or the algorithm-level approach performs better depends on the characteristics of the data set and the classification algorithm.
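The thesis itself provides no code, so the following is only a minimal sketch of the kind of comparison the abstract describes, written in Python with scikit-learn as an assumed toolset. Random oversampling of the minority class stands in for the oversampling and undersampling methods studied, and `class_weight="balanced"` stands in for the misclassification-cost settings; the thesis's actual procedures and parameter choices may differ.

```python
# Illustrative sketch (not the thesis's implementation): compare a data-level
# adjustment (random oversampling of the minority class) with an
# algorithm-level adjustment (misclassification-cost weighting via
# class_weight) on a synthetic imbalanced data set, scoring both with the
# F-measure and AUC for three classifiers.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score, roc_auc_score

# Synthetic data set with a 95:5 class ratio (class 1 is the minority/positive class).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

def oversample_minority(X, y):
    """Data-level adjustment: randomly duplicate minority instances until
    both classes have the same number of training instances."""
    X_maj, y_maj = X[y == 0], y[y == 0]
    X_min, y_min = X[y == 1], y[y == 1]
    X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                                  n_samples=len(y_maj), random_state=0)
    return np.vstack([X_maj, X_min_up]), np.hstack([y_maj, y_min_up])

def evaluate(model, X_te, y_te):
    """F-measure on hard predictions; AUC on class scores or probabilities."""
    y_pred = model.predict(X_te)
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_te)[:, 1]
    else:                                   # LinearSVC has no predict_proba
        scores = model.decision_function(X_te)
    return f1_score(y_te, y_pred), roc_auc_score(y_te, scores)

classifiers = {
    "decision tree": lambda **kw: DecisionTreeClassifier(random_state=0, **kw),
    "logistic regression": lambda **kw: LogisticRegression(max_iter=1000, **kw),
    "linear SVM": lambda **kw: LinearSVC(**kw),
}

X_bal, y_bal = oversample_minority(X_tr, y_tr)
for name, make_clf in classifiers.items():
    # Data level: train on the rebalanced sample, no cost adjustment.
    f1_d, auc_d = evaluate(make_clf().fit(X_bal, y_bal), X_te, y_te)
    # Algorithm level: keep the skewed sample, but weight minority-class
    # errors more heavily (class_weight acts as a misclassification cost).
    f1_a, auc_a = evaluate(make_clf(class_weight="balanced").fit(X_tr, y_tr),
                           X_te, y_te)
    print(f"{name:20s} data-level F1={f1_d:.3f} AUC={auc_d:.3f} | "
          f"algorithm-level F1={f1_a:.3f} AUC={auc_a:.3f}")
```

Running a sketch like this on several data sets typically shows the pattern the abstract reports: the F-measure reacts strongly to the choice of adjustment, while AUC values change comparatively little, and neither adjustment dominates across all classifiers.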