| Graduate Student: | 蔡函璇 Tsai, Han-Hsuan |
|---|---|
| Thesis Title: | 混合資料層面與演算法層面技術之集成分類方法 (An Ensemble Algorithm with Data-level and Algorithm-level Approaches) |
| Advisor: | 翁慈宗 Wong, Tzu-Tsung |
| Degree: | Master |
| Department: | Institute of Information Management, College of Management |
| Year of Publication: | 2022 |
| Academic Year of Graduation: | 110 (ROC calendar) |
| Language: | Chinese |
| Number of Pages: | 51 |
| Keywords (Chinese): | 集成方法, 不平衡資料, 過取樣, 欠取樣, 成本敏感方法 |
| Keywords (English): | Cost-sensitive strategy, ensemble algorithm, imbalanced data, oversampling, undersampling |
In imbalanced data, the minority class is usually the one that most studies care about. Previous methods for classifying imbalanced data with a single model fall into the data-level and algorithm-level approaches, each with its own limitations. Ensemble methods, which combine multiple models, require that every base model achieve an accuracy above 0.5 on each class; when ordinary learning methods are applied to imbalanced data, the accuracy on the minority class is often too low for an ensemble to be applicable, so this study proposes a new ensemble method to resolve this problem. Although many ensemble methods for learning from imbalanced data already exist, this study draws on single-model techniques and combines three strategies for inducing base models from the data level and the algorithm level: oversampling, undersampling, and the cost-sensitive strategy. Base models built by the three strategies are aggregated into an ensemble according to specified proportions. Since the three strategies have complementary strengths and weaknesses, mixing them is expected to raise the diversity of the ensemble, a condition that favors better ensemble performance. Experiments on 40 data sets show that the proposed method significantly outperforms single-strategy methods on most measures, achieves classification performance comparable to the recently proposed CBWKELM method, and on average requires less than one tenth of CBWKELM's training time.
Analysts pay attention to the correct prediction of instances belonging to the minority class in an imbalanced data set. There are two common approaches for inducing a single model to classify imbalanced data: the data-level approach and the algorithm-level approach. One requirement of ensemble learning, a technique that combines the predictions of multiple base models, is that the accuracy of each base model be greater than 0.5. Since the accuracy of a base model on the minority class is generally less than 0.5, ensemble learning cannot be applied directly to classifying imbalanced data. This research proposes an ensemble algorithm that combines base models induced by the oversampling and undersampling strategies of the data-level approach and by the cost-sensitive strategy of the algorithm-level approach. The proportions of base models for the three strategies are set as parameters, the idea being that base models induced by different strategies should be more diverse. The proposed method is tested on 40 imbalanced data sets, and the experimental results show that it significantly outperforms algorithms using only a single strategy on most evaluation metrics. Compared with the state-of-the-art algorithm CBWKELM, the proposed algorithm achieves comparable performance while requiring, on average, less than one tenth of the training time.
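The hybrid ensemble described above can be sketched in a few lines. The following is a minimal illustration, not the thesis's exact algorithm: the decision-tree base learner, fully balanced resampling ratios, inverse-frequency instance costs, equal one-third strategy proportions, and unweighted majority voting are all illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def random_oversample(X, y):
    """Duplicate minority-class instances until every class matches the largest."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([
        np.concatenate([np.flatnonzero(y == c),
                        rng.choice(np.flatnonzero(y == c), n_max - cnt, replace=True)])
        for c, cnt in zip(classes, counts)])
    return X[idx], y[idx]

def random_undersample(X, y):
    """Drop majority-class instances until every class matches the smallest."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([rng.choice(np.flatnonzero(y == c), n_min, replace=False)
                          for c in classes])
    return X[idx], y[idx]

def inverse_frequency_weights(y):
    """Cost-sensitive strategy: weight each instance by 1 / its class frequency."""
    classes, counts = np.unique(y, return_counts=True)
    freq = dict(zip(classes, counts / len(y)))
    return np.array([1.0 / freq[c] for c in y])

def fit_hybrid_ensemble(X, y, n_models=9, proportions=(1/3, 1/3, 1/3)):
    """Train n_models trees, split among the three strategies by `proportions`."""
    counts = [int(round(p * n_models)) for p in proportions]
    models = []
    for strategy, k in zip(("over", "under", "cost"), counts):
        for _ in range(k):
            tree = DecisionTreeClassifier(
                max_depth=3, random_state=int(rng.integers(1 << 31)))
            if strategy == "over":
                tree.fit(*random_oversample(X, y))
            elif strategy == "under":
                tree.fit(*random_undersample(X, y))
            else:  # cost-sensitive: no resampling, weighted instances instead
                tree.fit(X, y, sample_weight=inverse_frequency_weights(y))
            models.append(tree)
    return models

def predict_majority(models, X):
    """Combine base-model predictions by unweighted majority vote."""
    votes = np.stack([m.predict(X) for m in models])
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```

Because the oversampled, undersampled, and cost-weighted trees see the training data differently, their errors are less correlated than in a single-strategy ensemble, which is the diversity argument the abstract relies on.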
Alcala-Fdez, J., Fernandez, A., Luengo, J., Derrac, J., Garcia, S., Sanchez, L., & Herrera, F. (2011). KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework. Journal of Multiple-Valued Logic and Soft Computing, 17(2-3), 255-287.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
Chawla, N. V., Lazarevic, A., Hall, L. O., & Bowyer, K. W. (2003). SMOTEBoost: Improving prediction of the minority class in boosting. Proceedings of 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, 107-119. Cavtat-Dubrovnik, Croatia.
Chen, L., Fang, B., Shang, Z., & Tang, Y. (2018). Tackling class overlap and imbalance problems in software defect prediction. Software Quality Journal, 26(1), 97-125.
Choudhary, R., & Shukla, S. (2021). A clustering based ensemble of weighted kernelized extreme learning machine for class imbalance learning. Expert Systems with Applications, 164, Article 114041.
Dietterich, T. G. (2000). Ensemble methods in machine learning. Proceedings of 1st International Workshop on Multiple Classifier Systems, 1-15. Cagliari, Italy.
Díez-Pastor, J. F., Rodríguez, J. J., García-Osorio, C. I., & Kuncheva, L. I. (2015). Diversity techniques improve the performance of the best imbalance learning ensembles. Information Sciences, 325, 98-117.
Dua, D., & Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. University of California, Irvine, School of Information and Computer Sciences.
Fernández, A., García, S., Galar, M., Prati, R. C., Krawczyk, B., & Herrera, F. (2018). Learning from Imbalanced Data Sets. Springer Nature Switzerland AG.
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121(2), 256-285.
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., & Herrera, F. (2012). A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4), 463-484.
Guo, H. X., Li, Y. J., Shang, J., Gu, M. Y., Huang, Y. Y., & Bing, G. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73, 220-239.
Hou, Y., Li, L., Li, B., & Liu, J. (2019). An anti-noise ensemble algorithm for imbalance classification. Intelligent Data Analysis, 23(6), 1205-1217.
Iranmehr, A., Masnadi-Shirazi, H., & Vasconcelos, N. (2019). Cost-sensitive support vector machines. Neurocomputing, 343, 50-64.
Lemaitre, G., Nogueira, F., Aridas, C. K., & Oliveira, D. V. R. (2016). Imbalanced dataset for benchmarking [Data set]. Zenodo. https://doi.org/10.5281/zenodo.61452
Li, H. X., Feng, A., Lin, B., Su, H. C., Liu, Z. X., Duan, X. L., Pu, H. B., & Wang, Y. F. (2021). A novel method for credit scoring based on feature transformation and ensemble model. PeerJ Computer Science, 18, Article e579.
Liu, X. Y., Wu, J. X., & Zhou, Z. H. (2009). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2), 539-550.
Ng, W., Xu, S., Zhang, J., Tian, X., Rong, T., & Kwong, S. (2020). Hashing-based undersampling ensemble for imbalanced pattern classification problems. IEEE Transactions on Cybernetics, 52(2), 1-11.
Raghuwanshi, B. S., & Shukla, S. (2018). UnderBagging based reduced kernelized weighted extreme learning machine for class imbalance learning. Engineering Applications of Artificial Intelligence, 74, 252-270.
Razavi-Far, R., Farajzadeh-Zanajni, M., Wang, B. Y., Saif, M., & Chakrabarti, S. (2021). Imputation-based ensemble techniques for class imbalance learning. IEEE Transactions on Knowledge and Data Engineering, 33(5), 1988-2001.
Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., & Napolitano, A. (2010). RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics, Part A (Systems and Humans), 40(1), 185-197.
Sun, Y., Kamel, M. S., Wong, A. K. C., & Wang, Y. (2007). Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12), 3358-3378.
Sun, Y. M., Wong, A. K. C., & Kamel, M. S. (2009). Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(4), 687-719.
Sun, Z., Song, Q., Zhu, X., Sun, H., Xu, B., & Zhou, Y. (2015). A novel ensemble method for classifying imbalanced data. Pattern Recognition, 48(5), 1623-1637.
Tang, B., & He, H. B. (2017). GIR-based ensemble sampling approaches for imbalanced learning. Pattern Recognition, 71, 306-319.
Ting, K. M. (2002). An instance-weighting method to induce cost-sensitive trees. IEEE Transactions on Knowledge and Data Engineering, 14(3), 659-665.
Wang, L., Zhao, L., Gui, G., Zheng, B. Y., & Huang, R. C. (2017). Adaptive ensemble method based on spatial characteristics for classifying imbalanced data. Scientific Programming, 2017, Article 3704525.
Wang, S., & Yao, X. (2009). Diversity analysis on imbalanced data sets by using ensemble models. Proceedings of 2009 IEEE Symposium on Computational Intelligence and Data Mining, 324-331. Nashville, TN, USA.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241-259.
Yang, P. T., Wu, W. S., Wu, C. C., Shih, Y. N., Hsieh, C. H., & Hsu, J. L. (2021). Breast cancer recurrence prediction with ensemble methods and cost-sensitive learning. Open Medicine, 16(1), 754-768.
Zefrehi, H. G., & Altincay, H. (2020). Imbalance learning using heterogeneous ensembles. Expert Systems with Applications, 142, Article 113005.
Zhao, Y., Shrivastava, A. K., & Tsui, K. L. (2016). Imbalanced classification by learning hidden data structure. IIE Transactions, 48(7), 614-628.
Zong, W., Huang, G.-B., & Chen, Y. (2013). Weighted extreme learning machine for imbalance learning. Neurocomputing, 101, 229-242.