
Student: Huang, Yu-Erh (黃愉兒)
Thesis Title: Combining Sampling Techniques and Ensemble Learning with Randomly Generated Base Models for Classifying Imbalanced Data (抽樣方法結合隨機生成法應用於分類不平衡資料)
Advisor: Wong, Tzu-Tsung (翁慈宗)
Degree: Master
Department: Institute of Information Management, College of Management
Year of Publication: 2025
Graduation Academic Year: 113 (2024-25)
Language: Chinese
Number of Pages: 69
Keywords: data sampling, ensemble learning, imbalanced data, naive Bayes classifier, randomly generated base models
  • Imbalanced data classification is one of the most important topics in machine learning, and past research has widely combined data-level methods with ensemble learning to handle it. Mainstream ensemble methods draw the training data for all base models from the same dataset, so what can be learned is constrained by the original data distribution. Randomly generating base models escapes this constraint, and mixing it with bagging can improve binary imbalanced classification performance. This study therefore focuses on binary imbalanced data classification and combines data sampling with random model generation: at the data level, random undersampling and synthetic minority oversampling are used to alter the data distribution; at the algorithm level, random generation is introduced alongside ensemble methods such as bagging. The experimental results show that adding random generation raises the G-mean, F-measure, and MCC of random undersampling by about 4%, and those of synthetic minority oversampling by about 8%. The improvement is less evident on datasets with high noise or extremely few positive instances, and random generation requires more computation time than the other ensemble methods.

    Imbalanced data classification is a critical issue in machine learning, and many studies have therefore proposed combining data-level techniques with ensemble learning methods to address it. However, most ensemble methods induce their base models from the same training data, which limits performance to what the original data distribution allows. Ensemble algorithms with randomly generated base models have been proposed to overcome this limitation. This study focuses on binary imbalanced data classification by integrating data sampling techniques, such as random undersampling and synthetic minority oversampling, with ensemble algorithms that randomly generate base models. The experimental results show that combining random undersampling with randomly generated base models improves G-mean, F-measure, and MCC by approximately 4%, and that combining synthetic minority oversampling with random generation leads to an improvement of around 8% in these metrics. However, the improvement becomes less noticeable on datasets with high levels of noise or very few positive instances. Furthermore, the computational cost of the ensemble algorithms with randomly generated base models is higher than that of the other ensemble methods.
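    To make the workflow in the abstract concrete, here is a minimal sketch in Python, assuming scikit-learn and imbalanced-learn are available. It pairs random undersampling or SMOTE with a bagged Gaussian naive Bayes ensemble and reports G-mean, F-measure, and MCC. The bagging ensemble and the synthetic dataset are stand-in assumptions: the thesis's randomly generated base models and its particle swarm component are not reproduced here.

```python
# A minimal sketch of the sampling + ensemble workflow described above.
# Assumptions: scikit-learn and imbalanced-learn; a synthetic dataset; and a
# bagged Gaussian naive Bayes ensemble standing in for the thesis's randomly
# generated base models, which are not reproduced here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import f1_score, matthews_corrcoef, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic binary data with roughly 10% positive instances.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, sampler in [("RUS", RandomUnderSampler(random_state=0)),
                      ("SMOTE", SMOTE(random_state=0))]:
    # Rebalance only the training split, then fit the ensemble on it.
    X_bal, y_bal = sampler.fit_resample(X_tr, y_tr)
    clf = BaggingClassifier(estimator=GaussianNB(),  # base_estimator= in scikit-learn < 1.2
                            n_estimators=50, random_state=0).fit(X_bal, y_bal)
    pred = clf.predict(X_te)

    # G-mean = sqrt(sensitivity * specificity); F-measure and MCC as in the abstract.
    sens = recall_score(y_te, pred, pos_label=1)
    spec = recall_score(y_te, pred, pos_label=0)
    print(name,
          "G-mean=%.3f" % np.sqrt(sens * spec),
          "F-measure=%.3f" % f1_score(y_te, pred),
          "MCC=%.3f" % matthews_corrcoef(y_te, pred))
```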

    Abstract I
    Acknowledgements VI
    Table of Contents VII
    List of Tables IX
    List of Figures X
    Chapter 1 Introduction 1
      1.1 Research Background and Motivation 1
      1.2 Research Objectives 2
      1.3 Research Framework 3
    Chapter 2 Literature Review 4
      2.1 Imbalanced Data 4
      2.2 Methods for Handling Imbalanced Data 5
        2.2.1 Data-Level Methods 5
        2.2.2 Algorithm-Level Methods 7
      2.3 Ensemble Methods 8
        2.3.1 Bagging 9
        2.3.2 Boosting 10
        2.3.3 Stacking 11
        2.3.4 Random Generation 11
      2.4 Combining Data-Level Methods with Ensemble Methods 12
        2.4.1 Bagging-Based Hybrid Ensemble Methods 13
        2.4.2 Boosting-Based Hybrid Ensemble Methods 14
        2.4.3 Hybrid Methods Combining Multiple Sampling and Ensemble Methods 15
        2.4.4 Comparison of Data Processing Methods Combined with Ensemble Learning 15
      2.5 Evaluation Metrics 17
      2.6 Summary 19
    Chapter 3 Research Methodology 21
      3.1 Research Process 23
      3.2 Data Preprocessing and Data Partitioning 23
      3.3 Balancing Data by Sampling 25
        3.3.1 Random Undersampling 25
        3.3.2 Synthetic Minority Oversampling 25
      3.4 Randomly Generated Naive Bayes Classification Models 26
      3.5 Particle Swarm Optimization 27
      3.6 Evaluation of Experimental Results 30
        3.6.1 Methods for Comparison 31
        3.6.2 Evaluation Measures 32
    Chapter 4 Empirical Study 34
      4.1 Datasets 34
      4.2 Tests of Evaluation Measures as Particle Swarm Fitness Values 36
      4.3 Undersampling Results 37
      4.4 Oversampling Results 41
      4.5 Discussion of Computational Efficiency 45
      4.6 Summary 47
    Chapter 5 Conclusions and Suggestions 48
      5.1 Conclusions 48
      5.2 Suggestions 49
    References 50


    Full-Text Availability: on campus: immediately available; off campus: immediately available