
Graduate Student: 翁大淞 (Weng, Ta-Sung)
Thesis Title: 探討不平衡率對袋裝法和提升法分類效能的影響 (The Impact of Imbalanced Ratios on the Performance of Bagging and Boosting Algorithms)
Advisor: 翁慈宗 (Wong, Tzu-Tsung)
Degree: Master
Department: Department of Industrial and Information Management (In-service Master's Program), College of Management
Publication Year: 2024
Graduation Academic Year: 112 (2023-2024)
Language: Chinese
Number of Pages: 59
Chinese Keywords: 不平衡率 (imbalanced ratio), 集成式學習 (ensemble learning), 過抽樣 (over-sampling), 隨機欠抽樣 (random under-sampling), 成本敏感 (cost-sensitive)
English Keywords: Bagging, Boosting, Imbalanced ratio, Cost-sensitive method, Random under-sampling, SMOTE
Views: 88; Downloads: 3
  • Imbalanced data are a common type of data, and the degree of imbalance is expressed by the imbalanced ratio, defined as the number of majority-class instances divided by the number of minority-class instances. Because traditional classification models cannot effectively identify the minority-class instances, misclassification often incurs very large costs in practice, as in credit card fraud or defective-product recalls, so classifying imbalanced data effectively has long been an important issue in data mining. According to previous studies, bagging and boosting, two ensemble learning methods, can handle imbalanced data effectively, and pairing them with various techniques for processing imbalanced data sets can further improve classification performance. However, whether the performance of such method combinations changes with the imbalanced ratio has rarely been examined in the literature. This study therefore combines bagging and boosting with three techniques commonly used in recent years for processing imbalanced data sets: synthetic minority over-sampling (SMOTE), random under-sampling, and cost-sensitive learning. We observe their classification performance on imbalanced data and use statistical tests to examine whether the performance of these method combinations is affected by changes in the imbalanced ratio, providing a reference for applying these method combinations to imbalanced data in future research.
    This study selected 63 data sets and divided them evenly into three groups according to whether their imbalanced ratios lie between 0 and 10, between 10 and 20, or above 20. The performance of each method combination was measured by AUC, G-mean, and MCC and analyzed with statistical tests. The results show that when the imbalanced ratio is between 0 and 10, no method combination is clearly better or worse than the others; when the imbalanced ratio is between 10 and 20, bagging combined with SMOTE performs best; and when the imbalanced ratio is above 20, bagging combined with SMOTE again has a relative performance advantage. Overall, bagging combined with SMOTE is advantageous for classifying imbalanced data.
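
    As a rough illustration of the method combinations described in the abstract above, the sketch below builds the six pairings of bagging and boosting with SMOTE, random under-sampling, and cost-sensitive weighting. It is not the thesis's implementation: the choice of scikit-learn and imbalanced-learn, the decision-tree base learner, and the parameter values (n_estimators, class_weight) are assumptions made for the example, and the estimator parameter name assumes a recent scikit-learn release.

# A minimal sketch (not the thesis's actual code) of the six method combinations:
# {bagging, boosting} x {SMOTE, random under-sampling, cost-sensitive}.
# Library and parameter choices are illustrative assumptions.
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

def make_combinations(random_state=0):
    # Cost-sensitive learning is approximated by class weighting in the base tree;
    # the two resampling methods are applied to the training folds inside a pipeline.
    plain_tree = DecisionTreeClassifier(random_state=random_state)
    weighted_tree = DecisionTreeClassifier(class_weight="balanced", random_state=random_state)

    ensembles = {
        "bagging": lambda tree: BaggingClassifier(estimator=tree, n_estimators=50,
                                                  random_state=random_state),
        "boosting": lambda tree: AdaBoostClassifier(estimator=tree, n_estimators=50,
                                                    random_state=random_state),
    }
    samplers = {
        "SMOTE": lambda: SMOTE(random_state=random_state),
        "RUS": lambda: RandomUnderSampler(random_state=random_state),
    }

    combos = {}
    for ens_name, build_ensemble in ensembles.items():
        for smp_name, build_sampler in samplers.items():
            # Resampling is fitted only on training data when used inside the pipeline.
            combos[f"{ens_name}+{smp_name}"] = Pipeline(
                [("sampler", build_sampler()), ("classifier", build_ensemble(plain_tree))])
        combos[f"{ens_name}+cost-sensitive"] = build_ensemble(weighted_tree)
    return combos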

    The instances with the majority class value in a data set are called negative instances, and the others are positive instances. The imbalanced ratio of a data set equals the number of negative instances divided by the number of positive instances. Imbalanced data are common in real applications, and traditional classification algorithms tend to favor negative instances. Because wrong predictions on positive instances are costly, ensemble algorithms such as bagging and boosting have been proposed to classify imbalanced data. Combining these ensemble algorithms with data-level methods for processing imbalanced data can improve classification performance. However, no previous study has examined the impact of imbalanced ratios on the effectiveness of those methods. In this study, random under-sampling, SMOTE, and cost-sensitive methods were combined with bagging and boosting algorithms to investigate whether the imbalanced ratio should be considered when processing imbalanced data. The performance of a classification algorithm is evaluated by G-mean, AUC, and MCC. The nonparametric sign test is employed to analyze the experimental results obtained from 63 data sets that are divided into three groups based on their imbalanced ratios. The combinations of ensemble algorithms and data-level methods do not differ significantly in performance when the imbalanced ratio is less than ten. When the imbalanced ratio is larger than or equal to ten, combining the bagging algorithm with SMOTE generally outperforms the other combinations.
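
    The evaluation protocol summarized in the abstract (grouping data sets by imbalanced ratio, scoring each combination with AUC, G-mean, and MCC, and comparing combinations with a sign test) can be sketched as follows. The helper names and the treatment of the group boundary at an imbalanced ratio of exactly 20 are assumptions for illustration; the metric and test calls are standard scikit-learn, imbalanced-learn, and SciPy functions, not code taken from the thesis.

# A minimal sketch of the evaluation protocol: group data sets by imbalanced ratio,
# score with AUC, G-mean, and MCC, and compare two method combinations with a sign test.
# Binary class labels coded as 0/1 are assumed.
import numpy as np
from sklearn.metrics import roc_auc_score, matthews_corrcoef
from imblearn.metrics import geometric_mean_score
from scipy.stats import binomtest

def imbalanced_ratio(y):
    # Majority-class count divided by minority-class count.
    counts = np.bincount(y)
    return counts.max() / counts.min()

def ratio_group(ir):
    # The three groups used in the study; placing a ratio of exactly 20 in the
    # middle group is an assumption, since the abstract does not specify it.
    if ir < 10:
        return "0-10"
    return "10-20" if ir <= 20 else ">20"

def scores(y_true, y_pred, y_prob):
    # AUC uses predicted probabilities of the positive class; G-mean and MCC use hard labels.
    return {"AUC": roc_auc_score(y_true, y_prob),
            "G-mean": geometric_mean_score(y_true, y_pred),
            "MCC": matthews_corrcoef(y_true, y_pred)}

def sign_test(scores_a, scores_b):
    # Two-sided sign test across data sets: ties are dropped, and the number of
    # wins of method A over method B is compared with a fair coin (p = 0.5).
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    ties = sum(a == b for a, b in zip(scores_a, scores_b))
    return binomtest(wins, n=len(scores_a) - ties, p=0.5).pvalue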

    Table of Contents
    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1  Introduction
      1.1  Research Background and Motivation
      1.2  Research Objectives
      1.3  Thesis Organization
    Chapter 2  Literature Review
      2.1  Imbalanced Data
      2.2  Ensemble Classification Methods and Data Set Processing Techniques
      2.3  Selection of Evaluation Measures
      2.4  Summary
    Chapter 3  Research Methods
      3.1  Data Set Grouping and Cross-Validation
      3.2  Data Set Processing Techniques
      3.3  Classification Model Construction
      3.4  Classification Performance Analysis
    Chapter 4  Empirical Analysis
      4.1  Data Set Information
      4.2  Statistical Test Results for AUC
      4.3  Statistical Test Results for G-mean
      4.4  Statistical Test Results for MCC
      4.5  Summary
    Chapter 5  Conclusions and Suggestions
      5.1  Conclusions
      5.2  Suggestions and Future Development
    References
    Appendix 1  AUC Measurement Results
    Appendix 2  G-mean Measurement Results
    Appendix 3  MCC Measurement Results

    Chawla, N. V. (2010). Data mining for imbalanced datasets: An overview. In Data Mining and Knowledge Discovery Handbook (2nd ed.). Springer.
    Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
    Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., and Herrera, F. (2012). A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4), 463-484.
    Garcia, V., Marques, A. I., and Sanchez, J. S. (2019). Exploring the synergetic effects of sample types on the performance of ensembles for credit risk and corporate bankruptcy prediction. Information Fusion, 47, 88-101.
    Garcia, V., Sanchez, J. S., and Mollineda, R. A. (2012). On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowledge-Based Systems, 25(1), 13-21.
    Gong, J. and Kim, H. (2017). RHSBoost: Improving classification performance in imbalance data. Computational Statistics and Data Analysis, 111, 1-13.
    Guo, H. X., Li, Y. J., Shang, J., Gu, M. Y., Huang, Y. Y., and Bing, G. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73, 220-239.
    Hancock, J. T., Khoshgoftaar, T. M., and Johnson, J. M. (2023). Evaluating classifier performance with highly imbalanced Big Data. Journal of Big Data, 10(1), 42.
    He, H. B. and Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284.
    Juez-Gil, M., Arnaiz-Gonzalez, A., Rodriguez, J. J., and Garcia-Osorio, C. (2021). Experimental evaluation of ensemble classifiers for imbalance in Big Data. Applied Soft Computing, 108, 107447.
    Kaur, H., Pannu, H. S., and Malhi, A. K. (2019). A systematic review on imbalanced data challenges in machine learning: Applications and solutions. ACM Computing Surveys, 52(4), 79.
    Khoshgoftaar, T. M., Van Hulse, J., and Napolitano, A. (2011). Comparing boosting and bagging techniques with noisy and imbalanced data. IEEE Transactions on Systems, Man, and Cybernetics, Part A (Systems and Humans), 41(3), 552-568.
    Louk, M. H. L. and Tama, B. A. (2022). Revisiting gradient boosting-based approaches for learning imbalanced data: A case of anomaly detection on power grids. Big Data and Cognitive Computing, 6(2), 41.
    Menardi, G. and Torelli, N. (2014). Training and assessing classification rules with imbalanced data. Data Mining and Knowledge Discovery, 28(1), 92-122.
    Olowookere, T. A. and Adewale, O. S. (2020). A framework for detecting credit card fraud with cost-sensitive meta-learning ensemble approach. Scientific African, 8, e00464.
    Patel, H., Rajput, D. S., Reddy, G. T., Iwendi, C., Bashir, A. K., and Jo, O. (2020). A review on classification of imbalanced data for wireless sensor networks. International Journal of Distributed Sensor Networks, 16(4), 1550147720916404.
    Prati, R. C., Batista, G., and Silva, D. F. (2015). Class imbalance revisited: A new experimental setup to assess the performance of treatment methods. Knowledge and Information Systems, 45(1), 247-270.
    Ri, J. and Kim, H. (2020). G-mean based extreme learning machine for imbalance learning. Digital Signal Processing, 98, 102637.
    Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., and Napolitano, A. (2010). RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics, Part A (Systems and Humans), 40(1), 185-197.
    Sun, Z. B., Song, Q. B., Zhu, X. Y., Sun, H. L., Xu, B. W., and Zhou, Y. M. (2015). A novel ensemble method for classifying imbalanced data. Pattern Recognition, 48(5), 1623-1637.
    Thabtah, F., Hammoud, S., Kamalov, F., and Gonsalves, A. (2020). Data imbalance in classification: Experimental evaluation. Information Sciences, 513, 429-441.
    Van Hulse, J. and Khoshgoftaar, T. (2009). Knowledge discovery from imbalanced and noisy data. Data and Knowledge Engineering, 68(12), 1513-1542.
    Wang, S., Minku, L. L., and Yao, X. (2015). Resampling-based ensemble methods for online class imbalance learning. IEEE Transactions on Knowledge and Data Engineering, 27(5), 1356-1368.

    Full-text access: on campus, immediately available; off campus, immediately available.