
Graduate Student: Wang, Min (王敏)
Thesis Title: 隨機生成基本模型之集成方法應用於不平衡資料分類
Ensemble Algorithms with Randomly Generated Base Models for Classifying Imbalanced Data
Advisor: Wong, Tzu-Tsung (翁慈宗)
Degree: Master
Department: College of Management - Institute of Information Management
Year of Publication: 2024
Academic Year of Graduation: 112
Language: Chinese
Number of Pages: 66
Chinese Keywords: 二元分類, 集成學習, 不平衡資料, 簡易貝氏分類器, 隨機生成基本模型
Foreign Keywords: Binary classification, ensemble learning, imbalanced data, naive Bayesian classifier, randomly generated base models
    Imbalanced data are common in practical applications, and the literature has proposed a variety of methods for the imbalanced classification problem, among which ensemble learning is one of the most popular. Since mainstream ensemble learning methods generate classification models from training sets drawn from the same source through the learning mechanism of an algorithm, the learning process of the resulting models is constrained to some degree by the distribution of the original data. This study therefore provides a new solution to the imbalanced classification problem through an ensemble method with randomly generated base models. According to previous experimental results, this method skips the usual step of training base models from data and is particularly suitable for prediction on binary data sets, which is typically the setting of imbalanced data. Because the metric used to filter random models is crucial to the performance of the resulting ensemble, this study experimented with different model evaluation metrics, using both a single metric and combinations of two metrics as filtering thresholds. The experimental results show that using two metrics does not noticeably improve the classification performance of the ensemble model, while requiring more computation time. This study therefore conducted an empirical study with three different single-metric threshold settings on 30 data sets. The results show that when imbalanced data hamper the learning process and lead to poor predictions for the minority class, random models have a certain advantage in identifying minority-class instances, so adding random models to the ensemble significantly outperforms the other ensemble methods on most measures. However, consistent with previous findings, the ensemble method with randomly generated base models remains more time-consuming than the other ensemble methods when applied to imbalanced classification.
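    The abstract above outlines the core loop: draw a naive Bayesian model at random rather than fitting it to data, keep it only if it clears a metric threshold, and combine the survivors by voting. A minimal sketch in Python, assuming binary integer-coded features and minority-class recall as the single filtering metric; the ensemble size, threshold value, and metric here are illustrative choices, not the thesis's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_nb_model(n_features, n_values=2, n_classes=2):
    # Draw class priors and conditional probabilities at random,
    # skipping the usual step of estimating them from training data.
    priors = rng.dirichlet(np.ones(n_classes))
    # cond[j][c][v] = P(feature j takes value v | class c)
    cond = rng.dirichlet(np.ones(n_values), size=(n_features, n_classes))
    return priors, cond

def nb_predict(model, X):
    # Standard naive Bayes decision rule on integer-coded features.
    priors, cond = model
    log_post = np.log(priors)[:, None] + sum(
        np.log(cond[j][:, X[:, j]]) for j in range(X.shape[1]))
    return log_post.argmax(axis=0)

def recall_minority(y_true, y_pred, minority=1):
    mask = y_true == minority
    return float((y_pred[mask] == minority).mean())

def build_ensemble(X, y, n_models=25, threshold=0.5, max_tries=10000):
    # Keep a random model only if its minority-class recall on the
    # training data clears the threshold.
    kept = []
    for _ in range(max_tries):
        if len(kept) == n_models:
            break
        model = random_nb_model(X.shape[1])
        if recall_minority(y, nb_predict(model, X)) >= threshold:
            kept.append(model)
    return kept

def ensemble_predict(models, X):
    # Unweighted majority vote over the filtered random models.
    votes = np.stack([nb_predict(m, X) for m in models])
    return (votes.mean(axis=0) > 0.5).astype(int)
```

    Because generation is cheap but filtering requires scoring every candidate on the training data, most of the running time is spent in the filter, which matches the abstract's remark that the method is more time-consuming than conventional ensembles.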

    The data collected in many real-world applications are imbalanced, and techniques for handling imbalanced data have thus been proposed in recent years. Ensemble learning is one of the most popular approaches for this purpose. Since most ensemble algorithms induce base models from instance sets derived from the same training set, the predictions of the base models are not independent. The ensemble models built in this way are likely to favor the majority class and to perform relatively poorly on imbalanced data. In this study, ensemble algorithms that can randomly generate base models are introduced to remedy this deficiency. Both single and double thresholds are set to filter base models, and the experimental results suggest that a single threshold is more efficient and achieves performance similar to double thresholds. A single threshold for filtering base models is thus used to design two ensemble algorithms for processing imbalanced data. The experimental results obtained from 30 imbalanced data sets indicate that the ensemble algorithm with randomly generated base models and the ones produced by the bagging approach significantly outperform the other three ensemble algorithms regardless of the metric employed for performance evaluation. However, the proposed ensemble algorithm is computationally intensive because of the high cost of filtering base models.
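    The abstract stresses that the ranking of ensembles holds "regardless of the metric employed for performance evaluation." Two measures commonly used for imbalanced binary classification are the geometric mean of sensitivity and specificity (G-mean) and the Matthews correlation coefficient (MCC); both can be computed directly from the confusion matrix. The sketch below shows this computation; the thesis's exact metric set is not reproduced on this page, so these two are representative examples:

```python
import math

def confusion(y_true, y_pred, positive=1):
    # Binary confusion-matrix counts, with the minority class
    # conventionally treated as the positive class.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fn, fp, tn

def g_mean(tp, fn, fp, tn):
    # Geometric mean of sensitivity and specificity:
    # high only when both classes are recognized well.
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sens * spec)

def mcc(tp, fn, fp, tn):
    # Matthews correlation coefficient: remains informative
    # even when the class distribution is highly skewed.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # 3 minority, 7 majority
y_pred = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
cm = confusion(y_true, y_pred)            # (2, 1, 1, 6)
print(round(g_mean(*cm), 3))              # 0.756
print(round(mcc(*cm), 3))                 # 0.524
```

    Unlike plain accuracy, both measures punish a classifier that predicts only the majority class: with zero true positives, G-mean is 0 and MCC is at most 0, which is why they are preferred for imbalanced evaluation.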

    Abstract I
    Acknowledgments V
    Table of Contents VI
    List of Tables VIII
    List of Figures IX
    Chapter 1 Introduction 1
      1.1 Research Motivation 1
      1.2 Research Objectives 2
      1.3 Research Framework 3
    Chapter 2 Literature Review 4
      2.1 The Imbalanced Data Problem 4
      2.2 Methods for Handling Imbalanced Data 5
        2.2.1 Data-Level Methods 5
        2.2.2 Algorithm-Level Methods 6
        2.2.3 Cost-Sensitive Learning 7
      2.3 Ensemble Learning 8
        2.3.1 Definition of Ensemble Models and Ensemble Methods 8
        2.3.2 Ensemble Learning for Imbalanced Data 10
      2.4 Evaluation Metrics 12
      2.5 Summary 17
    Chapter 3 Research Methodology 18
      3.1 Research Process 18
      3.2 Data Preprocessing and Partitioning 19
      3.3 Threshold Settings 20
      3.4 Randomly Generated Base Models 22
        3.4.1 Naive Bayesian Classifier 22
        3.4.2 Randomly Generated Naive Bayesian Base Models 23
      3.5 Model Evaluation 23
        3.5.1 Benchmark Methods 25
        3.5.2 Evaluation Measures 26
    Chapter 4 Empirical Study 28
      4.1 Data Sets and Experimental Settings 28
      4.2 Classification Results of Random Models on Imbalanced Data Sets 30
      4.3 Tests of Single- and Double-Metric Threshold Settings 35
      4.4 Comparison with Other Methods 38
      4.5 Execution Time Comparison 45
      4.6 Summary 47
    Chapter 5 Conclusions and Suggestions 49
      5.1 Conclusions 49
      5.2 Suggestions 50
    References 51

