
Graduate Student: Wang, Si-Yang (王斯暘)
Thesis Title: Learning Class-Imbalanced Data with Region Impurity Synthetic Minority Oversampling Technique
Advisor: Li, Der-Chiang (利德江)
Degree: Master
Department: College of Management, Institute of Information Management
Year of Publication: 2020
Graduation Academic Year: 108 (ROC calendar)
Language: English
Number of Pages: 39
Chinese Keywords: class-imbalanced data; synthetic samples; oversampling technique; region impurity; borderline samples
English Keywords: Class-imbalanced, SMOTE, region-impurity, overlapped data
  • Learning from class-imbalanced datasets is a difficult task for researchers, because most traditional machine learning algorithms treat minority-class instances as tolerable errors, so the resulting model classifies all instances into the majority class. To balance the class ratio of a dataset, the Synthetic Minority Oversampling Technique (SMOTE) has been shown to improve classification performance by generating synthetic instances. However, some of these synthetic instances may be noise and instead degrade the classification measures. Subsequent SMOTE-based extensions avoid generating noise by choosing where synthetic instances are placed, yet they cannot fundamentally solve the problem, because SMOTE is built on kNN (k nearest neighbors), and kNN cannot reveal the sample distribution between any two minority instances. In addition, the number of synthetic instances to generate has not been effectively studied, which further affects classification performance. This study therefore proposes a learning method based on a region radius: it first finds the neighbors of each minority instance within a defined region radius, then identifies borderline instances by the class ratio within that radius, since such instances are surrounded by relatively many majority instances and are thus easily misclassified. From these borderline instances, suitable instances are selected for synthesis. Before generating synthetic instances, the sample distribution between two minority instances, termed the region impurity, is computed to avoid synthetic instances being surrounded by majority instances. Finally, synthetic instances are generated within the region radius until the region is approximately balanced. The proposed Region Impurity SMOTE (RIOT) is validated on 12 public datasets, where experiments show better results than SMOTE and most of its extensions on classification metrics such as F-measure and AUC.
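As background for the abstract's discussion, SMOTE's core step interpolates between a minority instance and one of its k nearest minority neighbors. A minimal sketch of that interpolation (not the thesis code; the function name, defaults, and brute-force neighbor search are our own):

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=10, seed=0):
    """SMOTE-style oversampling sketch.

    For each new sample: pick a random minority instance x, one of its
    k nearest minority neighbors x', and return x + u * (x' - x), u ~ U(0, 1).
    """
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise distances among minority instances only (brute force).
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest minority neighbors
    out = []
    for _ in range(n_new):
        i = rng.integers(n)                     # base minority instance
        j = nn[i, rng.integers(min(k, n - 1))]  # one of its neighbors
        u = rng.random()                        # interpolation weight
        out.append(X_min[i] + u * (X_min[j] - X_min[i]))
    return np.asarray(out)
```

Because each synthetic instance lies on the segment between two real minority instances, it stays inside the minority class's local convex regions, which is also why, as the abstract notes, SMOTE can place noise in areas where majority instances sit between the pair.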

    Learning from class-imbalanced data is a tough task that often leads classifiers to fail to identify the minority class. To balance the class ratio, the Synthetic Minority Oversampling Technique (SMOTE) has been shown to improve classification of the minority class by generating synthetic minority instances. However, in some scenarios, SMOTE and its extensions generate noise instances and thereby degrade performance. This is because they were developed based on kNN (k nearest neighbors), which cannot identify the class distribution between a pair of minority instances. Furthermore, the number of synthetic instances to generate remains an open question in this field of study. To overcome these issues, we propose a new algorithm named the Region Impurity synthetic minority Oversampling Technique (RIOT). Specifically, using a region radius, we locate the neighbors of each minority instance and, from the class ratio within the region, identify the relatively hard-to-learn minority instances, which serve as the base instances for sample generation. During generation, a region impurity indicator is systematically calculated to measure the class distribution between each pair of minority instances, and synthetic instances are then generated until the region is approximately balanced. We evaluate RIOT on 12 real-world datasets, where the experiments show that RIOT can perform better than many versions of SMOTE with fewer synthetic instances in terms of several model performance indicators.
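The workflow the abstract describes — region-radius neighborhoods, a class-ratio test for borderline minority instances, an impurity check between a pair of minority instances, and oversampling until the region is approximately balanced — can be illustrated with a highly simplified sketch. The exact definitions are in the thesis full text, which is not part of this record, so every threshold, formula, and name below is an assumption for illustration only:

```python
import numpy as np

def riot_sketch(X_maj, X_min, radius=1.0, impurity_max=0.5, seed=0):
    """Illustrative sketch mirroring the steps named in the abstract.

    1. For each minority instance, collect all instances within `radius`.
    2. Mark it as borderline when majority instances dominate that region.
    3. Estimate a 'region impurity' proxy as the majority fraction around
       the midpoint of the instance and a minority partner; skip pairs
       whose impurity exceeds `impurity_max`.
    4. Interpolate synthetic instances until the region is roughly balanced.
    """
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_maj, X_min])
    y_all = np.array([0] * len(X_maj) + [1] * len(X_min))  # 0 = majority
    synthetic = []
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_all - x, axis=1)
        in_region = (d > 0) & (d <= radius)
        n_maj = np.sum(in_region & (y_all == 0))
        n_min = np.sum(in_region & (y_all == 1))
        if n_maj <= n_min:
            continue                       # not borderline: skip
        # Pick the nearest minority partner of x.
        d_min = np.linalg.norm(X_min - x, axis=1)
        d_min[i] = np.inf
        partner = X_min[np.argmin(d_min)]
        # Impurity proxy: majority fraction near the pair's midpoint.
        mid = (x + partner) / 2
        near = np.linalg.norm(X_all - mid, axis=1) <= radius
        if near.sum() and np.mean(y_all[near] == 0) > impurity_max:
            continue                       # region too impure: would add noise
        # Oversample until the local region is approximately balanced.
        for _ in range(n_maj - n_min):
            u = rng.random()
            synthetic.append(x + u * (partner - x))
    return np.asarray(synthetic)
```

The sketch keeps the two distinctive ideas of the abstract: generation is anchored on hard-to-learn borderline minority instances rather than all of them, and the amount generated per instance is tied to the local class deficit instead of a global oversampling rate.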

    Table of Contents:
    Abstract (Chinese) i
    Abstract ii
    Acknowledgements iii
    List of tables v
    List of figures vi
    1 Introduction 1
      1.1 Background 1
      1.2 Motivation 5
      1.3 Objective 6
      1.4 Architecture 6
    2 Literature review 8
      2.1 The class-imbalanced classification problem 8
      2.2 SMOTE 10
      2.3 Borderline-SMOTE 11
      2.4 Safe-level SMOTE 13
      2.5 Local Neighborhood SMOTE 14
      2.6 Section summary 14
    3 Proposed method 16
      3.1 Notations 16
      3.2 Data preprocessing 17
      3.3 Sample generation 19
      3.4 The algorithm of RIOT 21
    4 Experimental results and discussions 24
      4.1 The experimental environment 24
      4.2 Experimental results 27
      4.3 Result findings 30
    5 Conclusions 32
    References 34

    Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., & Herrera, F. (2011). Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic & Soft Computing, 17.

    Barua, S., Islam, M. M., Yao, X., & Murase, K. (2012). MWMOTE--majority weighted minority oversampling technique for imbalanced data set learning. IEEE transactions on knowledge and data engineering, 26(2), 405-425.

    Batista, G. E., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 6(1), 20-29.

    Bhattacharyya, S., Jha, S., Tharakunnel, K., & Westland, J. C. (2011). Data mining for credit card fraud: A comparative study. Decision Support Systems, 50(3), 602-613.

    Bunkhumpornpat, C., & Sinapiromsaran, K. (2017). DBMUTE: density-based majority under-sampling technique. Knowledge and Information Systems, 50(3), 827-850.

    Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2009). Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In T. Theeramunkong, B. Kijsirikul, N. Cercone, & T.-B. Ho (Eds.), Advances in Knowledge Discovery and Data Mining: Proceedings of the 13th Pacific-Asia Conference (pp. 475-482). Berlin, Heidelberg: Springer-Verlag.

    Castro, C. L., & Braga, A. P. (2013). Novel cost-sensitive approach to improve the multilayer perceptron performance on imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 24(6), 888-899.

    Cervantes, J., Garcia-Lamont, F., Rodriguez, L., López, A., Castilla, J. R., & Trueba, A. (2017). PSO-based method for SVM classification on skewed data sets. Neurocomputing, 228, 187-197.

    Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321-357.

    Chawla, N. V., Lazarevic, A., Hall, L. O., & Bowyer, K. W. (2003). SMOTEBoost: Improving prediction of the minority class in boosting. In N. Lavrač, D. Gamberger, L. Todorovski, & H. Blockeel (Eds.), Knowledge Discovery in Databases: PKDD 2003: 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (pp. 107-119). Berlin, Heidelberg: Springer.

    Chen, S., Guo, G., & Chen, L. (2010). A new over-sampling method based on cluster ensembles. Paper presented at the 2010 IEEE 24th International Conference on Advanced Information Networking and Applications Workshops, Perth, Western Australia, Australia.

    Cieslak, D. A., Chawla, N. V., & Striegel, A. (2006). Combating imbalance in network intrusion datasets. Paper presented at the 2006 IEEE International Conference on Granular Computing, Atlanta, Georgia, United States.

    Cohen, G., Hilario, M., Sax, H., Hugonnet, S., & Geissbuhler, A. (2006). Learning from imbalanced data in surveillance of nosocomial infection. Artificial intelligence in medicine, 37(1), 7-18.

    Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine learning, 20(3), 273-297.

    Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1), 21-27.

    Douzas, G., Bacao, F., & Last, F. (2018). Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Information Sciences, 465, 1-20.

    Dua, D., & Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Sciences. Retrieved from http://archive.ics.uci.edu/ml

    Elkan, C. (2001). The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (pp. 973-978). San Francisco, California, United States: Morgan Kaufmann Publishers Inc.

    Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In E. Simoudis, J. Han, & U. M. Fayyad (Eds.), KDD'96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (pp. 226-231). Menlo Park, California, United States: AAAI Press.

    Fahim, M., & Sillitti, A. (2019). Anomaly detection, analysis and prediction techniques in IoT environment: A systematic literature review. IEEE Access, 7, 81664-81681.

    Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119-139.

    Gao, M., Hong, X., Chen, S., Harris, C. J., & Khalaf, E. (2014). PDFOS: PDF estimation based over-sampling for imbalanced two-class problems. Neurocomputing, 138, 248-259.

    García, V., Sánchez, J., & Mollineda, R. (2007). An empirical study of the behavior of classifiers on imbalanced and overlapped data sets. In CIARP 2007: 12th Iberoamerican Congress on Pattern Recognition (pp. 397-406). Berlin, Heidelberg: Springer-Verlag.

    Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Advances in Intelligent Computing: International Conference on Intelligent Computing, ICIC 2005 (pp. 878-887). Berlin, Heidelberg: Springer.

    He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. Paper presented at the 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence), Hong Kong, China.

    He, H., Zhang, W., & Zhang, S. (2018). A novel ensemble method for credit scoring: Adaption of different imbalance ratios. Expert Systems with Applications, 98, 105-117.

    Jo, T., & Japkowicz, N. (2004). Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 6(1), 40-49.

    Kubat, M., & Matwin, S. (1997). Addressing the curse of imbalanced training sets: One-sided selection. In ICML '97: Proceedings of the Fourteenth International Conference on Machine Learning (pp. 179-186). San Francisco, California, United States: Morgan Kaufmann Publishers Inc.

    Lewis, D. D., & Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning. In Machine learning proceedings 1994 (pp. 148-156): Elsevier.

    Li, D.-C., Liu, C.-W., & Hu, S. C. (2010). A learning method for the class imbalance problem with medical data sets. Computers in biology and medicine, 40(5), 509-518.

    Li, D.-C., Wu, C.-S., Tsai, T.-I., & Lin, Y.-S. (2007). Using mega-trend-diffusion and artificial samples in small data set learning for early flexible manufacturing system scheduling knowledge. Computers & Operations Research, 34(4), 966-982.

    Lin, W.-C., Tsai, C.-F., Hu, Y.-H., & Jhang, J.-S. (2017). Clustering-based undersampling in class-imbalanced data. Information Sciences, 409, 17-26.

    Lu, J., Zhang, C., & Shi, F. (2016). A classification method of imbalanced data base on PSO algorithm. In ICYCSEE 2016: Second International Conference of Young Computer Scientists, Engineers and Educators (pp. 121-134). Singapore: Springer Singapore.

    Maciejewski, T., & Stefanowski, J. (2011). Local neighbourhood extension of SMOTE for mining imbalanced data. Paper presented at the 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Paris, France.

    Menardi, G., & Torelli, N. (2014). Training and assessing classification rules with imbalanced data. Data Mining and Knowledge Discovery, 28(1), 92-122.

    Pérez, J. M., Muguerza, J., Arbelaitz, O., Gurrutxaga, I., & Martín, J. I. (2005). Consolidated tree classifier learning in a car insurance fraud detection domain with class imbalance. In S. Singh, M. Singh, C. Apte, & P. Perner (Eds.), Pattern Recognition and Data Mining: Third International Conference on Advances in Pattern Recognition (pp. 381-389). Berlin, Heidelberg: Springer.

    Chan, P. K., & Stolfo, S. J. (1998). Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection. In R. Agrawal, P. Stolorz, & G. Piatetsky-Shapiro (Eds.), KDD'98: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (pp. 164-168). Menlo Park, California, United States: AAAI Press.

    Piri, S., Delen, D., & Liu, T. (2018). A synthetic informative minority over-sampling (SIMO) algorithm leveraging support vector machine to enhance learning from imbalanced datasets. Decision Support Systems, 106, 15-29.

    Prati, R. C., Batista, G. E., & Monard, M. C. (2004). Class imbalances versus class overlapping: an analysis of a learning system behavior.In G. A.-F. Raúl Monroy, Luis Enrique Sucar, Humberto Sossa (Eds.), MICAI 2004: Third Mexican International Conference on Artificial Intelligence, (pp. 312-321). Berlin, Heidelberg: Springer

    Ren, F., Cao, P., Li, W., Zhao, D., & Zaiane, O. (2017). Ensemble based adaptive over-sampling method for imbalanced data learning in computer aided detection of microaneurysm. Computerized Medical Imaging and Graphics, 55, 54-67.

    Rivera, W. A. (2017). Noise reduction a priori synthetic over-sampling for class imbalanced data sets. Information Sciences, 408, 146-161. doi:10.1016/j.ins.2017.04.046

    Sanchez, A. I., Morales, E. F., & Gonzalez, J. A. (2013). Synthetic oversampling of instances using clustering. International Journal on Artificial Intelligence Tools, 22(02), 1350008.

    Sáez, J. A., Luengo, J., Stefanowski, J., & Herrera, F. (2015). SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Information Sciences, 291, 184-203.

    Xie, Y., Li, X., Ngai, E., & Ying, W. (2009). Customer churn prediction using improved balanced random forests. Expert Systems with Applications, 36(3), 5445-5449.

    Yuan, X., Xie, L., & Abouelenien, M. (2018). A regularized ensemble framework of deep learning for cancer detection from multi-class, imbalanced training data. Pattern Recognition, 77, 160-172.

    Zeng, Z.-Q., & Gao, J. (2009). Improving SVM classification with imbalance data set. In C. S. Leung, M. Lee, & J. H. Chan (Eds.), ICONIP 2009: 16th International Conference on Neural Information Processing (pp. 389-398). Berlin, Heidelberg: Springer.

    Zhang, Y., Fu, P., Liu, W., & Chen, G. (2014). Imbalanced data classification based on scaling kernel-based support vector machine. Neural Computing and Applications, 25(3-4), 927-935.

    Full text available: on campus 2022-07-09; off campus 2022-07-09.