| Author: | 陳彥均 Chen, Yen-Chun |
|---|---|
| Title: | 以相依變量增生小類別樣本技術學習不平衡資料 (The Dependent-Variable SMOTE for Learning Imbalanced Data Sets) |
| Advisor: | 利德江 Li, Der-Chiang |
| Degree: | Master |
| Department: | Department of Industrial and Information Management, College of Management |
| Year of publication: | 2018 |
| Academic year of graduation: | 106 |
| Language: | Chinese |
| Pages: | 49 |
| Chinese keywords: | 不平衡資料 (imbalanced data), 屬性相依性 (attribute dependency) |
| English keywords: | imbalanced data, dependent sample, SMOTE |
Imbalanced data sets arise in many real-world domains, and methods for learning from such data have attracted considerable research attention in recent years. Among them, the Synthetic Minority Over-sampling Technique (SMOTE) is a widely used preprocessing technique for data analysis: it adds synthetic minority-class samples to balance the class sizes, thereby improving a classifier's predictive accuracy on the minority class. To further improve the quality of the samples SMOTE generates, prior studies have developed many extensions, such as Borderline SMOTE (B-SMOTE), Safe-Level SMOTE (SL-SMOTE), and Local-Neighborhood SMOTE (LN-SMOTE), all of which focus on where the synthetic samples are placed. However, these methods are still based on SMOTE: they generate the synthetic value of each attribute independently, without considering the correlations between attributes. This study therefore proposes a synthetic-sample generation method that takes attribute correlations into account, so as to further improve the quality of the samples SMOTE produces and raise the classification accuracy on the minority class. The method is named the Dependent-Variable SMOTE (DV-SMOTE). Finally, experiments show that DV-SMOTE improves classification on imbalanced data more effectively than SMOTE and its extensions B-SMOTE, SL-SMOTE, and LN-SMOTE.
Data mining on imbalanced data sets has received increasing attention in recent years. The class-imbalance problem occurs when one class contains far fewer instances than the others. The Synthetic Minority Over-sampling Technique (SMOTE) is an effective method for improving the recognition of the minority class in class-imbalanced problems. SMOTE is an over-sampling method that generates new synthetic instances from the minority class, and it provides a standard procedure for further research such as Borderline SMOTE, Safe-Level SMOTE, and Local-Neighborhood SMOTE. In this paper, we propose an extension of SMOTE that considers the relations between different attributes and decides the location of synthetic samples with a fuzzy technique. This study develops a new sample-generating procedure to determine the range on both sides of the line that connects each pair of population samples. Three data sets taken from the UCI Machine Learning Repository are used in the experiments. We compare the proposed method with SMOTE and its extensions, including Borderline SMOTE, Safe-Level SMOTE, and Local-Neighborhood SMOTE. The results show that the proposed method achieves better classifier performance for the minority class than the other methods when the data sets are evaluated with C4.5 decision trees.
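To make the baseline concrete, the following is a minimal sketch of plain SMOTE as described by Chawla et al. (2002), whose pseudocode draws a fresh random gap for each attribute; this per-attribute independence is exactly what DV-SMOTE is designed to remove. The function name `smote` and its parameters are illustrative, not the thesis's implementation.

```python
import numpy as np

def smote(minority, n_synthetic, k=5, rng=None):
    """Sketch of plain SMOTE: interpolate between a minority sample and
    one of its k nearest minority-class neighbors, with an independent
    random gap per attribute (as in Chawla et al.'s pseudocode)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances within the minority class only.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X))           # random seed sample
        j = rng.choice(neighbors[i])       # one of its k nearest neighbors
        gap = rng.random(X.shape[1])       # one gap in [0, 1) PER attribute
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)
```

Because each attribute gets its own gap, a synthetic point lands anywhere in the axis-aligned box spanned by the two parent samples rather than strictly on the segment between them, and no correlation between attributes is preserved.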
Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2009). Safe-Level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 475-482). Berlin, Heidelberg.
Chan, P. K., & Stolfo, S. J. (1998). Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection. In KDD (Vol. 98, pp. 164-168). New York, USA.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
Chongfu, H. (1997). Principle of information diffusion. Fuzzy Sets and Systems, 91(1), 69-90.
Cohen, G., Hilario, M., Sax, H., Hugonnet, S., & Geissbuhler, A. (2006). Learning from imbalanced data in surveillance of nosocomial infection. Artificial Intelligence in Medicine, 37(1), 7-18.
Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21-27.
Elkan, C. (2001). The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence (Vol. 17, No. 1, pp. 973-978). Seattle, WA, USA.
García, V., Sánchez, J., & Mollineda, R. (2007). An empirical study of the behavior of classifiers on imbalanced and overlapped data sets. In Iberoamerican Congress on Pattern Recognition (pp. 397-406). Berlin, Heidelberg.
Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing (pp. 878-887). Berlin, Heidelberg.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284.
Japkowicz, N. (2000). Learning from imbalanced data sets: A comparison of various strategies. In AAAI Workshop on Learning from Imbalanced Data Sets (Vol. 68, pp. 10-15). Texas, USA.
Jo, T., & Japkowicz, N. (2004). Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 6(1), 40-49.
Kubat, M., & Matwin, S. (1997). Addressing the curse of imbalanced training sets: One-sided selection. In ICML (Vol. 97, pp. 179-186). Nashville, TN, USA.
Lewis, D. D., & Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning. In Machine Learning Proceedings 1994 (pp. 148-156). Rutgers University, New Brunswick, USA.
Li, D.-C., Chen, C.-C., Chen, W.-C., & Chang, C.-J. (2012). Employing dependent virtual samples to obtain more manufacturing information in pilot runs. International Journal of Production Research, 50(23), 6886-6903.
Li, D.-C., Lin, W.-K., Lin, L.-S., Chen, C.-C., & Huang, W.-T. (2017). The attribute-trend-similarity method to improve learning performance for small datasets. International Journal of Production Research, 55(7), 1898-1913.
Maciejewski, T., & Stefanowski, J. (2011). Local neighbourhood extension of SMOTE for mining imbalanced data. In 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM) (pp. 104-111). Paris, France.
Napierała, K., Stefanowski, J., & Wilk, S. (2010). Learning from imbalanced data in presence of noisy and borderline examples. In International Conference on Rough Sets and Current Trends in Computing (pp. 158-167). Berlin, Heidelberg.
Pérez, J. M., Muguerza, J., Arbelaitz, O., Gurrutxaga, I., & Martín, J. I. (2005). Consolidated tree classifier learning in a car insurance fraud detection domain with class imbalance. In International Conference on Pattern Recognition and Image Analysis (pp. 381-389). Berlin, Heidelberg.
Sasaki, Y. (2007). The truth of the F-measure. Teach Tutor Mater, 1(5).
Sáez, J. A., Luengo, J., Stefanowski, J., & Herrera, F. (2015). SMOTE-IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Information Sciences, 291, 184-203.
Solberg, A. S., & Solberg, R. (1996). A large-scale evaluation of features for automatic detection of oil spills in ERS SAR images. In International Geoscience and Remote Sensing Symposium (IGARSS '96) (Vol. 3, pp. 1484-1486). Lincoln, NE, USA.
Tomek, I. (1976). Two modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics, 6, 769-772.
Xie, Y., Li, X., Ngai, E., & Ying, W. (2009). Customer churn prediction using improved balanced random forests. Expert Systems with Applications, 36(3), 5445-5449.