
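Graduate student: 廖書緯
Liao, Shu-Wei
Thesis title: 以基於局部空間訊息量之SMOTE產生虛擬樣本處理類別不平衡資料集
A Local Information Based Synthetic Minority Oversampling Technique for Imbalanced Dataset Learning
Advisor: 利德江
Li, Der-Chiang
Degree: Master
Department: College of Management - Department of Industrial and Information Management
Year of publication: 2019
Academic year of graduation: 107 (ROC calendar)
Language: Chinese
Number of pages: 67
Keywords (translated from Chinese): class imbalance, synthetic minority oversampling technique, cluster analysis
Keywords (English): class imbalanced, SMOTE, cluster analysis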
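  • In today's rapidly developing technological society, terms such as artificial intelligence and the Internet of Things are familiar to everyone, and data is the foundation on which these mainstream technologies are built. Datasets come in many forms, and each calls for different or specialized methods of analysis. When the numbers of samples in different classes are badly out of proportion, the result is the class imbalance learning problem. Conventional classification algorithms trained on highly imbalanced data often misclassify minority-class samples, even though those samples usually carry more important meaning or a higher cost than majority-class samples. Improving classification accuracy for the minority class in data with highly skewed class proportions has therefore become an important issue. The synthetic minority oversampling technique (SMOTE) is one of the methods commonly used to address class imbalance: it picks a minority-class sample at random as the seed sample, finds the neighboring minority-class samples, picks one of them at random as the selected sample, and generates a virtual sample between the two minority-class samples. This study additionally considers the influence between majority-class and minority-class samples and the influence among minority-class samples, and proposes a SMOTE based on local spatial information (Local Information Index SMOTE, LII-SMOTE). After virtual samples are generated for an imbalanced dataset with the proposed method, the evaluation metrics for the minority class improve compared with other SMOTE variants.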

    A dataset is imbalanced if its classes are not approximately equally represented. Data mining on imbalanced datasets has received increasing attention in recent years. The class imbalance problem occurs when one class has only a few samples compared with the other classes. The synthetic minority over-sampling technique (SMOTE) is an effective method for the imbalanced learning problem. It takes one minority sample as the seed sample and picks a nearby minority sample as the selected sample, then generates a virtual sample between the two. In this study, we additionally consider the influence between majority samples and the selected sample and the influence between minority samples and the selected sample, and develop a new sample-generating procedure based on local majority-class and local minority-class information. The experiments use four datasets taken from the UCI Machine Learning Repository. We compare the proposed method with SMOTE and with extensions including Borderline-SMOTE1 (B1-SMOTE), Safe-Level SMOTE (SL-SMOTE), Local-Neighborhood SMOTE (LN-SMOTE), and ADASYN. The results show that, with C4.5 decision trees as the classifier, the proposed method achieves better classification performance for the minority class than the other methods.
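
    The basic SMOTE step described in the abstract (seed sample, nearby minority sample, virtual sample between the two) can be sketched as follows. This is a minimal illustration of plain SMOTE only, not the LII-SMOTE procedure proposed in the thesis, whose details are not given in this record. The function name `smote_sketch`, its parameters, and the choice of Euclidean distance with k nearest neighbors are assumptions made for the example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new virtual minority samples by basic SMOTE interpolation.

    Hypothetical helper for illustration; X_min holds minority-class rows only.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Pairwise Euclidean distances among minority samples (assumed metric).
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        s = rng.integers(n)                # index of the seed sample
        t = neighbors[s, rng.integers(k)]  # index of the selected neighbor
        gap = rng.random()                 # position on the segment, in [0, 1)
        synthetic[i] = X_min[s] + gap * (X_min[t] - X_min[s])
    return synthetic

# Usage: oversample a tiny two-feature minority class.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
virtual = smote_sketch(X_minority, n_new=20, k=3, seed=1)
```

    Because every virtual sample lies on a segment between two existing minority samples, plain SMOTE ignores where the majority samples are; the thesis addresses this by weighing local majority-class and minority-class information.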

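    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    List of Tables
    List of Figures
    Chapter 1 Introduction
    1.1 Research Background
    1.2 Research Motivation
    1.3 Research Objectives
    1.4 Thesis Structure
    Chapter 2 Literature Review
    2.1 Imbalanced Data and Its Evaluation Metrics
    2.1.1 The Imbalanced Data Learning Problem
    2.1.2 Evaluation Metrics for Imbalanced Data
    2.2 Review of Methods for Imbalanced Data
    2.2.1 Main Methods for Imbalanced Data Learning
    2.2.2 Recent Methods for Imbalanced Data Learning
    2.3 SMOTE and Its Extensions
    2.3.1 SMOTE
    2.3.2 Extensions of the SMOTE Algorithm
    2.4 Cluster Analysis
    Chapter 3 Methodology
    3.1 Notation Used in This Study
    3.2 Clustering the Sample Dataset
    3.3 Local Spatial Information
    3.3.1 Local Spatial Information of Majority-Class Samples
    3.3.2 Local Spatial Information of Minority-Class Samples
    3.4 Generating Virtual Samples
    3.5 Procedure
    Chapter 4 Empirical Study
    4.1 Experimental Environment
    4.1.1 Experimental Procedure
    4.1.2 Evaluation Metrics
    4.1.3 Experimental Data
    4.1.4 Software for Building Classification Models
    4.1.4 Sensitivity Analysis of the Distance Threshold (dc)
    4.2 Experimental Results
    4.2.1 Haberman Dataset and Results
    4.2.2 Pima Dataset and Results
    4.2.3 Phoneme Dataset and Results
    4.2.4 Satimage Dataset and Results
    Chapter 5 Conclusions
    5.1 Conclusions
    5.2 Suggestions for Future Research
    References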

    Airola, A., Pahikkala, T., Waegeman, W., De Baets, B., & Salakoski, T. (2011). An experimental comparison of cross-validation techniques for estimating the area under the ROC curve. Computational Statistics & Data Analysis, 55(4), 1828-1844.
    Barandela, R., Valdovinos, R. M., & Sánchez, J. S. (2003). New applications of ensembles of classifiers. Pattern Analysis & Applications, 6(3), 245-256.
    Barua, S., Islam, M. M., Yao, X., & Murase, K. (2014). MWMOTE--majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering, 26(2), 405-425.
    Batista, G. E., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM Sigkdd Explorations Newsletter, 6(1), 20-29.
    Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. Paper presented at the Proceedings of the fifth annual workshop on Computational learning theory.
    Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2009). Safe-level-smote: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. Paper presented at the Pacific-Asia conference on knowledge discovery and data mining.
    Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321-357.
    Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine learning, 20(3), 273-297.
    Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1), 21-27.
    Elkan, C. (2001). The foundations of cost-sensitive learning. Paper presented at the International joint conference on artificial intelligence.
    Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Paper presented at KDD.
    Fawcett, T. (2004). ROC graphs: Notes and practical considerations for researchers. Machine learning, 31(1), 1-38.
    García, V., Sánchez, J., & Mollineda, R. (2007). An empirical study of the behavior of classifiers on imbalanced and overlapped data sets. Paper presented at the Iberoamerican Congress on Pattern Recognition.
    Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Paper presented at the International Conference on Intelligent Computing.
    He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. Paper presented at the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).
    He, H., & Garcia, E. A. (2008). Learning from imbalanced data. IEEE Transactions on Knowledge & Data Engineering(9), 1263-1284.
    He, H., Zhang, W., & Zhang, S. (2018). A novel ensemble method for credit scoring: Adaption of different imbalance ratios. Expert Systems with Applications, 98, 105-117.
    He, H. B., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284. doi:10.1109/TKDE.2008.239
    Holland, J. H. (1992). Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence: MIT press.
    Jo, T., & Japkowicz, N. (2004). Class imbalances versus small disjuncts. ACM Sigkdd Explorations Newsletter, 6(1), 40-49.
    Kaufman, L., & Rousseeuw, P. (1987). Clustering by means of medoids: North-Holland.
    Laurikkala, J. (2001). Improving identification of difficult small classes by balancing class distribution. Paper presented at the Conference on Artificial Intelligence in Medicine in Europe.
    Lewis, D. D., & Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning. In Machine Learning Proceedings 1994 (pp. 148-156): Elsevier.
    Liu, M., Xu, C., Luo, Y., Xu, C., Wen, Y., & Tao, D. (2018). Cost-Sensitive Feature Selection by Optimizing F-Measures. IEEE Transactions on Image Processing, 27(3), 1323-1335.
    Maciejewski, T., & Stefanowski, J. (2011). Local neighbourhood extension of SMOTE for mining imbalanced data. Paper presented at the 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM).
    MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Paper presented at the Proceedings of the fifth Berkeley symposium on mathematical statistics and probability.
    Napierała, K., Stefanowski, J., & Wilk, S. (2010). Learning from imbalanced data in presence of noisy and borderline examples. Paper presented at the International Conference on Rough Sets and Current Trends in Computing.
    Nevin, J. A. (1969). Signal detection theory and operant behavior: A review of David M. Green and John A. Swets' Signal detection theory and psychophysics. Journal of the Experimental Analysis of Behavior, 12(3), 475-480.
    Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492-1496.
    Tomczak, J. M., & Zięba, M. (2015). Probabilistic combination of classification rules and its application to medical diagnosis. Machine learning, 101(1-3), 105-135.
    Tomek, I. (1976). Two modifications of CNN. IEEE Trans. Systems, Man and Cybernetics, 6, 769-772.
    Vert, J.-P., Tsuda, K., & Schölkopf, B. (2004). A primer on kernel methods. Kernel methods in computational biology, 47, 35-70.
    Wang, S., Minku, L. L., & Yao, X. (2018). A systematic study of online class imbalance learning with concept drift. IEEE Transactions on Neural Networks and Learning Systems.
    Weiss, G. M. (2004). Mining with rarity: a unifying framework. ACM Sigkdd Explorations Newsletter, 6(1), 7-19.
    Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, SMC-2(3), 408-421.
    Xie, Z., Jiang, L., Ye, T., & Li, X. (2015). A synthetic minority oversampling method based on local densities in low-dimensional space for imbalanced learning. Paper presented at the International Conference on Database Systems for Advanced Applications.
    Zhu, B., Niu, Y., Xiao, J., & Baesens, B. (2017). A new transferred feature selection algorithm for customer identification. Neural Computing and Applications, 28(9), 2593-2603.

    Full-text availability: on campus from 2022-07-01
    off campus from 2024-07-01