
Graduate Student: Lee, Zong-Ting (李宗頲)
Thesis Title: A Region Density Based Minority Oversampling Technique
Advisor: Li, Der-Chiang (利德江)
Degree: Master
Department: Department of Industrial and Information Management, College of Management
Year of Publication: 2020
Academic Year of Graduation: 108
Language: Chinese
Number of Pages: 43
Chinese Keywords (translated): imbalanced data, synthetic samples, oversampling technique, safe samples
English Keywords: Imbalanced classification, SMOTE, Safe instances
    Abstract (translated from Chinese): In real-world applications, class distributions are often unequal. Traditional machine learning algorithms tend to treat the class with few samples (the minority class) as noise, so the resulting model can neither distinguish minority samples effectively nor generalize well. To enrich the information carried by the minority class, the Synthetic Minority Over-sampling Technique (SMOTE) creates synthetic samples to improve classification performance. However, SMOTE does not take the distribution of samples between classes into account, so its synthetic samples can instead increase the complexity of the dataset and degrade the model. Although improved variants of SMOTE select suitable regions in which to synthesize samples, they are built on k-NN and therefore still cannot identify the sample distribution reliably: k-NN only guarantees finding k neighboring samples, not all neighboring samples, so the distribution it measures does not adequately represent the region around a single sample. This study proposes the Region Density Based Minority Oversampling Technique (RDMOT), which is built on a distance threshold instead. First, all neighbors of every minority sample are found within a defined distance threshold. Next, the average minority-neighbor ratio over all minority samples is computed, and the minority samples whose minority-neighbor ratio exceeds this average are selected, using the class proportions of local neighbors to measure sample safety and thereby locate regions that contain relatively many minority samples. Finally, synthetic samples are generated until the neighborhoods of the selected minority samples are balanced. To evaluate RDMOT, we select nine public datasets and compare it with several oversampling techniques on model metrics such as F-measure and AUC. RDMOT provides better or near-best results while synthesizing fewer samples.
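The abstract's central argument is that k-NN always returns exactly k neighbors regardless of local density, while a distance threshold returns every neighbor inside a region. A minimal NumPy sketch of the two neighbor queries (function names are illustrative, not taken from the thesis):

```python
import numpy as np

def knn_indices(X, i, k):
    """Indices of the k nearest samples to X[i] (excluding X[i] itself)."""
    d = np.linalg.norm(X - X[i], axis=1)
    order = np.argsort(d)
    return order[1:k + 1]  # order[0] is the query point itself

def radius_indices(X, i, r):
    """Indices of ALL samples within distance r of X[i] (excluding X[i])."""
    d = np.linalg.norm(X - X[i], axis=1)
    hits = np.where(d <= r)[0]
    return hits[hits != i]
```

For an isolated sample, `knn_indices` still reports k "neighbors" that may be far away, whereas `radius_indices` returns an empty set, correctly reflecting an empty region; this is the distinction the thesis exploits.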

    In the real world, class sample sizes are usually unequal. This is known as the imbalanced classification problem: traditional machine learning algorithms treat the minority class as tolerable error, so the final model fails to distinguish the classes. To address this issue, the Synthetic Minority Oversampling Technique (SMOTE) creates synthetic examples to improve identification of the minority class. However, it does not consider the sample distribution between classes, which increases the complexity of the dataset and further degrades model performance. Although some extensions of SMOTE try to locate suitable regions for creating synthetic examples, issues remain because of the k-nearest-neighbors (k-NN) algorithm: k-NN finds exactly k neighbors, which cannot capture the complete region around a selected instance. In this study, we propose a method based on region density, named the Region Density Based Minority Oversampling Technique (RDMOT). We set a radius with which to measure the region density of the minority class, then calculate the average minority-class ratio and select the minority instances whose minority-class ratio is higher than this average. In this way we select relatively dense regions of the minority class and create synthetic examples to balance the regions based on the difference between the numbers of minority and majority samples. We evaluate performance on nine datasets with F-measure, AUC, and accuracy, comparing against several extensions of SMOTE. The experimental results show that RDMOT achieves comparable or better outcomes than the other methods.
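The procedure described above (radius-based density, the average minority ratio as the seed threshold, SMOTE-style interpolation in seed regions) can be sketched as follows. This is a hedged reconstruction from the abstract alone: the function name, the choice to distribute the oversampling budget uniformly over the seeds, and the stopping rule are our assumptions, not the thesis's exact algorithm.

```python
import numpy as np

def rdmot_sketch(X, y, radius, minority_label=1, rng=None):
    """Illustrative region-density oversampling pass (not the exact RDMOT).

    1. For every minority sample, find ALL neighbours inside `radius`.
    2. Compute each sample's minority-neighbour ratio and the average ratio.
    3. Keep as "seeds" the minority samples whose ratio exceeds the average
       and that have at least one minority neighbour (safe regions).
    4. Interpolate synthetic points between random seeds and their minority
       neighbours until the overall class counts are equal.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    minority = np.where(y == minority_label)[0]
    ratios, min_neighbours = [], {}
    for i in minority:
        d = np.linalg.norm(X - X[i], axis=1)
        inside = np.where((d <= radius) & (np.arange(len(X)) != i))[0]
        n_min = np.sum(y[inside] == minority_label)
        ratios.append(n_min / max(len(inside), 1))
        min_neighbours[i] = inside[y[inside] == minority_label]
    avg = float(np.mean(ratios))
    seeds = [i for i, r in zip(minority, ratios)
             if r > avg and len(min_neighbours[i]) > 0]
    n_needed = int(np.sum(y != minority_label)) - len(minority)
    synthetic = []
    for _ in range(max(n_needed, 0)):
        if not seeds:
            break
        i = rng.choice(seeds)
        j = rng.choice(min_neighbours[i])
        gap = rng.random()  # SMOTE-style linear interpolation
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.asarray(synthetic).reshape(len(synthetic), X.shape[1])
```

Because interpolation only happens between a seed and its minority neighbours inside the radius, every synthetic point stays within a minority-dense region, which is the "safe instance" property the abstract emphasizes.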

    Table of Contents
    Abstract
    List of Figures
    List of Tables
    Chapter 1  Introduction
        1.1  Research Background
        1.2  Research Motivation
        1.3  Research Objectives
        1.4  Research Framework
    Chapter 2  Literature Review
        2.1  The Imbalanced Data Learning Problem
        2.2  Synthetic Minority Over-sampling Technique (SMOTE)
        2.3  Review of Synthetic Sample Generation Techniques
            2.3.1  Techniques Based on Sample-Importance Assessment
            2.3.2  Techniques Based on Hybrid Methods
            2.3.3  Techniques Based on Cluster Analysis
    Chapter 3  Methodology
        3.1  Notation and Definitions
        3.2  Defining Data Density
        3.3  Defining the Seed Sample Set
        3.4  Comparison of k-NN and the Distance Threshold
        3.5  Algorithm Flow
    Chapter 4  Empirical Study
        4.1  Experimental Environment
            4.1.1  Experimental Procedure
            4.1.2  Evaluation Metrics
            4.1.3  Experimental Data
            4.1.4  Software for Building Classification Models
        4.2  Experimental Results
        4.3  Research Findings
    Chapter 5  Conclusion
    References
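Section 4.1.2 of the outline lists F-measure and AUC as the evaluation metrics. Both have standard closed forms that can be computed directly; the sketch below uses the Mann-Whitney formulation of AUC (helper names are ours, not the thesis's):

```python
import numpy as np

def f_measure(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def auc_score(y_true, scores, positive=1):
    """AUC = probability that a random positive is scored above a random
    negative, with ties counting one half (Mann-Whitney statistic)."""
    pos = scores[y_true == positive]
    neg = scores[y_true != positive]
    wins = sum((p > neg).sum() + 0.5 * (p == neg).sum() for p in pos)
    return wins / (len(pos) * len(neg))
```

Both metrics ignore the raw accuracy trap of imbalanced data: a classifier that predicts only the majority class gets high accuracy but zero recall (hence zero F-measure) and an uninformative AUC of 0.5.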

    Airola, A., Pahikkala, T., Waegeman, W., De Baets, B., & Salakoski, T. (2011). An experimental comparison of cross-validation techniques for estimating the area under the ROC curve. Computational Statistics & Data Analysis, 55(4), 1828-1844.

    Barua, S., Islam, M. M., Yao, X., & Murase, K. (2012). MWMOTE: Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering, 26(2), 405-425.

    Batista, G. E., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 6(1), 20-29.

    Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2009). Safe-Level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 475-482). Springer.

    Chan, P. K., & Stolfo, S. J. (1998). Toward scalable learning with non-uniform class and cost distributions: A case study in credit card fraud detection. In Proceedings of KDD (pp. 164-168).

    Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.

    Cieslak, D. A., Chawla, N. V., & Striegel, A. (2006). Combating imbalance in network intrusion datasets. In Proceedings of the 2006 IEEE International Conference on Granular Computing (pp. 732-737).

    Cohen, G., Hilario, M., Sax, H., Hugonnet, S., & Geissbuhler, A. (2006). Learning from imbalanced data in surveillance of nosocomial infection. Artificial Intelligence in Medicine, 37(1), 7-18.

    Cover, T. M., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21-27.

    Douzas, G., Bacao, F., & Last, F. (2018). Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Information Sciences, 465, 1-20.

    Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. New York: Chapman & Hall.

    Elkan, C. (2001). The foundations of cost-sensitive learning. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI'01) (pp. 973-978). Lawrence Erlbaum Associates.

    Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96) (pp. 226-231).

    Farquad, M., & Bose, I. (2012). Preprocessing unbalanced data using support vector machine. Decision Support Systems, 53(1), 226-233.

    Fawcett, T. (2004). ROC graphs: Notes and practical considerations for researchers. Machine Learning, 31(1), 1-38.

    García, V., Sánchez, J., & Mollineda, R. (2007). An empirical study of the behavior of classifiers on imbalanced and overlapped data sets. In Proceedings of the 12th Iberoamerican Congress on Pattern Recognition (CIARP'07) (pp. 397-406). Springer.

    Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the 2005 International Conference on Advances in Intelligent Computing (ICIC'05) (pp. 878-887). Springer.

    He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (pp. 1322-1328). IEEE.

    He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284.

    Jo, T., & Japkowicz, N. (2004). Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 6(1), 40-49.

    Khoshgoftaar, T. M., & Rebours, P. (2007). Improving software quality prediction by noise filtering techniques. Journal of Computer Science and Technology, 22(3), 387-396.

    Kubat, M., & Matwin, S. (1997). Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML'97) (pp. 179-186). Nashville, USA.

    Li, D.-C., Liu, C.-W., & Hu, S. C. (2010). A learning method for the class imbalance problem with medical data sets. Computers in biology and medicine, 40(5), 509-518.

    Li, D.-C., Wu, C.-S., Tsai, T.-I., & Lin, Y.-S. (2007). Using mega-trend-diffusion and artificial samples in small data set learning for early flexible manufacturing system scheduling knowledge. Computers & Operations Research, 34(4), 966-982.

    Maciejewski, T., & Stefanowski, J. (2011). Local neighbourhood extension of SMOTE for mining imbalanced data. In Proceedings of the 2011 IEEE Symposium on Computational Intelligence and Data Mining (pp. 104-111). IEEE.

    Napierała, K., Stefanowski, J., & Wilk, S. (2010). Learning from imbalanced data in presence of noisy and borderline examples. In Proceedings of the 7th International Conference on Rough Sets and Current Trends in Computing (RSCTC'10) (pp. 158-167). Springer.

    Pérez, J. M., Muguerza, J., Arbelaitz, O., Gurrutxaga, I., & Martín, J. I. (2005). Consolidated tree classifier learning in a car insurance fraud detection domain with class imbalance. In Proceedings of the Third International Conference on Advances in Pattern Recognition (ICAPR'05) (pp. 381-389). Springer.

    Piri, S., Delen, D., & Liu, T. (2018). A synthetic informative minority over-sampling (SIMO) algorithm leveraging support vector machine to enhance learning from imbalanced datasets. Decision Support Systems, 106, 15-29.

    Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81-106.

    Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492-1496.

    Sáez, J. A., Luengo, J., Stefanowski, J., & Herrera, F. (2015). SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Information Sciences, 291, 184-203.

    Tomek, I. (1976). Two modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics, 6, 769-772.

    Xie, Y., Li, X., Ngai, E., & Ying, W. (2009). Customer churn prediction using improved balanced random forests. Expert Systems with Applications, 36(3), 5445-5449.

    Yuan, X., Xie, L., & Abouelenien, M. (2018). A regularized ensemble framework of deep learning for cancer detection from multi-class, imbalanced training data. Pattern Recognition, 77, 160-172.

    Full text released on campus: 2022-07-19
    Full text released off campus: 2022-07-19