Graduate Student: Chen, Yi-Ting (陳益廷)
Thesis Title: Using the Mega-trend-diffusion Technique to Generate Virtual Samples on the Hyperplane to Improve Support Vector Machine for Imbalanced Data Set Learning (運用整體趨勢擴散技術於超平面上生成虛擬樣本以改善支援向量機於不平衡資料集的學習)
Advisor: Li, Der-Chiang (利德江)
Degree: Master
Department: College of Management - Department of Industrial and Information Management
Publication Year: 2022
Academic Year of Graduation: 110
Language: Chinese
Pages: 40
Keywords (translated from Chinese): imbalanced data set, virtual sample, undersampling technique, mega-trend-diffusion technique, support vector machine, hyperplane, Siamese network
Keywords (English): imbalanced dataset, oversampling, Mega-Trend-diffusion (MTD), Siamese Network
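  • Abstract (translated from Chinese): The rapid development of computing and network technology has brought about today's era of big data. To sustain long-term growth, enterprises often acquire large amounts of data to build machine learning models that inform their future strategy. Big data, however, frequently suffers from the imbalanced data set problem, because samples in the real world are often unevenly distributed. Traditional machine learning classification algorithms often treat minority-class samples as errors, which biases the model toward the majority class; the classifier then cannot effectively distinguish minority-class samples, and this leads to poor decisions. Building a reliable learning model from an imbalanced data set is therefore an important challenge for both academia and industry. To increase the amount of information carried by the minority class, this study proposes a new oversampling method that increases the number of minority-class samples. The minority-class samples are extracted from the hyperplane of a Support Vector Machine (SVM) model built from a simulated data set, and the Mega-Trend-diffusion (MTD) technique is then used to generate virtual samples. A new evaluation mechanism developed in this study, based on a Siamese Network, assesses whether the generated virtual samples meet our requirements. Finally, the prediction model built from the simulated data set is evaluated on real data, with Geometric Mean (G-mean), F-measure (F1), AUC, and ACC used as the classification metrics to judge the learning performance of the proposed method on imbalanced data sets.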

    Abstract (English): The rapid development of computer and network technology has ushered in the present era of big data, and enterprises often acquire large amounts of data to construct machine learning models that guide their strategies. Traditional machine learning classification algorithms often treat the minority class as error, which biases the model toward the majority class; the classifier then cannot effectively distinguish minority samples, which leads to erroneous decision-making. Therefore, to increase the amount of sample information in the minority class, this thesis develops a new oversampling method that increases the number of minority-class samples. The extracted minority-class samples are taken from the support vectors found from the hyperplane of the support vector machine. Virtual samples are generated with the Mega-Trend-diffusion (MTD) technique, and an evaluation mechanism based on the Siamese Network was created to assess whether the generated virtual samples meet our requirements. Finally, the prediction model was constructed using the simulated dataset and run on the actual data, and the learning efficiency of the model was evaluated using Geometric Mean (G-mean), F-measure (F1), ACC, AUC, and AVE.
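The abstract names G-mean, F1, AUC, and ACC as evaluation metrics. As a minimal sketch, the snippet below computes ACC, F1, and G-mean from hard label predictions using their standard definitions. The function name `imbalance_metrics` and the toy labels are illustrative assumptions and do not come from the thesis; AUC is left out because it requires classifier scores rather than labels.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return ACC, F1, and G-mean, treating `positive` as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard every ratio against an empty denominator.
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # minority-class accuracy
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # majority-class accuracy
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    g_mean = math.sqrt(recall * specificity)
    acc = (tp + tn) / len(pairs) if pairs else 0.0
    return {"acc": acc, "f1": f1, "g_mean": g_mean}


# Toy data (assumed): 2 minority samples, 8 majority samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10                       # ignores the minority class
one_hit = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]         # finds one minority sample

print(imbalance_metrics(y_true, always_majority))
print(imbalance_metrics(y_true, one_hit))
```

On the toy data, the classifier that always predicts the majority class keeps a high ACC while its F1 and G-mean fall to zero, which is the bias toward the majority class that the abstract describes.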

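    Table of Contents
    Abstract
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
      1.1 Research Background
      1.2 Research Motivation
      1.3 Research Objectives
      1.4 Research Structure
    Chapter 2 Literature Review
      2.1 Imbalanced Data
      2.2 SMOTE
        2.2.1 Safe-Level SMOTE (SL-SMOTE)
      2.3 Virtual Sample
      2.4 Support Vector Machine (SVM)
        2.4.1 Hyperplane
      2.5 Siamese Network
    Chapter 3 Research Method
      3.1 Notation
      3.2 Mega-Trend-Diffusion Technique
        3.2.1 Range Estimation for Virtual Samples
        3.2.2 Skewness Setting
        3.2.3 Membership Function
        3.2.4 Siamese Network
      3.3 Research Procedure
    Chapter 4 Empirical Validation
      4.1 Experimental Environment and Parameters
      4.2 Evaluation Metrics
      4.3 Experimental Data
      4.4 Experimental Data and Results
    Chapter 5 Conclusions and Future Suggestions
    References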
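Sections 3.2.1 to 3.2.3 of the outline cover range estimation, skewness setting, and the membership function of MTD. This record does not show the thesis's own equations, so the sketch below follows the commonly published form of mega-trend-diffusion for a single attribute and should be read as an assumption about the general technique, not as the author's implementation. The function names, the toy values, and the rejection-sampling step for drawing virtual values are all illustrative choices.

```python
import math
import random


def mtd_bounds(values):
    """Estimate the diffused range (lower, center, upper) of one attribute."""
    lo, hi = min(values), max(values)
    center = (lo + hi) / 2.0
    n = len(values)
    if n < 2 or lo == hi:
        return lo, center, hi  # degenerate case: nothing to diffuse

    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    n_l = sum(1 for v in values if v < center)  # count below the center
    n_u = sum(1 for v in values if v > center)  # count above the center
    skew_l = n_l / (n_l + n_u)
    skew_u = n_u / (n_l + n_u)

    spread = -2.0 * math.log(1e-20)  # diffusion constant in the usual MTD form
    lower = center - skew_l * math.sqrt(spread * var / n_l)
    upper = center + skew_u * math.sqrt(spread * var / n_u)
    # Never let the estimated range be narrower than the observed range.
    return min(lower, lo), center, max(upper, hi)


def membership(x, lower, center, upper):
    """Triangular membership: 1 at the center, 0 at and beyond the bounds."""
    if x < lower or x > upper:
        return 0.0
    if x <= center:
        return (x - lower) / (center - lower) if center > lower else 1.0
    return (upper - x) / (upper - center) if upper > center else 1.0


def generate_virtual_values(values, k, rng):
    """Draw k virtual values, accepting each with probability = membership.

    The acceptance rule is an assumption made for this sketch.
    """
    lower, center, upper = mtd_bounds(values)
    out = []
    while len(out) < k:
        x = rng.uniform(lower, upper)
        if rng.random() <= membership(x, lower, center, upper):
            out.append(x)
    return out


observed = [1.0, 2.0, 3.0, 10.0]  # toy minority-class attribute values
print(mtd_bounds(observed))
print(generate_virtual_values(observed, 5, random.Random(0)))
```

Because three of the four toy values lie below the center, the lower bound is pushed further out than the upper bound, which is the effect the skewness setting is meant to capture.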

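    Full-text availability: on campus, public from 2026-06-13; off campus, public from 2026-06-13.
    The electronic thesis has not yet been authorized for public release; for the print copy, please consult the library catalog.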