
Graduate Student: Chang, Ya-Ting (張雅婷)
Thesis Title: Estimating Ensemble Size by Numeric Prediction Methods (用數值預測方法來估算集成大小之研究)
Advisor: Wong, Tzu-Tsung (翁慈宗)
Degree: Master
Department: Institute of Information Management, College of Management
Year of Publication: 2023
Academic Year of Graduation: 111
Language: Chinese
Number of Pages: 66
Chinese Keywords: 集成學習、基本模型、迴歸分析、數值預測、模型樹
English Keywords: Ensemble learning, base models, regression analysis, numerical prediction, model tree
Chinese Abstract:
In the era of big data, data mining is one of the topics that many people study, and ensemble learning in particular is a method adopted in a great deal of that research. Ensemble learning combines the base models generated by one or more classification methods in the hope of producing a stronger model, one whose predictions are better than those of any single model. When ensemble learning is adopted, however, the number of base models to be combined has received little discussion; most of the time users set it themselves or tune it by gradually adding models. This study therefore investigates the appropriate number of base models under different conditions. Three factors that influence this number, namely the classification method, the characteristics of the data set, and the method used to generate the base models, are treated as independent variables, and the number of base models is the dependent variable. Data collected from fifty data sets are analyzed with linear regression and with the M5P model tree, the resulting models are interpreted through numeric prediction, and an additional ten data sets are tested to confirm their effectiveness. The experimental results show that in every case the obtained number falls within the range of the relative absolute error, which implies that the upper bound on the number of base models should be no larger than twice the value predicted by the linear regression model. The classification method and the way the base models are generated are the two main factors that determine the number of base models in ensemble learning.

English Abstract:
In the era of big data, ensemble learning is a widely used technique in many classification studies. Ensemble learning refers to building a stronger model by combining several base models induced by single or multiple classification algorithms, and this ensemble model is expected to outperform any single base model. The number of base models is critical to the performance of an ensemble model. Most of the time, users set it manually or adjust it by gradually increasing the number of models. Therefore, this study aims to investigate the factors for determining the appropriate number of base models. The factors considered in this study include the classification algorithm, the characteristics of the data set, and the method for generating base models. These factors play the role of independent variables for predicting the value of the dependent variable, the number of base models that achieves the highest accuracy. Two numeric prediction methods, multiple linear regression and the M5P model tree, are adopted to analyze the data collected from 50 data sets. The numeric prediction models obtained from the model tree are then tested on another 10 data sets to verify their effectiveness. The experimental results show that the actual number of base models for each data set falls within the range of its relative absolute error. This implies that the upper bound for searching the number of base models should not be larger than twice the predicted value obtained from the linear regression model. The classification algorithm and the way of generating base models are the two main factors for determining the number of base models in ensemble learning.
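The abstract describes the numeric-prediction step only at a high level. The sketch below, which is not the thesis code, illustrates the idea under assumed inputs: a multiple linear regression maps data-set and algorithm characteristics to the best ensemble size, its predictions are scored with the relative absolute error, and twice each predicted value is taken as the search upper bound for the number of base models. The feature columns and all numbers are synthetic stand-ins, and the M5P model tree used in the thesis (available in Weka) is not reproduced here.

```python
# Minimal sketch (not the thesis code): predict the appropriate ensemble size
# with multiple linear regression, score it with the relative absolute error
# (RAE), and derive the search upper bound as twice the predicted value.
# All feature columns and numbers below are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# One row per (data set, classification algorithm, generation method) combination.
# Assumed columns: n_instances, n_attributes, n_classes, algorithm id, generation id.
X_train = rng.integers([100, 5, 2, 0, 0], [5000, 60, 10, 4, 2], size=(50, 5))
y_train = rng.integers(10, 200, size=50)   # best number of base models found per row

reg = LinearRegression().fit(X_train, y_train)

def relative_absolute_error(y_true, y_pred):
    """RAE = sum|y - y_hat| / sum|y - mean(y)|."""
    return np.abs(y_true - y_pred).sum() / np.abs(y_true - y_true.mean()).sum()

# Ten held-out rows play the role of the additional validation data sets.
X_test = rng.integers([100, 5, 2, 0, 0], [5000, 60, 10, 4, 2], size=(10, 5))
y_test = rng.integers(10, 200, size=10)
y_pred = reg.predict(X_test)

print("RAE on held-out rows:", round(relative_absolute_error(y_test, y_pred), 3))

# Upper bound suggested by the abstract: search no further than twice the prediction.
upper_bound = np.ceil(2 * np.maximum(y_pred, 1)).astype(int)
print("Search upper bound per data set:", upper_bound)
```

In the thesis workflow the regression and the M5P tree would be fit on the 50 collected data sets and validated on the 10 additional ones; the synthetic arrays above merely stand in for that collected data.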

Table of Contents:
Abstract
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
  1.1 Research Background
  1.2 Research Motivation
  1.3 Research Objectives
  1.4 Organization of the Thesis
Chapter 2 Literature Review
  2.1 Ensemble Learning
  2.2 Ensemble Learning Methods and Strategies
  2.3 The Effect of the Number of Ensemble Models on Ensemble Learning
  2.4 Methods for Determining the Number of Models
  2.5 Summary
Chapter 3 Research Methodology
  3.1 Research Process
  3.2 Factors Affecting the Number of Base Models in Ensemble Learning
  3.3 Collecting Sample Data
    3.3.1 Decision Tree
    3.3.2 Naive Bayes Classifier
    3.3.3 Logistic Regression
    3.3.4 Support Vector Machine
  3.4 Determining the Appropriate Number of Base Models for a Data Set
  3.5 Methods for Predicting the Number of Base Models
  3.6 Model Interpretation and Analysis
  3.7 Model Validation
Chapter 4 Experimental Results
  4.1 Description of the Data Sets
  4.2 Data Variables and the Final Number of Models
  4.3 Linear Regression Analysis Results
  4.4 M5P Model Tree Analysis Results
  4.5 Validating the Regression Equations
  4.6 Summary
Chapter 5 Conclusions and Suggestions
  5.1 Conclusions
  5.2 Suggestions and Future Research
References
Appendix


Full-text availability: immediate open access on campus and off campus.