
Graduate Student: 蔡敬泰 (Tsai, Jing-Tai)
Thesis Title: K等分交叉驗證法所得分類正確率之相依性分析
Dependency Analysis of the Accuracy Estimates Obtained from k-fold Cross Validation
Advisor: 翁慈宗 (Wong, Tzu-Tsung)
Degree: Master
Department: Institute of Information Management, College of Management
Year of Publication: 2017
Graduation Academic Year: 105 (2016-2017)
Language: Chinese
Pages: 39
Chinese Keywords: 相依性分析、獨立性、K等分交叉驗證法、訓練資料重複
English Keywords: Dependency analysis, independence, K-fold cross validation, training data overlapping
    In data mining, classification accuracy is the usual indicator for judging the quality of a classification algorithm, and it is commonly estimated by K-fold cross validation. Many previous studies hold, however, that the fold accuracies obtained from K-fold cross validation are dependent on one another, because the training sets corresponding to the test sets of any two folds overlap. This study examines whether training-set overlap in K-fold cross validation is a suitable basis for concluding that the fold accuracies are correlated, and designs a testing procedure to check, for a given dataset, whether the fold accuracies obtained from K-fold cross validation can appropriately be assumed independent. Empirical results on 18 datasets show that training-set overlap does not imply that the fold accuracies are dependent, and that the choice of classification algorithm does not affect whether the fold accuracies are dependent. Moreover, the overall proportion of rejected null hypotheses is close to the probability of a type I error. The empirical results therefore indicate that the fold accuracies obtained from K-fold cross validation can appropriately be assumed independent.

    The performance of a classification algorithm is generally evaluated by K-fold cross validation. Several previous studies consider that the accuracies obtained from K-fold cross validation are dependent because of the overlapping of training sets. This study first investigates whether the dependency relationships among fold accuracies are caused by the overlapping of training sets. Then a statistical method is proposed to test the independence assumption for fold accuracies. The experimental results obtained from 18 datasets processed by four classification algorithms demonstrate that the overlapping of training sets is irrelevant to the dependency relationships among fold accuracies. The independence assumption is generally appropriate for fold accuracies, regardless of the number of folds and the algorithm used for classification.
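    The concern above can be made precise with a standard identity (stated here for context; it is not quoted from the thesis). If the K fold accuracies A_1, ..., A_K have a common variance sigma^2 and a common pairwise correlation rho, the variance of their mean is

        \[
        \operatorname{Var}(\bar{A}) = \frac{\sigma^2}{K}\bigl(1 + (K-1)\rho\bigr),
        \]

    which reduces to the familiar sigma^2 / K only when rho = 0. If training-set overlap did induce rho > 0, the customary estimate s^2 / K would understate the true variance of the cross-validation estimate; this is the issue raised by Bengio and Grandvalet (2004) and tested empirically in this thesis.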

    Table of Contents:
    Abstract
    Chapter 1  Introduction
        1.1  Research Background and Motivation
        1.2  Research Objectives
        1.3  Thesis Organization
    Chapter 2  Literature Review
        2.1  Estimating the Classification Accuracy of Classification Algorithms
        2.2  Impact of Training-Set Overlap on Evaluation in K-fold Cross Validation
        2.3  Independence
        2.4  Sampling Distribution of the Sample Variance
        2.5  Summary
    Chapter 3  Research Methods
        3.1  Impact of Training-Set Overlap
        3.2  Dependency among Classification Predictions
        3.3  Theoretical Variance of Classification Accuracy
        3.4  Unbiasedness of the Variance of the Sample Mean
        3.5  Testing Procedure
        3.6  Evaluation of the Method
    Chapter 4  Empirical Analysis
        4.1  Characteristics of the Datasets
        4.2  Independence Tests for Fold Accuracies
        4.3  Factors in the Dependency of Fold Accuracies
            4.3.1  Impact of Different K Values on Dependency
            4.3.2  Impact of Different Classification Algorithms on Dependency
        4.4  Discussion of Dependency among Individual Fold Accuracies
        4.5  Summary
    Chapter 5  Conclusions and Suggestions
        5.1  Conclusions
        5.2  Suggestions and Future Research
    References
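    Sections 3.5, 3.6, and 4.2 of the outline concern testing whether fold accuracies can be assumed independent. As a minimal sketch of the quantities involved (illustrative only: the thesis's own code is not published, and the iris dataset and naive Bayes classifier below are stand-ins for the 18 UCI datasets and four unnamed classification algorithms it uses), the following Python code computes the K per-fold accuracies and the variance estimate that the independence assumption underwrites:

        # Minimal sketch: per-fold accuracies from K-fold cross validation.
        # The dataset (iris) and classifier (Gaussian naive Bayes) are
        # illustrative stand-ins, not the thesis's actual experimental setup.
        import numpy as np
        from sklearn.datasets import load_iris
        from sklearn.metrics import accuracy_score
        from sklearn.model_selection import KFold
        from sklearn.naive_bayes import GaussianNB

        X, y = load_iris(return_X_y=True)
        K = 10
        fold_accs = []
        for train_idx, test_idx in KFold(n_splits=K, shuffle=True,
                                         random_state=0).split(X):
            # Any two folds' training sets share K-2 of the K-1 parts each
            # contains; this overlap is the suspected source of dependence
            # among the fold accuracies.
            clf = GaussianNB().fit(X[train_idx], y[train_idx])
            fold_accs.append(accuracy_score(y[test_idx],
                                            clf.predict(X[test_idx])))

        fold_accs = np.array(fold_accs)
        print("fold accuracies:", np.round(fold_accs, 3))
        print("mean accuracy:  ", fold_accs.mean())
        # s^2 / K is an appropriate variance estimate for the mean accuracy
        # only if the fold accuracies are independent.
        print("s^2 / K:        ", fold_accs.var(ddof=1) / K)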

    Arlot, S. & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistical Surveys, 4, 40-79.

    Bengio, Y. & Grandvalet, Y. (2004). No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research, 5, 1089-1105.

    Borra, S. & Di Ciaccio, A. (2010). Measuring the prediction error. A comparison of cross-validation, bootstrap and covariance penalty methods. Computational Statistics and Data Analysis, 54(12), 2976-2989.

    Bouckaert, R. R. (2003). Choosing between two learning algorithms based on calibrated tests. Proceedings of the 20th International Conference on Machine Learning, Washington D.C., USA, 51-58.

    Casella, G. & Berger, R. L. (2002). Statistical Inference. USA: Brooks/Cole.

    Chen, C., Wang, Y., Chang, Y., & Ricanek, K. (2012). Sensitivity analysis with cross-validation for feature selection and manifold learning. Proceedings of the 9th International Conference on Advances in Neural Networks, Shenyang, China, 458-467.

    Corani, G. & Benavoli, A. (2015). A Bayesian approach for comparing cross-validated algorithms on multiple data sets. Machine Learning, 100(2-3), 285-304.

    Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895-1923.

    Feller, W. (1957). An Introduction to Probability Theory and Its Applications (Vol. I). USA: John Wiley & Sons.

    Han, J., Pei, J., & Kamber, M. (2011). Data Mining: Concepts and Techniques. USA: Morgan Kaufmann.

    Kim, J. H. (2009). Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap. Computational Statistics and Data Analysis, 53(11), 3735-3745.

    Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of International Joint Conference on Artificial Intelligence, Montreal, Canada, 1137-1143.

    Lichman, M. (2013). UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/datasets.html.

    Nadeau, C. & Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52(3), 239-281.

    Rodriguez, J. D., Perez, A., & Lozano, J. A. (2010). Sensitivity analysis of k-fold cross validation in prediction error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3), 569-575.

    Rodriguez, J. D., Perez, A., & Lozano, J. A. (2013). A general framework for the statistical analysis of the sources of variance for classification error estimators. Pattern Recognition, 46(3), 855-864.

    Wang, Y., Wang, R. B., Jia, H. C., & Li, J. H. (2014). Blocked 3×2 cross-validated t-test for comparing supervised classification learning algorithms. Neural Computation, 26(1), 208-235.

    Wong, T. T. (2015). Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognition, 48(9), 2839-2846.

    Full-text availability: on campus, available from 2022-07-26; off campus, not available.
    The electronic thesis has not been authorized for public release; please consult the library catalog for the print copy.