| Author (研究生): | 蔡敬泰 Tsai, Jing-Tai |
|---|---|
| Title (論文名稱): | K等分交叉驗證法所得分類正確率之相依性分析 (Dependency Analysis of the Accuracy Estimates Obtained from K-fold Cross Validation) |
| Advisor (指導教授): | 翁慈宗 Wong, Tzu-Tsung |
| Degree (學位類別): | Master |
| Department (系所名稱): | College of Management - Institute of Information Management |
| Year of Publication (論文出版年): | 2017 |
| Academic Year of Graduation (畢業學年度): | 105 |
| Language (語文別): | Chinese |
| Number of Pages (論文頁數): | 39 |
| Keywords (Chinese, 中文關鍵詞): | 相依性分析、獨立性、K等分交叉驗證法、訓練資料重複 |
| Keywords (English, 外文關鍵詞): | Dependency analysis, independence, K-fold cross validation, training data overlapping |
In data mining, classification accuracy is the usual criterion for judging the quality of a classification algorithm, and it is commonly estimated by K-fold cross validation. Many previous studies, however, hold that the accuracies obtained from the individual folds of K-fold cross validation are dependent on one another, because the training sets corresponding to any two test folds overlap. This study investigates whether the overlapping of training sets in K-fold cross validation is an appropriate basis for concluding that the fold accuracies are correlated, and designs a testing procedure to examine, for a given dataset, whether the fold accuracies obtained from K-fold cross validation can appropriately be assumed to be independent. Experiments on 18 datasets show that the overlapping of training sets does not imply that the fold accuracies are dependent, and that the choice of classification algorithm does not induce dependence among the fold accuracies; moreover, the overall proportion of rejected null hypotheses is close to the probability of committing a Type I error. The empirical results therefore indicate that the fold accuracies obtained from K-fold cross validation can appropriately be assumed to be independent.
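The overlap that motivates the dependency concern is easy to quantify: in K-fold cross validation on n examples, the training sets of any two folds share (K-2)/(K-1) of their examples. A minimal stdlib-only sketch (the function names here are illustrative, not from the thesis):

```python
# Sketch: quantify training-set overlap in K-fold cross validation.
# Any two folds' training sets share a fraction (K-2)/(K-1) of their examples.

def kfold_indices(n, k):
    """Partition indices 0..n-1 into k contiguous test folds (n divisible by k)."""
    fold_size = n // k
    return [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]

def training_overlap(n, k, i, j):
    """Fraction of fold i's training set that is shared with fold j's training set."""
    folds = kfold_indices(n, k)
    all_idx = set(range(n))
    train_i = all_idx - set(folds[i])  # training set = everything outside test fold i
    train_j = all_idx - set(folds[j])
    return len(train_i & train_j) / len(train_i)

# For K = 10, any two training sets overlap in (10-2)/(10-1) = 8/9 of their examples.
print(training_overlap(1000, 10, 0, 1))
```

For K = 10 the overlap is about 0.889, which is why the literature suspects the fold accuracies of being dependent; the thesis's point is that this overlap alone does not establish dependence.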
The performance of a classification algorithm is generally evaluated by K-fold cross validation. Several previous studies consider the accuracies obtained from K-fold cross validation to be dependent because of the overlapping of training sets. This study first investigates whether the dependency relationships among fold accuracies are caused by the overlapping of training sets. Then a statistical method is proposed to test the independence assumption for fold accuracies. The experimental results obtained from 18 datasets processed by four classification algorithms demonstrate that the overlapping of training sets is irrelevant to the dependency relationships among fold accuracies. The independence assumption is generally appropriate for fold accuracies, regardless of the number of folds and the algorithm used for classification.
On-campus access: full text available from 2022-07-26.