
Author: 林巧盈 (Lin, Chiao-Ying)
Thesis title: 探討K等分交叉驗證法對於分類器錯選率之研究
A Study on the Selection Error Rate of Classification Algorithms Evaluated by K-fold Cross Validation
Advisor: 翁慈宗 (Wong, Tzu-Tsung)
Degree: Master
Department: College of Management - Institute of Information Management
Year of publication: 2014
Academic year of graduation: 102 (ROC calendar, 2013-2014)
Language: Chinese
Number of pages: 55
Keywords (Chinese): K等分交叉驗證法、全樣本模型、錯選率
Keywords (English): full sample model, K-fold cross validation, selection error rate
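  • K-fold cross validation is commonly used to estimate the classification accuracy of a classifier. The method randomly partitions a data set into K equal folds; each fold in turn serves as the test data for a model trained on the remaining K-1 folds, and the resulting K accuracies are averaged. The best classifier is then chosen according to this average accuracy. In practice, a model is induced from all the data collected so far, called the full sample model in this study, and is used to predict any newly generated instance and to interpret the results. Because no other data are available to estimate the accuracy of the full sample model, the accuracy obtained from K-fold cross validation is generally used as its estimate. When classifiers are chosen this way, a selection error can occur: the classifier with the higher K-fold cross validation accuracy is chosen, yet its full sample model performs worse. This study uses thirty data sets. The experimental results show that the actual and theoretical selection error rates are close for some classifier pairs but do not agree for others. When a new model is chosen from among the models trained during K-fold cross validation, the model whose accuracy is the median of the K accuracies reduces, relative to the full sample model, the selection errors made in pairwise classifier comparisons. This suggests that, when a new instance arrives, this new model can be considered as a replacement for the full sample model for prediction and interpretation.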
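The K-fold procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration only, not the thesis's experimental code: the helper names (`k_fold_indices`, `train_majority`, `k_fold_accuracy`) are hypothetical, and a trivial majority-class predictor stands in for the real classifiers so that the example stays self-contained.

```python
import random
from collections import Counter
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Randomly partition the indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def train_majority(train_labels):
    """Hypothetical stand-in classifier: always predicts the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]


def k_fold_accuracy(labels, k, seed=0):
    """Return the K per-fold accuracies and their average."""
    accuracies = []
    for test in k_fold_indices(len(labels), k, seed):
        held_out = set(test)
        # Train on the other K-1 folds, then test on the held-out fold.
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        prediction = train_majority(train_labels)
        accuracies.append(sum(labels[i] == prediction for i in test) / len(test))
    return accuracies, mean(accuracies)


if __name__ == "__main__":
    labels = ["a"] * 8 + ["b"] * 2
    fold_accs, avg = k_fold_accuracy(labels, k=5)
    print(fold_accs, avg)
```

Swapping `train_majority` for a real learner changes only the training and prediction lines; the fold rotation and the averaging stay the same.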

    The performance of classification algorithms is generally evaluated by K-fold cross validation to find the one with the highest accuracy. The model induced from all available data by the best classification algorithm, called the full sample model, is then used for prediction and interpretation. Since there are no extra data to evaluate the full sample model resulting from the best algorithm, its prediction accuracy can be lower than that of the full sample model induced by another classification algorithm; this is called a selection error. This study designs an experiment to calculate and estimate the selection error rate, and proposes a new model for reducing it. The classification algorithms considered are decision tree, naïve Bayesian classifier, logistic regression, and support vector machine. The experimental results on 30 data sets show that the actual and estimated selection error rates can differ greatly in several cases. The new model, chosen as the one with the median accuracy among the models trained in K-fold cross validation, can reduce the selection error rate without sacrificing prediction accuracy.
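To make the notion of a selection error concrete, the sketch below checks whether the algorithm preferred by cross validation is the one whose full sample model is actually worse, and picks the fold model with the median accuracy. It rests on assumptions the abstract does not spell out: the function names are hypothetical, the full sample model accuracies are assumed to come from separately held-out data, ties are not counted as errors, and the lower median is used when K is even.

```python
from statistics import median_low


def is_selection_error(cv_acc_a, cv_acc_b, full_acc_a, full_acc_b):
    """True when cross validation prefers the algorithm whose full sample model is worse."""
    if cv_acc_a == cv_acc_b or full_acc_a == full_acc_b:
        return False  # assumption of this sketch: ties are not counted as errors
    return (cv_acc_a > cv_acc_b) != (full_acc_a > full_acc_b)


def selection_error_rate(trials):
    """Fraction of trials with a selection error.

    Each trial is a tuple (cv_acc_a, cv_acc_b, full_acc_a, full_acc_b).
    """
    trials = list(trials)
    return sum(is_selection_error(*t) for t in trials) / len(trials)


def median_fold_index(fold_accuracies):
    """Index of the fold model whose accuracy is the (lower) median of the K accuracies."""
    return fold_accuracies.index(median_low(fold_accuracies))


if __name__ == "__main__":
    trials = [
        (0.90, 0.85, 0.80, 0.83),  # CV prefers A, but B's full sample model is better
        (0.90, 0.85, 0.84, 0.83),  # CV and full sample models agree
    ]
    print(selection_error_rate(trials))
    print(median_fold_index([0.70, 0.90, 0.80, 0.60, 0.85]))
```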

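    Chapter 1 Introduction (p. 1)
    1.1 Research Background and Motivation (p. 1)
    1.2 Research Objectives (p. 2)
    1.3 Research Framework (p. 3)
    Chapter 2 Literature Review (p. 4)
    2.1 A Study of Classification Accuracy and Consistency between K-fold Cross Validation and the Full Data Model (p. 4)
    2.2 K-fold Cross Validation (p. 5)
    2.2.1 Bias and Variance (p. 5)
    2.2.2 Applications of K-fold Cross Validation (p. 6)
    2.3 Classifiers (p. 7)
    2.3.1 Decision Tree (p. 7)
    2.3.2 Naïve Bayesian Classifier (p. 9)
    2.3.3 Support Vector Machine (p. 11)
    2.3.4 Logistic Regression (p. 14)
    2.4 Summary (p. 15)
    Chapter 3 Research Method (p. 16)
    3.1 Calculating the Actual Selection Error Rate (p. 17)
    3.2 Deriving the Theoretical Selection Error Rate (p. 18)
    3.3 Recommending a Suitable Model (p. 21)
    3.4 Classifiers (p. 22)
    Chapter 4 Empirical Study (p. 24)
    4.1 Data Set Characteristics (p. 24)
    4.2 Calculating the Actual Selection Error Rate (p. 26)
    4.2.1 Actual Selection Error Rate of Each Data Set (p. 26)
    4.2.2 Summary of the Actual Selection Error Rate (p. 29)
    4.3 Deriving the Theoretical Selection Error Rate (p. 29)
    4.3.1 Theoretical Selection Error Rate of Each Data Set (p. 29)
    4.3.2 Comparison of the Actual and Theoretical Selection Error Rates (p. 33)
    4.3.3 Summary of the Theoretical Selection Error Rate (p. 34)
    4.4 Selection Error Rate of the New Models (p. 34)
    4.4.1 Selection Error Rate of the First New Model (p. 35)
    4.4.2 Selection Error Rate of the Second New Model (p. 36)
    4.4.3 Selection Error Rate of the Third New Model (p. 37)
    4.4.4 Selection Error Rate of the Fourth New Model (p. 39)
    4.4.5 Selection Error Rate of the Fifth New Model (p. 40)
    4.4.6 Summary of the Selection Error Rates of the New Models (p. 41)
    Chapter 5 Conclusions and Future Work (p. 45)
    5.1 Conclusions (p. 45)
    5.2 Future Work (p. 46)
    References (p. 47)
    Appendix 1 Average Accuracy of the Full Sample Model (p. 50)
    Appendix 2 Average Accuracy of the First New Model (p. 51)
    Appendix 3 Average Accuracy of the Second New Model (p. 52)
    Appendix 4 Average Accuracy of the Third New Model (p. 53)
    Appendix 5 Average Accuracy of the Fourth New Model (p. 54)
    Appendix 6 Average Accuracy of the Fifth New Model (p. 55)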

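    陳映伊 (2013). 探討K等分交叉驗證法與全資料模型間分類正確性與一致性之研究 [A study of classification accuracy and consistency between K-fold cross validation and the full data model]. Master's thesis, Institute of Information Management, National Cheng Kung University.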
    Astudillo, C. A. and Oommen, B. J. (2013). On achieving semi-supervised pattern recognition by utilizing tree-based SOMs. Pattern Recognition, 46(1), 293-304.
    Bache, K. and Lichman, M. (2013). UCI Machine Learning Repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
    Ballings, M. and Poel, D. V. D. (2012). Customer event history for churn prediction: How long is long enough? Expert Systems with Applications, 39(18), 13517-13522.
    Catal, C., Sevim, U., and Diri, B. (2011). Practical development of an Eclipse-based software fault prediction tool using Naive Bayes algorithm. Expert Systems with Applications, 38(3), 2347-2353.
    Cawley, G. C. and Talbot, N. L. C. (2010). On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research, 11, 2079-2107.
    Chattopadhyay, S., Davis, R. M., Menezes, D. D., Singh, G., Acharya, R. U., and Tamura, T. (2012). Application of Bayesian classifier for the diagnosis of dental pain. Journal of Medical Systems, 36(3), 1425-1439.
    Chen, C., Wang, Y., Chang, Y., and Ricanek, K. (2012). Sensitivity analysis with cross-validation for feature selection and manifold learning. In Advances in Neural Networks–ISNN 2012 (pp. 458-467). Springer Berlin Heidelberg.
    Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
    de Lorimier, A. and El-Geneidy, A. M. (2013). Understanding the Factors Affecting Vehicle Usage and Availability in Carsharing Networks: A Case Study of Communauto Carsharing System from Montreal, Canada. International Journal of Sustainable Transportation, 7(1), 35-51.
    Hwang, K. S., Chen, Y. J., Jiang, W. C., and Yang, T. W. (2012). Induced states in a decision tree constructed by Q-learning. Information Sciences, 213, 39-49.
    Karami, G., Attaran, N., S. M. S., and Hossein, S. M. S. (2012). Bankruptcy Prediction, Accounting Variables and Economic Development: Empirical Evidence from Iran. International Business Research, 5(8), 147-152.
    López, M. and Iglesias, G. (2013). Artificial Intelligence for estimating infragravity energy in a harbour. Ocean Engineering, 7, 56-63.
    Marcot, B. G. (2012). Metrics for evaluating performance and uncertainty of Bayesian network models. Ecological Modelling, 230, 50-62.
    Nahar, J., Imam, T., Tickle, K. S., and Chen, Y. P. P. (2013). Computational intelligence for heart disease diagnosis: A medical knowledge driven approach. Expert Systems with Applications, 40(1), 96-104.
    Rodriguez, J. D., Perez, A., and Lozano, J. A. (2010). Sensitivity Analysis of k-Fold Cross Validation in Prediction Error Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3), 569-575.
    Rodriguez, J. D., Perez, A., and Lozano, J. A. (2013). A General Framework for the Statistical Analysis of the Sources of Variance for Classification Error Estimators. Pattern Recognition, 46(3), 855-864.
    Shao, C., Paynabar, K., Kim, T. H., Jin, J. J., Hu, S. J., Spicer, J. P., Wang, H., and Abell, J. A. (2013). Feature selection for manufacturing process monitoring using cross-validation. Journal of Manufacturing Systems. doi: 10.1016/j.jmsy.2013.05.006
    Sun, J. and Li, H. (2012). Financial distress prediction using support vector machines: Ensembles vs. individual. Applied Soft Computing, 12(8), 2254-2265.
    Valle, M. A., Varas, S., and Ruz, G. A. (2012). Job performance prediction in a call center using a naive Bayes classifier. Expert Systems with Applications, 39(11), 9939-9945.

    Full text availability: on campus, public from 2019-07-30
    Off campus, public from 2019-07-30