Author: 黃宜音 (Huang, Yi-Yin)
Thesis title: 探討K等分交叉驗證法改善分類器錯選率之新模型研究
A study on new models for improving the selection error rate among classification algorithms evaluated by K-fold cross validation
Advisor: 翁慈宗 (Wong, Tzu-Tsung)
Degree: Master
Department: College of Management - Institute of Information Management
Year of publication: 2015
Academic year of graduation: 103 (ROC calendar)
Language: Chinese
Number of pages: 52
Keywords: K-fold cross validation, full sample model, selection error rate
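  • In the classification area of data mining, feeding a record into a classification model yields an inferred class label. When a new record arrives, the usual practice is to induce a full sample model from all available data and to use it for predicting the new record and interpreting the results. Because no new data are actually at hand, K-fold cross validation is generally used to estimate the classification accuracy of the full sample model. K-fold cross validation randomly partitions the data set into K folds of approximately equal size; each fold takes a turn as the test data for the model learned from the remaining K-1 folds, which yields K models and K accuracies. The average of these K accuracies is the estimate of the full sample model's prediction accuracy, and this estimate is then used to pick the better classification algorithm. Previous studies have found that this procedure can lead to a selection error: the algorithm with the higher K-fold cross validation accuracy is chosen, yet the full sample model it produces performs worse. This study uses thirty data sets and repeats the experiments to obtain more reliable selection error rates. The empirical results show that the selection error rate changes little as K increases, that it rises as the number of candidate classification algorithms grows, that data sets with more instances have lower selection error rates, and that mixed-type data sets have higher ones. To choose a suitable new model, this study either modifies class values or picks, from the K models, the one whose performance is closest to the K-fold cross validation result, and validates it as the new model. The results suggest that, depending on the combination of classification algorithms, a different kind of new model can be chosen to replace the full sample model, lowering the selection error rate while maintaining accuracy. Such a new model is therefore better suited than the full sample model for predicting new data and interpreting the results.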
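    The K-fold procedure described above can be sketched in a few lines. This is a minimal illustration, not the thesis's implementation: the majority-class classifier is a toy stand-in for the real algorithms, and all function names (`k_fold_indices`, `train_majority`, `k_fold_accuracy`, `closest_fold`) are hypothetical. The last helper shows one plausible reading of "the model closest to the K-fold cross validation result", namely the fold model whose accuracy is nearest to the K-fold mean.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k folds of nearly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def train_majority(labels):
    """Toy stand-in classifier: always predict the most common training label."""
    return Counter(labels).most_common(1)[0][0]


def k_fold_accuracy(labels, k, seed=0):
    """Return (mean accuracy, list of the k per-fold accuracies)."""
    folds = k_fold_indices(len(labels), k, seed)
    accs = []
    for test in folds:
        held_out = set(test)
        # Learn from the other k-1 folds, then test on the held-out fold.
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        pred = train_majority(train_labels)
        accs.append(sum(labels[i] == pred for i in test) / len(test))
    return sum(accs) / k, accs


def closest_fold(accs):
    """Index of the fold model whose accuracy is nearest to the K-fold mean
    (an assumed reading of the selection rule, not confirmed by the record)."""
    mean = sum(accs) / len(accs)
    return min(range(len(accs)), key=lambda i: abs(accs[i] - mean))
```

    For example, `k_fold_accuracy(["a"] * 70 + ["b"] * 30, k=10)` trains ten toy models and averages their ten accuracies; a real experiment would replace `train_majority` with a decision tree, naïve Bayesian classifier, logistic regression, or support vector machine.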

    The performance of classification algorithms is generally evaluated by K-fold cross validation to find the one with the highest accuracy. The model induced from all available data by the best classification algorithm, called the full sample model, is then used for prediction and interpretation. Since there are no extra data for evaluating the full sample model produced by the best algorithm, its prediction accuracy can be lower than that of the full sample model induced by another classification algorithm; this is called a selection error. The experimental results of some previous studies showed that the actual and the estimated selection error rates can differ greatly in several cases. This study repeats the experiments to stabilize the estimated selection error rates, and proposes new models for reducing the selection error rate without sacrificing prediction accuracy. The classification algorithms considered in this study are decision tree, naïve Bayesian classifier, logistic regression, and support vector machine. This study investigates the impact of the number of classification algorithms, the number of folds, and the characteristics of data sets on the selection error rate, and proposes three methods for generating new models that reduce the selection error rate. The experimental results on thirty data sets show that the selection error rate increases as the number of classification algorithms increases, while the number of folds has little effect on it. The new models proposed in this study can effectively reduce the selection error rate for interpreting learning results.
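    The definition of a selection error can be made concrete with a short sketch. It assumes that, for each trial, the K-fold accuracy estimate and the full sample model's accuracy of every algorithm are already available as dictionaries; how the thesis actually measures the full sample model's accuracy is not stated in this record. The function names and the accuracy figures in the usage example are hypothetical and purely illustrative.

```python
def is_selection_error(cv_acc, full_acc):
    """cv_acc and full_acc map algorithm name -> accuracy.

    The algorithm with the highest K-fold estimate is chosen (ties go to the
    first one listed). A selection error occurs when some other algorithm's
    full sample model is more accurate than the chosen algorithm's.
    """
    chosen = max(cv_acc, key=cv_acc.get)
    return any(full_acc[a] > full_acc[chosen] for a in full_acc if a != chosen)


def selection_error_rate(trials):
    """Fraction of (cv_acc, full_acc) trials that end in a selection error."""
    errors = sum(is_selection_error(cv, full) for cv, full in trials)
    return errors / len(trials)


# Illustrative, made-up accuracies for two repetitions of an experiment.
trials = [
    ({"tree": 0.90, "nb": 0.80}, {"tree": 0.85, "nb": 0.88}),  # selection error
    ({"tree": 0.90, "nb": 0.80}, {"tree": 0.90, "nb": 0.80}),  # correct choice
]
rate = selection_error_rate(trials)
```

    With more candidate algorithms in each dictionary there are more ways for some other algorithm to beat the chosen one, which is consistent with the abstract's finding that the selection error rate rises with the number of algorithms.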

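    Chapter 1 Introduction
    1.1 Research background
    1.2 Research objectives
    1.3 Thesis organization
    Chapter 2 Literature review
    2.1 K-fold cross validation
    2.1.1 Bias and variance
    2.1.2 Applications of K-fold cross validation
    2.2 Selection error rate
    2.3 Classification algorithms
    2.3.1 Decision tree
    2.3.2 Naïve Bayesian classifier
    2.3.3 Support vector machine
    2.3.4 Logistic regression
    2.4 Summary
    Chapter 3 Methodology
    3.1 Computing the selection error rate
    3.2 Recommending a suitable model
    3.2.1 New model 1
    3.2.2 New model 2
    3.2.3 New model 3
    3.3 Classification algorithms
    Chapter 4 Empirical study
    4.1 Data set characteristics
    4.2 Empirical selection error rates
    4.2.1 Selection error rates with multiple classification algorithms
    4.2.2 Selection error rates for different values of K
    4.2.3 Summary of empirical selection error rates
    4.3 Selection error rates of the new models
    4.3.1 Selection error rate of new model 1
    4.3.2 Selection error rate of new model 2
    4.3.3 Selection error rate of new model 3
    4.3.4 Summary of new model selection error rates
    Chapter 5 Conclusions and future work
    5.1 Conclusions
    5.2 Future work
    References
    Appendix 1 Selection error rates of two-algorithm combinations for each data set
    Appendix 2 Selection error rates of three- and four-algorithm combinations for each data set
    Appendix 3 New model 1: selection error rates of two-algorithm combinations for each data set
    Appendix 4 New model 1: selection error rates of three- and four-algorithm combinations for each data set
    Appendix 5 New model 2: selection error rates of two-algorithm combinations for each data set
    Appendix 6 New model 2: selection error rates of three- and four-algorithm combinations for each data set
    Appendix 7 New model 3: selection error rates of two-algorithm combinations for each data set
    Appendix 8 New model 3: selection error rates of three- and four-algorithm combinations for each data set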


    Full text availability: on campus, public from 2020-07-30
    Off campus, public from 2020-07-30