| Author: | 陳映伊 (Chen, Ying-Yi) |
|---|---|
| Title: | 探討K等分交叉驗證法與全資料模型間分類正確性與一致性之研究 (A Study for Investigating Classification Accuracy and Consistency between K-fold Cross Validation and Complete-Data Model) |
| Advisor: | 翁慈宗 (Weng, Tzu-Tsung) |
| Degree: | Master |
| Department: | Institute of Information Management, College of Management |
| Year of Publication: | 2013 |
| Academic Year: | 101 (ROC calendar) |
| Language: | Chinese |
| Pages: | 51 |
| Keywords (Chinese): | K等分交叉驗證法、不一致率 |
| Keywords (English): | K-fold cross validation, inconsistent rate |
In classification applications, analysts generally use K-fold cross validation to find the classifier with the best performance, and then let that classifier learn a model from all available data for predicting and interpreting new data. K-fold cross validation randomly partitions the available data into K mutually exclusive folds; each fold in turn serves as the test set for the model learned from the other K-1 folds. The average of the K resulting accuracies is taken as an estimate of the prediction accuracy of the model learned from all available data. However, this procedure does not guarantee that the classifier judged best by K-fold cross validation will also induce, from all available data, the model with the highest prediction accuracy on new data among the candidate classifiers.
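The procedure described above can be sketched in pure Python. This is an illustrative sketch only: a hypothetical nearest-centroid rule on a one-dimensional feature stands in for the thesis's actual classifiers (which include decision trees), and the fold-splitting and averaging follow the description in the abstract.

```python
import random
from statistics import mean

def k_fold_accuracy(data, k=10, seed=0):
    """data: list of (feature, label) pairs. Returns the mean accuracy
    over K mutually exclusive folds, as in K-fold cross validation."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)                       # random partition of the data
    folds = [shuffled[i::k] for i in range(k)]  # K mutually exclusive folds
    accuracies = []
    for i in range(k):
        test = folds[i]                         # each fold is the test set in turn
        train = [row for j in range(k) if j != i for row in folds[j]]
        model = fit_centroids(train)            # learn from the other K-1 folds
        correct = sum(predict(model, x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return mean(accuracies)                     # estimate for the complete-data model

def fit_centroids(train):
    """Toy classifier: mean feature value per class label."""
    sums, counts = {}, {}
    for x, y in train:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    """Assign the label whose centroid is nearest to x."""
    return min(model, key=lambda y: abs(x - model[y]))
```

On a well-separated toy data set, the mean fold accuracy should approach the accuracy of the model learned from all available data.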
This study first designs an experiment to investigate whether the mean accuracy resulting from K-fold cross validation is a good estimate for the prediction accuracy of the model learned from all available data. An inconsistent rate is then introduced to measure the prediction consistency between the model learned from all available data and the K models induced from K-fold cross validation. When the inconsistent rate is small, using the model learned from all available data for prediction and interpretation will be appropriate.
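One plausible formalization of the inconsistent rate is sketched below; the thesis's exact definition may differ. The assumption here is that the rate is the average, over the K fold-models, of the fraction of new instances on which a fold-model's prediction disagrees with that of the model learned from all available data. The one-dimensional threshold "classifier" is purely illustrative.

```python
import random
from statistics import mean

def fit(train):
    """Toy classifier: threshold at the midpoint of the two class means
    (labels 0 and 1)."""
    m0 = mean(x for x, y in train if y == 0)
    m1 = mean(x for x, y in train if y == 1)
    return (m0 + m1) / 2.0

def predict(threshold, x):
    return 0 if x < threshold else 1

def inconsistent_rate(available, new_xs, k=10, seed=0):
    """Average disagreement between the complete-data model and the K
    fold-models when both predict on the new instances new_xs."""
    rng = random.Random(seed)
    shuffled = available[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    full_model = fit(available)                  # model from all available data
    full_preds = [predict(full_model, x) for x in new_xs]
    rates = []
    for i in range(k):
        train = [row for j in range(k) if j != i for row in folds[j]]
        fold_model = fit(train)                  # model from the other K-1 folds
        disagree = sum(predict(fold_model, x) != p
                       for x, p in zip(new_xs, full_preds))
        rates.append(disagree / len(new_xs))
    return mean(rates)                           # small value: complete-data model is consistent
```

A small returned value indicates that the fold-models and the complete-data model largely agree on new data, supporting the use of the complete-data model for prediction and interpretation.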
The experimental results on 30 data sets indicate that the mean accuracy resulting from K-fold cross validation and the prediction accuracy on new data of the model induced from all available data are generally not significantly different. However, since the probability that these two values differ by more than one percentage point is above 0.6, choosing a classifier by its cross-validation accuracy carries a probability larger than 0.3 of picking one with a lower prediction accuracy on new data. The inconsistent rate shows that, among the four classifiers adopted in this study, decision tree learning is the least suitable for generating a model from all available data for prediction and interpretation.