| Graduate student: | 涂宗裕 Tu, Tsung-Yu |
|---|---|
| Thesis title: | 重抽樣技巧做資料分類 Classification of Data by Resampling Techniques |
| Advisor: | 溫敏杰 Wen, Min-Jie |
| Degree: | Master |
| Department: | College of Management, Department of Statistics |
| Year of publication: | 2011 |
| Academic year of graduation: | 99 |
| Language: | Chinese |
| Pages: | 40 |
| Keywords (Chinese): | 重複抽樣技巧、資料採礦、資料分類 |
| Keywords (English): | Resampling Techniques, Data Mining, Classification of Data |
Advances in computer technology have made it easier to collect and store large volumes of data. Traditionally, data have been analyzed and information extracted with statistical methods such as descriptive statistics, regression analysis, categorical data analysis, and multivariate analysis. In recent years, data mining methods have emerged that build on these traditional statistical methods and aim to handle high-dimensional data with very large numbers of records. In this study, we apply both statistical methods and data mining methods to several types of real data and compare the results. We use resampling to partition the data and to cross-validate, and we evaluate each method against assessment criteria in order to examine the data mining methods in depth. The resampling simulations show that classification performance is not entirely the same across different types of data, but that logistic regression analysis is, on the whole, the better classification method. In addition, for high-dimensional data, resampling is the more feasible approach to analysis.
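The resampling procedure described in the abstract can be illustrated with a minimal sketch of repeated random train/test splitting applied to a logistic regression classifier. The abstract does not state the data sets, split ratio, number of repetitions, or software used in the thesis, so everything here is an assumption made for illustration: the synthetic two-class data, the 70/30 split, the 20 repetitions, and the helper names `make_data`, `fit_logistic`, `predict_logistic`, and `repeated_holdout` are all hypothetical.

```python
import math
import random


def make_data(n, rng):
    """Synthetic two-class data (assumed): class 1 is shifted by +2 on both features."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0)])
        y.append(label)
    return X, y


def fit_logistic(X, y, lr=0.5, epochs=200):
    """Fit logistic regression by batch gradient descent; returns [bias, w1, w2, ...]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def predict_logistic(w, x):
    """Predict class 1 when the linear score is positive (probability above 0.5)."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1 if z > 0 else 0


def repeated_holdout(X, y, n_repeats=20, test_fraction=0.3, seed=0):
    """Repeatedly shuffle, split into train/test, refit, and record test accuracy."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    n_test = int(len(X) * test_fraction)
    accuracies = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        w = fit_logistic([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict_logistic(w, X[i]) == y[i] for i in test)
        accuracies.append(correct / n_test)
    return accuracies


X, y = make_data(200, random.Random(42))
accs = repeated_holdout(X, y)
mean_acc = sum(accs) / len(accs)
print(f"mean test accuracy over {len(accs)} splits: {mean_acc:.3f}")
```

Averaging accuracy over many random splits gives a steadier estimate than a single split. To compare methods in the way the abstract describes, the same splits would be reused for each classifier and the averaged criteria compared.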