| Graduate student: | 王國河 Wang, Kuo-Ho |
|---|---|
| Thesis title: | 整合叢集與迴歸技術以處理大型資料庫遺失值問題之新方法 A New Method for Handling Missing Values in Large Databases by Integrating Clustering and Regression Techniques |
| Advisor: | 曾新穆 Tseng, Shin-Mu |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2002 |
| Academic year of graduation: | 90 |
| Language: | Chinese |
| Number of pages: | 72 |
| Keywords (Chinese): | 資料探勘, 遺失值, 叢集分析, 迴歸分析, 資料清理 |
| Keywords (English): | data cleaning, regression analysis, clustering analysis, missing value, data mining |
Data mining has recently become a very popular research area. It is the process of extracting useful knowledge from large databases for specific purposes. However, the quality of data mining results suffers substantially when the database contains missing values, so handling missing values effectively is an important topic. Although a number of methods for handling missing values have been proposed, none of them handles every type of missing value well, since different types of datasets may need different solutions. In this thesis, we propose a new approach to handling missing values in datasets that exhibit cluster structure. The proposed approach integrates clustering analysis and regression analysis so that missing values can be recovered suitably when the dataset has such cluster properties. In empirical evaluation, the proposed approach recovered missing values better than earlier methods across various types of datasets.
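The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of combining clustering with regression, not the method developed in the thesis. It assumes a toy setting with two numeric attributes where only the second may be missing, uses plain k-means with a deterministic farthest-first initialisation, and fits one simple least-squares line per cluster. All function names (`kmeans`, `fit_line`, `impute`) and the sample data are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100):
    """Plain k-means; initial centroids are picked farthest-first (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope

def impute(rows, k=2):
    """Fill missing y values (None) using the regression line of the nearest cluster."""
    complete = [r for r in rows if r[1] is not None]
    centroids, clusters = kmeans(complete, k)
    models = [fit_line(c) if c else None for c in clusters]
    filled = []
    for x, y in rows:
        if y is None:
            # Only x is observed, so pick the cluster whose centroid is closest in x.
            candidates = [j for j in range(k) if models[j] is not None]
            j = min(candidates, key=lambda j: abs(x - centroids[j][0]))
            intercept, slope = models[j]
            y = intercept + slope * x
        filled.append((x, y))
    return filled

# Hypothetical data: two clusters, each with its own linear trend.
rows = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (2.5, None),
        (100, 50), (101, 49), (102, 48), (103, 47), (104, 46), (101.5, None)]
filled = impute(rows, k=2)
print(filled[5], filled[-1])  # (2.5, 6.0) (101.5, 48.5)
```

A single regression line fitted to all ten complete rows would blend the two opposing trends, which is the kind of situation where fitting per cluster can recover values more closely.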