| Graduate student: | 張哲仁 Chang, Che-Jen |
|---|---|
| Thesis title: | 應用可自定群數的非監督式學習法於基因選取 A clustering method with a pre-specified number of clusters for gene selection |
| Advisor: | 翁慈宗 Wong, Tzu-Tsung |
| Degree: | Master |
| Department: | College of Management - Institute of Information Management |
| Year of publication: | 2008 |
| Academic year of graduation: | 96 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 55 |
| Keywords (translated from Chinese): | gene microarray, cluster analysis, gene selection |
| Keywords (English): | microarray data, clustering analysis, gene selection |
Scientists use gene microarrays (DNA microarrays) to study how the expression of specific genes differs across cells and tissues, hoping to understand the causes of diseases such as cancer. Microarray data, however, are high-dimensional and contain few samples, so classification often suffers from overfitting and from the inclusion of many irrelevant genes. Gene selection therefore plays an important role in mitigating these problems. Although many gene selection methods have been proposed in recent years, most of them select genes individually, so they are subject to gene collinearity and do not consider combinations of genes. To address this, Liu (劉冠良, 2007) proposed an integration approach that combines individual gene ranking with a clustering algorithm to select relevant genes and remove redundant ones. That approach can replace combinations of three or more genes, but the number of clusters cannot be set precisely as a pre-specified parameter, which makes processing harder and limits improvements in speed and stability. Starting from this limitation, this research proposes an efficient clustering method. Tests on five microarray data sets show that it not only improves computational efficiency but also generates stable clustering results, which are well suited to selecting and replacing genes in microarray data. When the clustering method is applied to gene selection, the resulting accuracies are close to those of the original integration approach.
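Chinese-language references
劉冠良 (2007). 以叢集分析與距離測度為基礎之基因選取法 [A gene selection method based on cluster analysis and distance measures], Master's thesis, Institute of Information Management, National Cheng Kung University.
English-language references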
Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. (1998). Automatic subspace clustering of high dimensional data for data mining applications, Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pp.94-105, Seattle, Washington.
Ankerst, M., Breunig, M. M., Kriegel, H. P., and Sander, J. (1999). OPTICS: ordering points to identify the clustering structure, ACM SIGMOD Record, Vol.28, No.2, pp.49-60.
Dash, M., Choi, K., Scheuermann, P., and Liu, H. (2002). Feature Selection for Clustering - A Filter Solution, Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM'02), pp.115, Maebashi City, Japan.
Dash, M. and Liu, H. (1997). Feature selection for classification, Intelligent Data Analysis - An International Journal, Vol.1, pp.131-156.
Davies, D. L. and Bouldin, D. W. (1979). A Cluster Separation Measure, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.1, No.2, pp.224-227.
Devaney, M. and Ram, A. (1997). Efficient Feature Selection in Conceptual Clustering, Proceedings of the Fourteenth International Conference on Machine Learning, pp.92-97, Nashville, Tennessee, USA.
Dy, J. G. and Brodley, C. E. (2000). Feature Subset Selection and Order Identification for Unsupervised Learning, Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, CA, USA.
Dy, J. G. and Brodley, C. E. (2004). Feature Selection for Unsupervised Learning, The Journal of Machine Learning Research, Vol.5, pp.845-889.
Ester, M., Kriegel, H. P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp.226-231, Portland, Oregon.
Estivill-Castro, V. and Lee, I. (2000). AUTOCLUST: Automatic Clustering via Boundary Extraction for Massive Point Data Sets, Proceedings of the 5th International Conference on Geocomputation, pp.23-25, University of Greenwich, Kent, UK.
Guha, S., Rastogi, R., and Shim, K. (1998). CURE: an efficient clustering algorithm for large databases, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, pp.73-84, Seattle, Washington, USA.
Guha, S., Rastogi, R., and Shim, K. (1999). ROCK: a robust clustering algorithm for categorical attributes, Proceedings of the 15th International Conference on Data Engineering, pp.512-521, Sydney, Australia.
Hinneburg, A. and Keim, D. A. (1998). An efficient approach to clustering in large multimedia databases with noise, Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, pp.58-65, New York, USA.
Jaeger, J., Sengupta, R., and Ruzzo, W. L. (2003). Improved gene selection for classification of microarrays, Pacific Symposium on Biocomputing, pp.53-64.
Jain, A. K. and Dubes, R. C. (1988). Algorithms for clustering data, Englewood Cliffs: Prentice Hall.
Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data clustering: a review, ACM Computing Surveys (CSUR), Vol.31, No.3, pp.264-323.
Jiang, D., Tang, C., and Zhang, A. (2004). Cluster Analysis for Gene Expression Data: A Survey, IEEE Transactions on Knowledge and Data Engineering, Vol.16, No.11, pp.1370-1386.
Karypis, G., Han, E. H., and Kumar, V. (1999). CHAMELEON: a hierarchical clustering algorithm using dynamic modeling, IEEE Computer, Vol.32, No.8, pp.68-75.
Kaufman, L. and Rousseeuw, P. J. (1987). Clustering by means of Medoids, in Statistical Data Analysis Based on the L1-Norm and Related Methods, pp.405-416, North-Holland Publishing Company, Elsevier, Amsterdam.
Kaufman, L. and Rousseeuw, P. J. (1990). Finding groups in data: an introduction to cluster analysis, pp.342, John Wiley & Sons, New York.
Khan, S. S. and Ahmad, A. (2004). Cluster center initialization algorithm for K-means clustering, Pattern Recognition Letters, Vol.25, pp.1293-1302.
Liu, H. and Yu, L. (2005). Toward Integrating Feature Selection Algorithms for Classification and Clustering, IEEE Transactions on Knowledge and Data Engineering, Vol.17, No.4, pp.491-502.
MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp.281-297.
Mitra, P., Murthy, C. A., and Pal, S. K. (2002). Unsupervised Feature Selection Using Feature Similarity, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.24, No.3, pp.301-312.
Nguyen, D. V. and Rocke, D. M. (2002). Tumor classification by partial least squares using microarray gene expression data, Bioinformatics, Vol.18, pp.39-50.
Qian, W. N. and Zhou, A. Y. (2002). Analyzing popular clustering algorithms from different viewpoints, Journal of Software, Vol.13, No.8, pp.1382-1394.
Sheikholeslami, G., Chatterjee, S., and Zhang, A. (1998). WaveCluster: a multi-resolution clustering approach for very large spatial databases, Proceedings of the 24th VLDB Conference on Very Large Data Bases, pp.428-439, New York, USA.
Wang, W., Yang, J., and Muntz, R. (1997). STING: A Statistical Information Grid Approach to Spatial Data Mining, Proceedings of the 23rd VLDB Conference on Very Large Data Bases, pp.186-195, Athens, Greece.
Wismath, S. K., Soong, H. P., and Akl, S. G. (1981). Feature selection by interactive clustering, Pattern Recognition, Vol.14, pp.75-80.
Xing, E. P., Jordan, M. I., and Karp, R. M. (2001). Feature selection for high-dimensional genomic microarray data, Proceedings of the Eighteenth International Conference on Machine Learning, pp.601-608, Williamstown, MA, USA.
Xing, E. (2002). Feature Selection in Microarray Analysis, chapter 6, pp.110-131, Kluwer Academic Publishers.
Xu, X., Ester, M., Kriegel, H. P., and Sander, J. (1998). A Nonparametric Clustering Algorithm for Knowledge Discovery in Large Spatial Databases, Proceedings of the Fourteenth International Conference on Data Engineering, Orlando, Florida, USA.
Zhang, T., Ramakrishnan, R., and Livny, M. (1996). BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, pp.103-114, Montreal, Quebec, Canada.