
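Graduate student: Chang, Che-Jen (張哲仁)
Thesis title: A clustering method with a pre-specified number of clusters for gene selection (應用可自定群數的非監督式學習法於基因選取)
Advisor: Wong, Tzu-Tsung (翁慈宗)
Degree: Master
Department: College of Management, Institute of Information Management
Year of publication: 2008
Academic year of graduation: 96 (ROC calendar)
Language: Chinese
Number of pages: 55
Chinese keywords: gene microarray (基因微陣列), cluster analysis (群集分析), gene selection (基因選取)
English keywords: microarray data, clustering analysis, gene selection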
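  • Scientists use gene microarrays to study how the expression levels of particular genes differ across cells and tissues, hoping to understand the causes of diseases such as cancer. Microarray data, however, are high-dimensional and have few samples, so classification often suffers from overfitting and from the inclusion of too many irrelevant genes. Gene selection therefore plays an important role in mitigating these problems. Although many gene selection methods have been proposed in recent years, most of them select genes individually and thus suffer from gene collinearity and a lack of consideration for gene combinations. To address this, Liu (劉冠良, 2007) proposed combining individual gene ranking with a clustering algorithm to select relevant genes and eliminate redundant ones. Although that method can replace combinations of three or more genes, the number of clusters the user wants cannot be set precisely as a parameter of the algorithm, which makes processing more difficult and in turn limits improvements in speed and stability. Taking this as its starting point, this study proposes an efficient clustering algorithm that both improves performance and yields stable clustering results. These stable results are also well suited to selecting and replacing genes in microarray data.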

    Scientists use gene microarray (DNA microarray) data to study differences in gene expression among cells and tissues, usually hoping to find the root causes of a specific disease. Because microarray data are high-dimensional and contain few samples, classifiers built on them tend to overfit and may be affected by noisy, irrelevant genes. Gene selection therefore plays an important role in addressing these problems. Although many gene selection methods have been proposed in recent years, most of them select genes individually and may suffer from gene collinearity and a lack of consideration for gene combinations. To solve those problems, an approach that integrates individual gene ranking methods with clustering methods has been proposed to select a proper gene subset for classification. However, because the number of clusters cannot be pre-specified as a clustering parameter in that approach, its computational efficiency and stability are problematic. This research proposes an efficient clustering method in which the number of clusters can be pre-specified. Tests on five microarray data sets show that the method not only improves computational efficiency but also generates stable clustering results. In addition, when the clustering method is applied to gene selection, the resulting accuracies are close to those of the original integration approach.
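The general idea in the abstract (rank genes individually, cluster them into a user-specified number of groups, then keep one representative per group) can be illustrated with a minimal sketch. This is not the thesis's own algorithm, whose details are not given in this record. It assumes a two-class signal-to-noise score for ranking and plain K-means with a deterministic farthest-first start for clustering; the function names `gene_scores`, `kmeans_genes`, and `select_genes` are hypothetical.

```python
import math
from statistics import mean, pstdev


def gene_scores(expr, labels):
    """Score each gene by a two-class signal-to-noise ratio (an assumed ranking)."""
    scores = []
    for profile in expr:
        a = [v for v, y in zip(profile, labels) if y == 0]
        b = [v for v, y in zip(profile, labels) if y == 1]
        spread = pstdev(a) + pstdev(b)
        scores.append(abs(mean(a) - mean(b)) / (spread + 1e-9))
    return scores


def kmeans_genes(expr, k, seed_index, max_iter=100):
    """Plain K-means over gene profiles with deterministic farthest-first seeding."""
    centers = [list(expr[seed_index])]
    while len(centers) < k:
        # Next seed: the gene farthest from its nearest existing center.
        far = max(expr, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(far))
    assign = None
    for _ in range(max_iter):
        new_assign = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in expr
        ]
        if new_assign == assign:  # converged: no gene changed cluster
            break
        assign = new_assign
        for j in range(k):
            members = [p for p, a in zip(expr, assign) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [mean(col) for col in zip(*members)]
    return assign


def select_genes(expr, labels, k):
    """Return one gene index per cluster: the best-scoring member of each."""
    scores = gene_scores(expr, labels)
    top = max(range(len(expr)), key=scores.__getitem__)
    assign = kmeans_genes(expr, k, seed_index=top)
    chosen = []
    for j in range(k):
        members = [g for g, a in enumerate(assign) if a == j]
        if members:
            chosen.append(max(members, key=scores.__getitem__))
    return sorted(chosen)


if __name__ == "__main__":
    # Toy data: 6 genes (rows) measured on 4 samples (columns), two classes.
    expr = [
        [1.0, 1.1, 5.0, 5.1],
        [1.2, 1.0, 5.0, 4.8],
        [5.0, 5.2, 1.0, 1.2],
        [5.1, 5.0, 1.2, 1.1],
        [3.0, 3.1, 3.0, 3.1],
        [3.0, 3.0, 3.2, 3.4],
    ]
    labels = [0, 0, 1, 1]
    print(select_genes(expr, labels, k=3))
```

On this toy data, genes 0 and 1 are nearly collinear, as are genes 2 and 3, so requesting three clusters keeps only one gene from each redundant pair. Because the seeding is deterministic, repeated runs give the same clusters, which is the kind of stability the abstract emphasizes.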

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Research motivation
      1.2 Research objectives
      1.3 Research structure
    Chapter 2 Literature review
      2.1 Clustering algorithms
        2.1.1 Similarity measures
        2.1.2 Types of clustering methods
        2.1.3 K-means algorithm
      2.2 Feature selection methods
        2.2.1 Basic steps of feature selection
        2.2.2 Types of feature selection methods
        2.2.3 Feature selection for unsupervised learning
    Chapter 3 Research method design
      3.1 Liu algorithm
      3.2 Data set collection and data preprocessing
      3.3 Basic framework of the algorithm
        3.3.1 Determining the starting points of the algorithm
        3.3.2 The complete starting-point selection algorithm
      3.4 Parameter settings of the algorithm
      3.5 Evaluation of the algorithm
        3.5.1 Evaluation measures
        3.5.2 Algorithms compared with the MBK algorithm
          3.5.2.1 DBSCAN
          3.5.2.2 CCIA combined with K-means
        3.5.3 Items compared with the MBK algorithm
      3.6 Summary
    Chapter 4 Empirical study
      4.1 Execution process and environment of the algorithm
      4.2 Verifying the appropriateness of the parameter settings
      4.3 Empirical results
        4.3.1 Comparison of execution times
        4.3.2 Comparison of clustering results applied to the two-stage classification method
          4.3.2.1 Comparison of representative genes
          4.3.2.2 Comparison of accuracy after combining with classifiers
        4.3.3 Comparison with CCIA
      4.4 Comparison of the MBK algorithm with different numbers of clusters
      4.5 Summary
    Chapter 5 Conclusions and suggestions
      5.1 Conclusions
      5.2 Research directions and suggestions
    References

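    Chinese
    Liu (劉冠良) (2007). A gene selection method based on cluster analysis and distance measures (以叢集分析與距離測度為基礎之基因選取法), Master's thesis, Institute of Information Management, National Cheng Kung University.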

    English
    Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. (1998). Automatic subspace clustering of high dimensional data for data mining applications, Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pp.94-105, Seattle, Washington.
    Ankerst, M., Breunig, M. M., Kriegel, H. P., and Sander, J. (1999). OPTICS: ordering points to identify the clustering structure, ACM SIGMOD Record, Vol.28, No.2, pp.49-60.
    Dash, M., Choi, K., Scheuermann, P., and Liu, H. (2002). Feature Selection for Clustering - A Filter Solution, Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM'02), pp.115, Maebashi City, Japan.
    Dash, M. and Liu, H. (1997). Feature selection methods for classifications, Intelligent Data Analysis - An International Journal, Vol.13, pp.131-156.
    Davies, D. L. and Bouldin, D. W. (1979). A Cluster Separation Measure, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.1, No.2, pp.224-227.
    Devaney, M. and Ram, A. (1997). Efficient Feature Selection in Conceptual Clustering, Proceedings of the Fourteenth International Conference on Machine Learning, pp.92-97, Nashville, Tennessee, USA.
    Dy, J. G. and Brodley, C. E. (2000). Feature Subset Selection and Order Identification for Unsupervised Learning, Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, CA, USA.
    Dy, J. G. and Brodley, C. E. (2004). Feature Selection for Unsupervised Learning, The Journal of Machine Learning Research, Vol.5, pp.845-889.
    Ester, M., Kriegel, H. P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining, pp.226-231, Portland, Oregon.
    Estivill-Castro, V. and Lee, I. (2000). AUTOCLUST: Automatic Clustering via Boundary Extraction for Massive Point Data Sets. Proceedings of the 5th International Conference on Geocomputation, pp.23-25, University of Greenwich, Kent, UK.
    Guha, S., Rastogi, R., and Shim, K. (1998). CURE: an efficient clustering algorithm for large databases, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, pp.73-84, Seattle, Washington, USA.
    Guha, S., Rastogi, R., and Shim, K. (1999). ROCK: a robust clustering algorithm for categorical attributes. Proceedings of the 15th International Conference on Data Engineering, pp.512-521, Sydney, Australia.
    Hinneburg, A. and Keim, D. A. (1998). An efficient approach to clustering in large multimedia databases with noise. Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, pp.58-65, New York, USA.
    Jaeger, J., Sengupta, R., and Ruzzo, W. L. (2003). Improved gene selection for classification of microarrays, Pacific Symposium on Biocomputing, pp.53-64.
    Jain, A. K. and Dubes, R. C. (1988). Algorithms for clustering data, Englewood Cliffs: Prentice Hall.
    Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data clustering: a review, ACM Computing Surveys (CSUR), Vol.31, No.3, pp.264-323.
    Jiang, D., Tang, C., and Zhang, A. (2004). Cluster Analysis for Gene Expression Data: A Survey, IEEE Transactions on Knowledge and Data Engineering, Vol.16, No.11, pp.1370-1386.
    Karypis, G., Han, E. H., and Kumar, V. (1999). CHAMELEON: a hierarchical clustering algorithm using dynamic modeling, IEEE Computer, Vol.32, No.8, pp.68-75.
    Kaufman, L. and Rousseeuw, P. J. (1987). Clustering by means of Medoids, in Statistical Data Analysis Based on the L1-Norm and Related Methods, pp.405-416, North-Holland Publishing Company, Elsevier, Amsterdam.
    Kaufman, L. and Rousseeuw, P. J. (1990). Finding groups in data: an Introduction to cluster analysis, pp.342, John Wiley & Sons, New York.
    Khan, S. S. and Ahmad, A. (2004). Cluster center initialization algorithm for K-means clustering, Pattern Recognition Letters, Vol.25, pp.1293-1302.
    Liu, H. and Yu, L. (2005). Toward Integrating Feature Selection Algorithms for Classification and Clustering, IEEE Transactions on Knowledge and Data Engineering, Vol.17, No.4, pp.491-502.
    MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp.281-297.
    Mitra, P., Murthy, C. A., and Pal, S. K. (2002). Unsupervised Feature Selection Using Feature Similarity, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.24, No.3, pp.301-312.
    Qian, W. N. and Zhou, A. Y. (2002). Analyzing popular clustering algorithms from different viewpoints, Journal of Software, Vol.13, No.8, pp.1382-1394.
    Nguyen, D. V. and Rocke, D. M. (2002). Tumor classification by partial least squares using microarray gene expression data, Bioinformatics, Vol.18, pp.39-50.
    Sheikholeslami, G., Chatterjee, S., and Zhang, A. (1998). WaveCluster: a multiresolution clustering approach for very large spatial databases. Proceedings of the 24th VLDB Conference on Very Large Data Bases, pp.428-439, New York, USA.
    Wang, W., Yang, J., and Muntz, R. (1997). STING: A Statistical Information Grid Approach to Spatial Data Mining. Proceedings of the 23rd VLDB Conference on Very Large Data Bases, pp.186-195, Athens, Greece.
    Wismath, S. K., Soong, H. P., and Akl, S. G. (1981). Feature selection by interactive clustering, Pattern Recognition, Vol.14, pp.75-80.
    Xing, E. P., Jordan, M. I., and Karp, R. M. (2001). Feature selection for high-dimensional genomic microarray data, Proceedings of the Eighteenth International Conference on Machine Learning, pp.601-608, Williamstown, MA, USA.
    Xing, E. (2002). Feature Selection in Microarray Analysis, chapter 6, pp.110-131. Kluwer Academic Publishers.
    Xu, X., Ester, M., Kriegel, H. P., and Sander, J. (1998). A Nonparametric Clustering Algorithm for Knowledge Discovery in Large Spatial Databases, Proceedings of the Fourteenth International Conference on Data Engineering, Orlando, Florida, USA.
    Zhang, T., Ramakrishnan, R. and Livny, M. (1996). BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, pp.103-114, Montreal, Quebec, Canada.

    Full-text availability: on campus, public from 2009-07-04
    Off campus, public from 2009-07-04