| Graduate student: | Chen, Lien-Chin (陳連進) |
|---|---|
| Thesis title: | A Correlation-Based Approach for Validating Gene Expression Clustering (以關聯度為基礎的基因表現叢集驗證之方法) |
| Advisor: | Tseng, Shin-Mu (曾新穆) |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering |
| Year of publication: | 2002 |
| Academic year of graduation: | 90 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 64 |
| Keywords (Chinese): | 關聯度基礎相似度、基因微陣列、基因表現分析、叢集驗證、叢集 |
| Keywords (English): | correlation-based similarity, microarray, clustering validation, gene expression analysis, clustering |
This research explores correlation-based clustering validation methods suited to gene expression analysis. In biological analysis, a clustering algorithm is typically applied first to partition genes into groups that exhibit similar patterns of variation in expression level, and clustering validation methods are then applied to evaluate the results. However, most similarity measures used in existing cluster analysis are distance-based. In practice, biologists want the genes in a cluster to share a similar expression trend rather than the same expression values. This motivates the use of correlation-based clustering and validation indices in this study.
In this thesis, we present an automatic clustering validation system that guides the user in choosing a suitable validation index during cluster analysis. We developed a volumetric-clouds type cluster generator to synthesize datasets of various types, and evaluated a number of correlation-based validation indices on how well they measure the quality of clustering results. The system can therefore effectively suggest the best validation index for the different types of datasets that users provide.
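The contrast between distance-based and correlation-based similarity can be illustrated with a minimal sketch. It assumes Pearson correlation as the correlation-based measure and Euclidean distance as the distance-based one. The abstract does not state which specific measures the thesis uses, and the function names and expression profiles here are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def euclidean(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical profiles over five time points (illustrative values only).
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [10.0, 20.0, 30.0, 20.0, 10.0]  # same trend as gene_a, larger values
gene_c = [3.0, 2.0, 1.0, 2.0, 3.0]       # opposite trend, similar values

for name, other in (("gene_b", gene_b), ("gene_c", gene_c)):
    print(name,
          "correlation:", round(pearson(gene_a, other), 3),
          "distance:", round(euclidean(gene_a, other), 3))
```

In this toy data, Euclidean distance places `gene_a` closer to `gene_c`, whose trend is the opposite, while correlation pairs `gene_a` with `gene_b`, whose trend is the same. That is the behavior the abstract describes biologists as wanting.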