| Graduate student: | 劉冠良 Liu, Kuan-Liang |
|---|---|
| Thesis title: | 以叢集分析與距離測度為基礎之基因選取法 A Subset gene selection method based on clustering analysis and distance measure |
| Advisor: | 翁慈宗 Wong, Tzu-Tsung |
| Degree: | Master |
| Department: | College of Management - Institute of Information Management |
| Publication year: | 2007 |
| Graduation academic year: | 95 (ROC calendar) |
| Language: | Chinese |
| Pages: | 61 |
| Keywords (Chinese): | 基因選取、基因微陣列、距離測度計量值、癌症分類、叢集分析 |
| Keywords (English): | gene selection, distance method, gene microarray, tumor classification, clustering |
With the Human Genome Project complete, the next challenge for biologists is to understand what tens of thousands of genes mean and how they relate to one another. Gene microarray technology stores the expression data of all genes on a tiny chip, so researchers can analyze all of them simultaneously. Compared with ordinary statistical data, however, microarray data have very high dimensionality and comparatively few samples, which has become a bottleneck for related research. The main objective of this study is therefore to select a representative set of genes for a specific problem. Many gene selection methods have been proposed from different perspectives, but in their execution or in their final results they still suffer from problems such as gene collinearity, lack of consideration for gene combinations, and overall computational complexity. The gene selection algorithm in this study is designed to address these problems. It first uses density-based clustering to partition the data structurally, and thereby filters out genes that show similar expression and have comparatively low individual gene ranking values. It then selects and substitutes genes in combination, guided by the distance measure (DM) value and assisted by a cluster similarity index. Considering the characteristics of gene expression data, the study uses a relation-based measure of similarity between genes. The proposed algorithm was applied to cancer classification data sets and improved the final classification accuracy. Gene sets whose DM value was raised through gene substitution also achieved higher classification accuracy, and the gene sets finally selected outperformed those chosen purely by individual gene ranking in both DM value and classification accuracy. This indicates that the DM value can indeed assist gene selection and that the proposed method yields more representative gene sets.
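The first stage described in the abstract (density-based clustering followed by removal of similar, lower-ranked genes) might be sketched roughly as follows. This is an illustration only, not the thesis's actual procedure: the abstract does not specify the ranking score, the similarity formula, or the clustering parameters, so the signal-to-noise score, the `1 - Pearson correlation` distance, the DBSCAN-style clustering, the "keep one gene per cluster" rule, and the values of `eps` and `min_pts` are all assumptions. The DM value and the substitution stage are not defined in the abstract and are left out.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx and syy else 0.0

def signal_to_noise(values, classes):
    """Assumed individual gene score: |mean0 - mean1| / (sd0 + sd1)."""
    groups = [[v for v, c in zip(values, classes) if c == k] for k in (0, 1)]
    means = [sum(g) / len(g) for g in groups]
    sds = [math.sqrt(sum((v - m) ** 2 for v in g) / len(g))
           for g, m in zip(groups, means)]
    denom = sds[0] + sds[1]
    return abs(means[0] - means[1]) / denom if denom else 0.0

def dbscan(n, dist, eps, min_pts):
    """Plain DBSCAN over n items; returns cluster ids, -1 marks noise."""
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist(i, j) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = [k for k in range(n) if dist(j, k) <= eps]
            if len(reach) >= min_pts:
                queue.extend(reach)  # core point: expand the cluster
    return labels

def filter_redundant(genes, classes, eps=0.2, min_pts=2):
    """Keep the top-scoring gene of each cluster plus all unclustered genes."""
    names = list(genes)
    scores = {g: signal_to_noise(genes[g], classes) for g in names}
    dist = lambda i, j: 1.0 - pearson(genes[names[i]], genes[names[j]])
    clusters = dbscan(len(names), dist, eps, min_pts)
    best, kept = {}, []
    for name, c in zip(names, clusters):
        if c == -1:
            kept.append(name)
        elif c not in best or scores[name] > scores[best[c]]:
            best[c] = name
    kept.extend(best.values())
    return sorted(kept, key=lambda g: -scores[g])

# Toy data (hypothetical): g1 and g2 are highly correlated, g3 is unrelated.
genes = {
    "g1": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],
    "g2": [2.0, 3.0, 1.5, 5.5, 6.5, 5.0],
    "g3": [5.0, 1.0, 3.0, 2.0, 4.0, 1.5],
}
classes = [0, 0, 0, 1, 1, 1]
selected = filter_redundant(genes, classes)
```

On this toy input, `g1` and `g2` fall into one cluster and only the higher-scoring `g1` survives, while `g3` is treated as noise and kept. In a real setting the distance threshold would need tuning, since it controls how aggressively correlated genes are merged.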