
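Graduate student: Jian, Jhih-Ji (簡志吉)
Thesis title: A gene selection method based on risk gene set using Bayesian model averaging (應用致病基因於貝氏模型平均法之基因選取)
Advisor: Wong, Tzu-Tsung (翁慈宗)
Degree: Master
Department: College of Management - Department of Industrial and Information Management
Year of publication: 2009
Academic year of graduation: 97 (ROC calendar)
Language: Chinese
Number of pages: 49
Chinese keywords: 貝氏模型平均法 (Bayesian model averaging), 基因微陣列資料 (microarray data), 基因選取 (gene selection), 致病基因 (risk gene)
English keywords: Bayesian model averaging, gene selection, microarray data, risk gene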
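  • Scientists use microarray data to understand what tens of thousands of genes represent and how they relate to one another. Microarray data, however, are high-dimensional with small sample sizes, which makes classification difficult. Many gene selection methods have been proposed to address these problems, but most of them do not use any known disease-related genes as a reference during selection. Meanwhile, as gene research techniques have advanced, information linking genes to diseases has steadily accumulated in medicine and biology. This study modifies the iterative Bayesian model averaging method and incorporates prior information on risk genes, in order to reduce the computational complexity of the subsequent gene selection and to improve prediction accuracy. To avoid the overfitting that can arise from relying on a single selected gene set, the uncertainty across multiple selected gene sets is taken into account, and each gene set carries a posterior probability to support later decisions. Gene selection proceeds in three stages. The first stage removes genes that are functionally similar to the risk genes. The second stage clusters the remaining genes. The third stage picks a representative gene from each cluster as a candidate and performs gene selection on the candidates together with the risk genes; the gene set retained in each round is recorded along with its posterior probability, and the procedure repeats until the desired number of models is reached. Unlike most gene selection methods, which end with a single gene set, this study produces multiple gene sets with their posterior probabilities for later decisions, and evaluates gene selection and classification on four disease-specific data sets. The results show that classification accuracy improved only slowly on two of the four data sets and improved significantly on the other two.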
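    The idea of keeping several gene sets, each with its own posterior probability, can be illustrated with a small Python sketch. It is not the thesis's implementation: the unnormalized log scores, the function names, and the weighted-average prediction rule are assumptions made for illustration, and this record does not say how the thesis computes its posterior probabilities.

```python
import math


def normalize_posteriors(log_scores):
    """Turn unnormalized log model scores into probabilities that sum to 1.

    The log scores are hypothetical inputs, one per selected gene set.
    Subtracting the maximum first keeps the exponentials numerically stable.
    """
    top = max(log_scores)
    weights = [math.exp(score - top) for score in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


def averaged_probability(model_probs, posteriors):
    """Posterior-weighted average of each model's predicted class probability."""
    return sum(p * w for p, w in zip(model_probs, posteriors))


# Three hypothetical gene-set models: two equally plausible, one less so.
posteriors = normalize_posteriors([-10.0, -10.0, -11.0])

# Each model's predicted probability that a sample belongs to the disease class.
combined = averaged_probability([0.9, 0.2, 0.6], posteriors)
```

    Averaging over models in this way means that no single selected gene set decides the prediction alone, which is the motivation the abstract gives for recording several gene sets.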

    Gene selection and clustering techniques are commonly applied to microarray data. However, most of them do not consider the risk genes reported in biological studies. The proposed method modifies the iterative Bayesian model averaging algorithm and uses the risk gene set as prior information for gene selection. One major advantage of the method is that it considers more than one model to avoid overfitting. The method has three stages. In the first stage, genes highly correlated with any risk gene are removed. In the second stage, the remaining genes are divided into clusters. In the final stage, a representative gene is chosen from each cluster to form a candidate gene set, and the modified iterative Bayesian model averaging algorithm then selects the genes in the candidate set that are suitable for deriving a regression model together with the risk genes. The method is tested on four well-known gene expression data sets for breast cancer and prostate cancer. The experimental results show that its prediction accuracy is better than or similar to that of methods proposed in other studies.
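    The candidate-building part of the three stages can be sketched in plain Python as follows. This is a rough illustration under stated assumptions, not the thesis's code: the correlation threshold, the plain k-means seeded with the first k genes, the "nearest to the cluster center" rule for picking a representative, and every function name are hypothetical. The final model-averaging selection step is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation of two expression profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def remove_redundant(expr, risk_genes, threshold=0.9):
    """Stage 1: keep non-risk genes whose |correlation| with every risk gene
    stays below the (assumed) threshold."""
    return [gene for gene, profile in expr.items()
            if gene not in risk_genes
            and all(abs(pearson(profile, expr[r])) < threshold for r in risk_genes)]


def kmeans(expr, genes, k, iterations=20):
    """Stage 2: plain k-means over gene profiles, seeded with the first k genes."""
    centers = [list(expr[g]) for g in genes[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for g in genes:
            dists = [math.dist(expr[g], c) for c in centers]
            clusters[dists.index(min(dists))].append(g)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                profiles = [expr[m] for m in members]
                centers[i] = [sum(col) / len(members) for col in zip(*profiles)]
    return clusters, centers


def representatives(expr, clusters, centers):
    """Stage 3, first half: take the gene nearest each non-empty cluster's center."""
    return [min(members, key=lambda g: math.dist(expr[g], center))
            for members, center in zip(clusters, centers) if members]


# Toy expression profiles (gene -> values across four samples); all made up.
expr = {
    "risk1": [1.0, 2.0, 3.0, 4.0],
    "g1": [2.0, 4.0, 6.0, 8.1],   # nearly a copy of risk1, removed in stage 1
    "g2": [5.0, 1.0, 4.0, 2.0],
    "g3": [5.1, 1.1, 4.2, 2.0],
    "g4": [0.0, 9.0, 0.0, 9.0],
    "g5": [0.2, 8.8, 0.1, 9.1],
}
kept = remove_redundant(expr, {"risk1"})
clusters, centers = kmeans(expr, kept, k=2)
candidates = representatives(expr, clusters, centers)
```

    In this toy run the candidate set would then be passed, together with the risk genes, to the model-averaging selection step that the abstract describes.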

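    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Chapter 1 Introduction
    1.1 Research motivation
    1.2 Research objectives
    1.3 Thesis organization
    Chapter 2 Literature review
    2.1 Microarrays
    2.1.1 Microarray data types
    2.2 Cluster analysis
    2.2.1 K-means algorithm
    2.2.2 Mitra-based K-means algorithm
    2.3 Feature selection
    2.3.1 Feature selection without prior information
    2.3.1.1 Individual gene ranking
    2.3.1.2 Combined gene ranking
    2.3.2 Feature selection with prior information
    2.4 Iterative Bayesian model averaging
    2.5 Classification and evaluation methods
    2.5.1 K-nearest neighbors
    2.5.2 Support vector machines
    2.5.3 Cross-validation
    Chapter 3 Research method
    3.1 Gene selection framework and description
    3.1.1 Stage 1: removing redundant genes similar to the risk genes
    3.1.2 Stage 2: gene clustering and ranking
    3.1.3 Stage 3: mBMA using the risk gene set
    3.2 Evaluation procedure
    Chapter 4 Empirical study
    4.1 Data collection and preparation
    4.2 Parameter settings
    4.3 Empirical results
    4.3.1 Classification accuracy for values of q and R
    4.3.2 Comparison of classification accuracy with other methods
    4.3.3 Cluster membership of the genes selected by different models
    4.3.4 Ranking of the selected genes by BW
    4.4 Summary
    Chapter 5 Conclusions and suggestions
    5.1 Conclusions
    5.2 Suggestions
    References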


    Full text, on campus: public from 2012-07-20
    Full text, off campus: public from 2012-07-20