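| Graduate student: | 林榆嘉 Lin, Yu-Chia |
|---|---|
| Thesis title: | 以生物知識特性修復微陣列資料遺失值 Missing Value Estimation by Using Biological Knowledge in DNA Microarray Datasets |
| Advisor: | 曾新穆 Tseng, Vincent S. |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2007 |
| Academic year of graduation: | 95 (ROC academic year) |
| Language: | Chinese |
| Number of pages: | 58 |
| Keywords (Chinese): | 遺失值 (missing value), 資料探勘 (data mining), 基因表現 (gene expression) |
| Keywords (English): | missing value, gene expression, data mining |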
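In recent years, data mining has become a very popular technique for microarray analysis. Its goal is to extract useful knowledge from large volumes of gene expression data for researchers in biology to consult in their work. During a microarray experiment, however, improper handling or contamination of the chip often produces missing values in the data, and these missing values degrade the quality of subsequent analysis. Recovering missing values consists mainly of two steps: first, select the top k genes whose expression values are most similar to those of the gene containing the missing value; second, use the information from the selected genes to fill in the missing value. To date, most methods carry out the first step using only the expression values of the genes, and so they tend to ignore biological meaning. In this thesis, in addition to expression values, we propose a method, PGAKNN (Protein and Gene Annotation K Nearest Neighbors), that integrates expression values with gene and protein functional similarity and evaluates them together. Experiments on four yeast microarray datasets show that the proposed method recovers missing values more accurately than KNN and GOKNN (Gene Ontology KNN).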
Data mining has recently become a very popular approach to analyzing DNA microarray gene expression data, because it can extract useful knowledge from such data. However, gene expression data often contain missing values, which can significantly affect statistical analysis. Effective missing value estimation methods have been proposed to address this problem, but most imputation algorithms consider only the expression data when selecting neighbor genes. In this study, we use external information, namely functional and protein semantic similarity, to improve missing value estimation. Our new imputation algorithm, PGAKNN, combines functional and protein similarity as external information in the gene selection step. We compare it with existing estimation techniques, including K-nearest neighbors (KNN) and Gene Ontology KNN (GOKNN). Experimental results on yeast cDNA microarray datasets show that PGAKNN achieves better accuracy than KNN and GOKNN. A concise theoretical framework has also been formulated to validate the improved performance of our imputation algorithm.
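The two-step procedure outlined in the abstract (select the k most similar genes, then fill in the missing value from them) can be sketched as follows. This is a minimal illustration, not the thesis's PGAKNN implementation: the function name `knn_impute`, the `annot_sim` matrix, the `alpha` weight, and the way annotation similarity shrinks the expression distance are all assumptions made here, since the abstract does not specify how the similarities are combined.

```python
import numpy as np

def knn_impute(expr, annot_sim=None, k=2, alpha=0.5):
    """Fill NaNs in a genes x conditions matrix from the k nearest genes.

    annot_sim is a hypothetical genes x genes similarity matrix in [0, 1]
    (for example, derived from functional annotations). When given, it
    shrinks the expression distance between annotated-similar genes.
    This combination rule is an assumption, not the thesis's formula.
    """
    expr = np.asarray(expr, dtype=float)
    filled = expr.copy()
    n_genes = expr.shape[0]
    for g in range(n_genes):
        missing = np.isnan(expr[g])
        if not missing.any():
            continue
        # Step 1: rank the other genes by distance to gene g.
        candidates = []
        for h in range(n_genes):
            if h == g:
                continue
            # Compare only conditions observed in both genes.
            shared = ~np.isnan(expr[g]) & ~np.isnan(expr[h])
            if not shared.any():
                continue
            diff = expr[g, shared] - expr[h, shared]
            dist = float(np.sqrt(np.mean(diff ** 2)))
            if annot_sim is not None:
                dist *= 1.0 - alpha * annot_sim[g, h]
            candidates.append((dist, h))
        candidates.sort()
        # Step 2: inverse-distance weighted average of the k nearest
        # genes that have an observed value in the missing condition.
        for c in np.where(missing)[0]:
            near = [(d, h) for d, h in candidates
                    if not np.isnan(expr[h, c])][:k]
            if not near:
                continue  # no usable neighbor; leave the value missing
            weights = np.array([1.0 / (d + 1e-9) for d, _ in near])
            values = np.array([expr[h, c] for _, h in near])
            filled[g, c] = np.sum(weights * values) / np.sum(weights)
    return filled
```

Calling `knn_impute(expr, k=5)` with `annot_sim=None` reduces to an expression-only KNN baseline, while passing a similarity matrix lets annotation knowledge change which genes are selected in the first step. Distances are computed only over conditions observed in both genes, so rows with several missing entries can still be compared.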