Author: Lin, Yu-Chia (林榆嘉)
Thesis title: Missing Value Estimation by Using Biological Knowledge in DNA Microarray Datasets (以生物知識特性修復微陣列資料遺失值)
Advisor: Tseng, Vincent S. (曾新穆)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: Chinese
Number of pages: 58
Keywords: missing value, data mining, gene expression
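  • Data mining has in recent years become a very popular technique in microarray analysis. Its aim is to extract useful knowledge from large volumes of gene expression data and offer it to researchers in biology as a reference for their work. During a microarray experiment, however, improper manual handling or contamination of the chip often produces missing values in the data, and their presence degrades the quality of subsequent analysis. Missing value recovery can be divided into two main steps: first, select the top k genes whose expression values are most similar to those of the gene containing the missing value; second, use the information from the selected genes to fill in the missing value. To date, most methods carry out the first step using only the expression values of the genes, and so tend to ignore biological meaning. In this thesis, in addition to expression values, we propose a method, PGAKNN (Protein and Gene Annotation K Nearest Neighbors), which integrates expression values with gene and protein functional similarity and evaluates them together. In experiments on four yeast microarray datasets, the proposed method recovered missing values more accurately than both KNN and GOKNN (Gene Ontology KNN).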
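    The two-step recovery procedure outlined in this abstract (select the k most similar genes, then fill the gap from them) can be sketched for the expression-only KNN baseline as follows. This is a minimal illustration, not the thesis implementation: the function names, the use of `None` to mark a missing value, the root-mean-square distance over shared columns, and the inverse-distance weighting are all assumptions made for the example.

```python
import math

def distance_on_shared(a, b):
    """Root-mean-square distance over columns observed in both profiles (assumed metric)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))

def knn_impute(matrix, gene, column, k=2):
    """Estimate matrix[gene][column] from the k genes with the closest expression profiles."""
    target = matrix[gene]
    # Step 1: rank the genes that have an observed value in the missing column.
    candidates = []
    for i, row in enumerate(matrix):
        if i == gene or row[column] is None:
            continue
        d = distance_on_shared(target, row)
        if math.isfinite(d):
            candidates.append((d, row[column]))
    candidates.sort(key=lambda c: c[0])
    neighbours = candidates[:k]
    if not neighbours:
        raise ValueError("no usable neighbours for this gene and column")
    # Step 2: fill the gap with an inverse-distance weighted average (assumed weighting).
    eps = 1e-9  # avoids division by zero for identical profiles
    weights = [1.0 / (d + eps) for d, _ in neighbours]
    return sum(w * v for w, (_, v) in zip(weights, neighbours)) / sum(weights)

# Toy data: rows are genes, columns are samples, None marks a missing value.
toy = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 5.0],
    [9.0, 9.0, 100.0],
]
estimate = knn_impute(toy, gene=0, column=2, k=2)
```

    In the toy matrix the two genes whose observed values match the target contribute equally, while the distant fourth gene is excluded by the choice of k.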

    Data mining has recently become a very popular technique for analyzing DNA microarray gene expression data, where it is used to extract useful knowledge. However, gene expression data often contain missing values, which can significantly affect statistical analysis. Effective missing value estimation methods have been proposed to address this problem, but most imputation algorithms consider only the expression data when selecting genes. In this study, we use external information on functional and protein semantic similarity to improve missing value estimation, combining the two similarities as external information in the gene selection process. The new imputation algorithm (PGAKNN) is compared with existing estimation techniques, including K-nearest neighbors (KNN) and Gene Ontology KNN (GOKNN). Experimental results on yeast cDNA microarray datasets show that PGAKNN achieves better accuracy than KNN and GOKNN. A concise theoretical framework has also been formulated to validate the improved performance of the imputation algorithm.
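    One way to picture how external annotation could enter the gene selection step is a blended dissimilarity score. This record does not state how PGAKNN actually combines its three signals, so the linear weighting below, the function name `select_neighbours`, and the weight parameters are purely hypothetical; the sketch only illustrates the general idea of ranking candidate genes on expression distance together with gene and protein semantic similarity.

```python
def select_neighbours(expr_dist, gene_sim, protein_sim, k,
                      w_expr=0.5, w_gene=0.25, w_prot=0.25):
    """Return the indices of the k candidates with the smallest blended dissimilarity.

    expr_dist:   expression distances, assumed already scaled to [0, 1]
    gene_sim:    gene annotation similarities in [0, 1] (1 = most similar)
    protein_sim: protein annotation similarities in [0, 1] (1 = most similar)
    The linear blend and its weights are illustrative assumptions.
    """
    if not (len(expr_dist) == len(gene_sim) == len(protein_sim)):
        raise ValueError("all three score lists must have the same length")
    scored = []
    for i, (d, gs, ps) in enumerate(zip(expr_dist, gene_sim, protein_sim)):
        # Similarities are turned into dissimilarities so that lower is always better.
        score = w_expr * d + w_gene * (1.0 - gs) + w_prot * (1.0 - ps)
        scored.append((score, i))
    scored.sort()  # ties fall back to the lower candidate index
    return [i for _, i in scored[:k]]

# Candidates 0 and 1 are equally close in expression, but only 1 shares annotation.
expr_dist = [0.2, 0.2, 0.9]
gene_sim = [0.1, 0.9, 0.9]
protein_sim = [0.1, 0.8, 0.9]
blended = select_neighbours(expr_dist, gene_sim, protein_sim, k=2)
expression_only = select_neighbours(expr_dist, gene_sim, protein_sim, k=2,
                                    w_expr=1.0, w_gene=0.0, w_prot=0.0)
```

    Setting the annotation weights to zero reduces the ranking to an expression-only selection, which makes it easy to compare the two neighbour sets on the same candidates.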

    Chapter 1 Introduction
        1.1 Research Background
        1.2 Research Motivation
        1.3 Research Objectives
        1.4 Research Methods
        1.5 Contributions
        1.6 Thesis Organization
    Chapter 2 Related Work
        2.1 Definition and Causes of Missing Values
        2.2 Categories and Types of Missing Values
        2.3 Handling Missing Values in Microarray Data
            2.3.1 Deleting Entire Gene Records That Contain Missing Values
            2.3.2 Deleting Columns That Contain Missing Values
            2.3.3 Row Average
            2.3.4 KNN (K Nearest Neighbors)
            2.3.5 LLS (Local Least Square Imputation)
            2.3.6 KRCOV (K-Ranked Covariance-based)
            2.3.7 CMVE (Collateral Missing Value Estimation)
            2.3.8 GOKNN (Gene Ontology K Nearest Neighbors)
        2.4 Bioinformatics Databases
            2.4.1 Gene Ontology
                2.4.1.1 Basic Structure of the Gene Ontology
            2.4.2 ExPASy (the Expert Protein Analysis System)
            2.4.3 Gene Ontology Annotation (GOA) Database
        2.5 Summary
    Chapter 3 Research Method and Design
        3.1 Method Concept
        3.2 Notation
        3.3 Distance Measurements
        3.4 Gene Semantic Similarity Calculation
            3.4.1 GO Term Weight Calculation
            3.4.2 Gene Semantic Similarity Calculation
        3.5 Protein Similarity Calculation
        3.6 Missing Value Imputation Method
        3.7 Method: PGAKNN (Protein and Gene Annotation K Nearest Neighbors)
    Chapter 4 Experimental Analysis
        4.1 Description of the Experimental Datasets
        4.2 Experimental Objectives and Design
        4.3 Performance Evaluation
            4.3.1 Normalized Root Mean Square
            4.3.2 Improvement Rate
        4.4 Experimental Results
            4.4.1 Experimental Data Settings
                4.4.1.1 GAKNN (Gene Annotation KNN)
                4.4.1.2 PAKNN (Protein Annotation KNN)
            4.4.2 Experiment 1: Error Improvement Rate
            4.4.3 Experiment 2: Varying the Missing Rate
            4.4.4 Experiment 3: Top k and si
            4.4.5 Experiment 4: Varying the Sample Size
            4.4.6 Experiment 5: Selection Order
            4.4.7 Experiment 6: Protein Annotation
            4.4.8 Experiment 7: Ontology
    Chapter 5 Conclusions and Future Directions
        5.1 Conclusions
        5.2 Future Research Directions
    Vita


    Full text availability: on campus, public from 2008-08-28; off campus, public from 2008-08-28.