
Graduate student: 程瀚德 (Cheng, Han-De)
Thesis title: Missing Value Estimation for Microarray Gene Expression Data by Hybrid Local Least Squares Imputation
Original title: 藉由混和最小平方解演算法來修復微陣列基因序列資料的遺失值
Advisor: 莊哲男 (Juang, Jer-Nan)
Degree: Master
Department: College of Engineering, Department of Engineering Science
Year of publication: 2011
Academic year of graduation: 99
Language: English
Number of pages: 47
Keywords: Microarray, missing values, weighted least-squares, semantic similarity, gene ontology annotation
  • Gene expression microarray data are widely used in biological analyses. However, such data often contain missing values, which arise for various reasons and affect most gene expression analysis algorithms that require complete data, such as clustering, classification, and biological network construction. Reconstructing microarray data and improving the accuracy of the estimated values is therefore an important problem. Most imputation algorithms proceed in two steps. In the first step, the k genes that have no missing values and are most similar to the target gene are selected. In the second step, the selected genes are used to estimate the missing values, with the estimation method differing from algorithm to algorithm. In this thesis, we first use semantic similarity computed from gene ontology annotations to select the similar genes for every target gene that contains missing values. We then propose a new method that combines the ideas of iterated local least squares imputation (ILLSimpute) and sequential local least squares imputation (SLLSimpute) to estimate the missing values. Numerical simulations on four microarray datasets show that the proposed method achieves better accuracy than other imputation methods currently in use.
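    The two-step procedure described in the abstract can be illustrated with a minimal sketch of plain local least squares imputation. This is not the hybrid method proposed in the thesis: it assumes Euclidean distance instead of gene ontology semantic similarity, and it uses neither iteration, sequential reuse of imputed genes, nor weights. The function name `lls_impute_gene` and the toy matrix are hypothetical, chosen only for illustration.

```python
import numpy as np

def lls_impute_gene(data, target_idx, k=3):
    """Fill the NaN entries of one gene (row) by local least squares.

    Illustrative sketch only: similarity is plain Euclidean distance,
    not the semantic similarity used in the thesis.
    """
    target = data[target_idx]
    missing = np.isnan(target)
    if not missing.any():
        return target.copy()
    observed = ~missing

    # Step 1: candidate genes are rows without any missing value;
    # keep the k candidates closest to the target on its observed columns.
    complete = np.where(~np.isnan(data).any(axis=1))[0]
    dists = np.linalg.norm(
        data[complete][:, observed] - target[observed], axis=1
    )
    neighbors = complete[np.argsort(dists)[:k]]

    # Step 2: fit the target's observed values as a linear combination
    # of the neighbors, then apply the same coefficients to the
    # neighbors' values in the missing columns.
    A = data[neighbors][:, observed].T   # shape (n_observed, k)
    B = data[neighbors][:, missing].T    # shape (n_missing, k)
    coeffs, *_ = np.linalg.lstsq(A, target[observed], rcond=None)

    filled = target.copy()
    filled[missing] = B @ coeffs
    return filled

# Toy example: gene 2 equals 2 * gene 0 + 1 * gene 1, with one value hidden.
data = np.array([
    [1.0, 0.0, 2.0, 1.0, 3.0],
    [0.0, 1.0, 1.0, 2.0, 1.0],
    [2.0, 1.0, 5.0, 4.0, np.nan],
])
filled = lls_impute_gene(data, target_idx=2, k=2)
```

    In the toy matrix the target gene is an exact linear combination of the two complete genes, so the least-squares fit recovers the hidden value exactly. Real microarray data are noisy, and the choice of k and of the similarity measure then affects accuracy.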

    Chinese Abstract
    Abstract
    Acknowledgment
    Table of contents
    List of Tables
    List of Figures
    1 Introduction
      1.1 Background
      1.2 Motivation
      1.3 Method and Goal
      1.4 Organization of the Thesis
    2 Existing Methods
      2.1 Definition and Causes of Missing Values
      2.2 Common Estimation Methods of Similarity
        2.2.1 Euclidean Distance
        2.2.2 Pearson Correlation Coefficient
      2.3 Various Methods for Estimating Missing Values in DNA Microarrays
        2.3.1 Deleting the Rows Containing Missing Values
        2.3.2 Row Average
        2.3.3 K-nearest Neighbors (KNN)
        2.3.4 Least Squares Imputation (LSimpute)
          LSimpute gene approach
          LSimpute array approach
        2.3.5 Local Least Squares Imputation (LLSimpute)
        2.3.6 Iterated Local Least Squares Imputation (ILLSimpute)
        2.3.7 Sequential Local Least Squares Imputation (SLLSimpute)
        2.3.8 Gene Ontology Local Least Squares Imputation (GOLLS)
      2.4 Gene Ontology
        2.4.1 The Structure of Gene Ontology
      2.5 Summary
    3 A New Method: HLLS
      3.1 Concept of the Method
      3.2 Definition of Symbols
      3.3 Sorting Target Genes
      3.4 Distance Measurements
      3.5 Gene Semantic Dissimilarity
        3.5.1 Giving Weights to GO terms
        3.5.2 Semantic Dissimilarity
        3.5.3 Selecting the Weight of Semantic Dissimilarity
      3.6 Imputation for Recovering Missing Values
        3.6.1 Minimum-Norm Solution and Least-Squares Solution
        3.6.2 Weighted Least-Squares Method
      3.7 Iteration of the Whole Procedure
      3.8 Flow Path of the Algorithm
    4 Numerical Simulations
      4.1 Introduction of the Datasets
      4.2 Design and Purpose of Simulations
      4.3 Normalized Root Mean Square (NRMSE)
      4.4 Hybrid Approaches
        4.4.1 Iterated Sequential Local Least Squares (ISLLS)
        4.4.2 Sequential Weighted Least Squares (SWLS)
      4.5 Simulation Results
        4.5.1 Simulation 1: Number of Iterations
        4.5.2 Simulation 2: Best Hybrid Approach
        4.5.3 Simulation 3: Best Gene Ontology
        4.5.4 Simulation 4: Comparison of All Methods
    5 Concluding Remarks
    References
    Vita

    Full text availability: on campus, public from 2012-08-10
    Off campus, public from 2012-08-10