Graduate student: 周咏廸 (Chou, Yung-Ti)
Thesis title: The statistical evaluation of imputing NGS data by using EM-algorithm of multivariate Poisson distribution
Advisor: 莊哲男 (Juang, Jer-Nan)
Co-advisor: 馬瀰嘉 (Ma, Mi-Chia)
Degree: Master
Department: College of Engineering - Department of Engineering Science
Year of publication: 2016
Academic year of graduation: 105 (ROC calendar)
Language: English
Number of pages: 44
Keywords: Expectation-Maximization algorithm, Generalized estimating equation, Next Generation Sequencing, missing value
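  • This thesis uses an EM algorithm for the multivariate Poisson distribution to impute missing values in next-generation sequencing data, and simulates two kinds of missing data to test its performance: data with values missing at random, and data with whole columns missing. Data provided by Professor H. Sunny Sun (孫孝芳) serve as a real example. On the simulated data with randomly missing values, the EM algorithm is compared with k-nearest neighbor (KNN) imputation and the least squares method. On the simulated data with whole columns missing, the EM algorithm is compared with transposed k-nearest neighbor imputation and regression imputation, and generalized estimating equations are used to estimate the true positive rate and false positive rate. On the real data, the EM algorithm is compared with transposed k-nearest neighbor imputation and regression imputation, and generalized estimating equations are used to infer disease-causing genes. The results show that the EM algorithm is more effective than the other methods.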

    This thesis presents an expectation-maximization (EM) algorithm, based on the multivariate Poisson distribution, for imputing missing values in next-generation sequencing (NGS) data. We set up two scenarios and conducted simulation studies to examine the performance of imputation with the EM algorithm and with other imputation methods. Two kinds of data were simulated: one with randomly selected values removed as missing values, and the other with whole columns removed as missing values. We then applied the methods to real data provided by Professor H. Sunny Sun. In the data with randomly removed values, the EM algorithm had higher accuracy rates than the weighted adjusted k-nearest neighbor (KNN) method and the least squares method. In the data with whole columns removed, the EM algorithm had higher accuracy rates than the transpose k-nearest neighbor (tKNN) and linear imputation methods. In that scenario we also used generalized estimating equations (GEE) to estimate the true positive rates and false positive rates. In the real data, we compared the EM algorithm with the tKNN and linear imputation methods, and used GEE to identify disease-causing genes. The results indicate that the proposed EM algorithm imputes NGS data under a multivariate Poisson distribution more effectively than the other methods.
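    The fill-and-re-estimate idea behind EM imputation of count data can be illustrated with a minimal Python sketch. It is not the thesis's multivariate Poisson model: it assumes, purely for illustration, that each count is an independent Poisson variable with rate a_i * b_j (a row effect times a column effect). The function name `em_poisson_impute` and its parameters are hypothetical. Under this assumed model the E-step replaces each missing count with its current expected rate, and the M-step re-estimates the rates from the completed table.

```python
import numpy as np


def em_poisson_impute(counts, max_iter=200, tol=1e-8):
    """Fill NaN entries of a count matrix by EM.

    Assumption (illustrative only, not the thesis's model):
    x[i, j] ~ Poisson(a[i] * b[j]), independently for every cell.
    """
    x = np.asarray(counts, dtype=float)
    missing = np.isnan(x)
    # Initialise every missing cell with the mean of the observed cells.
    filled = np.where(missing, np.nanmean(x), x)
    for _ in range(max_iter):
        # M-step: with a complete table, the maximum-likelihood rate of
        # cell (i, j) under the multiplicative model is
        # (row sum * column sum) / grand total.
        row = filled.sum(axis=1, keepdims=True)
        col = filled.sum(axis=0, keepdims=True)
        rate = row * col / filled.sum()
        # E-step: the expected value of a missing Poisson count is its
        # current rate; observed cells keep their original values.
        updated = np.where(missing, rate, x)
        converged = np.max(np.abs(updated - filled)) < tol
        filled = updated
        if converged:
            break
    return filled


# Toy check on a noise-free table that follows the assumed model exactly.
rates = np.outer([1.0, 2.0, 3.0], [2.0, 4.0, 6.0, 8.0])
holed = rates.copy()
holed[1, 2] = np.nan  # remove one value, as in the random-missing scenario
print(em_poisson_impute(holed)[1, 2])
```

    On this toy table the printed value is close to 12, the entry that was removed. The sketch only addresses the randomly-missing scenario: if a whole column has no observed values, its column effect cannot be estimated under this simplified model, and the result depends on the starting fill.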

    Chinese Abstract
    Abstract
    Acknowledgements
    Contents
    List of Tables
    List of Figures
    1. Introduction
    2. Literature Review
        2.1 Reads Per Kilobase Million (RPKM)
        2.2 Linear Imputation Method
        2.3 Weighted KNN (K-Nearest Neighbor) Method
        2.4 Least Squares (LS) Method
        2.5 Expectation-Maximization (EM) Algorithm
        2.6 Generalized Estimating Equation (GEE)
    3. Materials and Methods
        3.1 Materials
        3.2 Imputation under Multivariate Normal Distribution
        3.3 Imputation under Multivariate Poisson Distribution
        3.4 Goodness-of-fit Test
    4. Simulation
        4.1 Simulation Process
        4.2 Simulation Results and Analysis
    5. Conclusion and Discussion
    References
    Appendix

    Full-text availability: on campus, open immediately; off campus, open immediately