| Graduate student: | 楊承翰 Yang, Cheng-Han |
|---|---|
| Thesis title: | 第I型零膨脹卜瓦松模型分析微生物基因組數據 Type I Multivariate Zero-Inflated Poisson Model for Analyzing Microbial Metagenomics Data |
| Advisor: | 馬瀰嘉 Ma, Mi-Chia |
| Degree: | Master |
| Department: | 管理學院 - 統計學系 College of Management - Department of Statistics |
| Year of publication: | 2022 |
| Academic year of graduation: | 110 |
| Language: | Chinese |
| Pages: | 31 |
| Keywords (Chinese): | 菌相分析, 分群, 第I型多變量零膨脹, 主成份分析 |
| Keywords (English): | Microflora analysis, Clustering analysis, Type I Multivariate Zero-Inflated Poisson, Principal component analysis |
Next-generation sequencing (NGS), also known as high-throughput sequencing, has been under development for many years. Compared with traditional techniques, it can sequence hundreds of thousands or even millions of nucleic acid molecules at a time, which allows researchers to align and analyze genome sequences. Many studies have confirmed that numerous human traits and diseases are closely related to specific genes, so effectively identifying specific pathogenic genes is a main goal of research. Many microbiota analysis studies approach the problem from a statistical point of view. For example, Holmes et al. (2012) assumed that the microbial community follows the Dirichlet-multinomial distribution. However, microbial community data are both discrete and sparse. This study therefore takes these two properties into account and builds a type I multivariate zero-inflated Poisson model (I-ZIP) suited to the characteristics of the data.
Principal component analysis (PCA) is used to reduce the dimensionality of the high-dimensional genetic data. Because some microorganisms tend to occur together, we first place genes with high zero correlation in the same group and then, once the number of groups has been determined, cluster the subjects. Finally, statistical simulation is used to generate type I multivariate zero-inflated Poisson data, and the clustering model proposed in this study is compared with the commonly used K-means algorithm and the hierarchical clustering algorithm. The results show that the proposed method remains relatively effective even as the overlap between clusters increases, although in a few simulation runs the parameter estimates increased sharply. On the real data, the proposed clustering method achieves the same accuracy as the approach based on the Dirichlet-multinomial assumption.
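The simulation step mentioned above can be illustrated with a minimal sketch. It assumes the usual construction of the type I multivariate zero-inflated Poisson distribution described by Liu and Tian (2015): a count vector is entirely zero with probability `phi`, and otherwise its components are independent Poisson variables with rates `lams`. The abstract does not state the thesis's own parameterization or simulation settings, so the function names and the parameter values below are hypothetical and chosen only for illustration.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method (suitable for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_izip(phi, lams, rng):
    """One I-ZIP draw: the whole vector is zero with probability phi, else independent Poissons."""
    if rng.random() < phi:
        return [0] * len(lams)
    return [sample_poisson(lam, rng) for lam in lams]


def izip_log_pmf(y, phi, lams):
    """Log-probability of count vector y under the assumed I-ZIP form (0 < phi < 1, lams > 0)."""
    log_pois = sum(
        -lam + yj * math.log(lam) - math.lgamma(yj + 1) for yj, lam in zip(y, lams)
    )
    if any(y):
        # A vector with any non-zero count can only come from the Poisson part.
        return math.log(1 - phi) + log_pois
    # The all-zero vector mixes the structural zero and the Poisson zero.
    return math.log(phi + (1 - phi) * math.exp(log_pois))


def prob_all_zero(phi, lams):
    """Theoretical probability that every component is zero."""
    return phi + (1 - phi) * math.exp(-sum(lams))


# Hypothetical parameter values, not taken from the thesis.
rng = random.Random(0)
phi, lams = 0.3, [1.0, 2.5, 0.5]
draws = [sample_izip(phi, lams, rng) for _ in range(20000)]
empirical_zero = sum(1 for y in draws if not any(y)) / len(draws)
print(f"empirical all-zero share: {empirical_zero:.3f}")
print(f"theoretical all-zero share: {prob_all_zero(phi, lams):.3f}")
```

Comparing the empirical share of all-zero vectors with `prob_all_zero` is a quick sanity check on the sampler. Data generated this way could then be passed to any clustering routine, such as K-means or hierarchical clustering, for the kind of comparison the abstract describes.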
[1] Hahn MW, Huemer A, Pitt A, Hoetzinger M. (2021) Opening a next‐generation black box: Ecological trends for hundreds of species‐like taxa uncovered within a single bacterial >99% 16S rRNA operational taxonomic unit. Molecular Ecology Resources, 21(7): 2471-2485.
[2] Huttenhower C, Gevers D. (2012) Structure, function and diversity of the healthy human microbiome. Nature, 486(7402): 207-214.
[3] Shade A, Klimowicz AK, Spear RN, Linske M, Donato JJ, Hogan CS, McManus PS, Handelsman J. (2013) Streptomycin application has no detectable effect on bacterial community structure in apple orchard soil. Appl Environ Microbiol, 79(21): 6617-6625.
[4] Gilbert JA, Steele JA, Caporaso JG, Steinbrück L, Reeder J, Temperton B, Huse S, McHardy AC, Knight R, Joint I, Somerfield P, Fuhrman JA, Field D. (2012) Defining seasonal marine microbial community dynamics. ISME J., 6(2): 298-308.
[5] Carstens A, Dicksved J, Nelson R, Lindqvist M, Andreasson A, Bohr J, Tysk C, Talley NJ, Agréus L, Engstrand L, Halfvarson J. (2019) The Gut Microbiota in Collagenous Colitis Shares Characteristics With Inflammatory Bowel Disease-Associated Dysbiosis. Clin Transl Gastroenterol, 10(7): 1-10.
[6] Komesu YM, Dinwiddie DL, Richter HE, Lukacz ES, Sung VW, Siddiqui NY, Zyczynski HM, Ridgeway B, Rogers RG, Arya LA, Mazloomdoost D, Levy J, Carper B, Gantz MG. (2020) Defining the relationship between vaginal and urinary microbiomes. Am J Obstet Gynecol, 222(2): 154.e1-154.e10.
[7] Ott SJ, Musfeldt M, Wenderoth DF, Hampe J, Brant O, Fölsch UR, Timmis KN, Schreiber S. (2004) Reduction in diversity of the colonic mucosa associated bacterial microflora in patients with active inflammatory bowel disease. Gut, 53(5): 685-693.
[8] Luan C, Xie L, Yang X, Miao H, Lv N, Zhang R, Xiao X, Hu Y, Liu Y, Wu N, Zhu Y, Zhu B. (2015) Dysbiosis of fungal microbiota in the intestinal mucosa of patients with colorectal adenomas. Scientific Reports, 5(1): 1-9.
[9] Peters BA, Shapiro JA, Church TR, Miller G, Trinh-Shevrin C, Yuen E, Friedlander C, Hayes RB, Ahn J. (2018) A taxonomic signature of obesity in a large study of American adults. Scientific Reports, 8(1): 1-13.
[10] Holmes I, Harris K, Quince C. (2012) Dirichlet multinomial mixtures: generative models for microbial metagenomics. PLoS ONE, 7(2): e30126.
[11] Liu Y, Tian GL. (2015) Type I multivariate zero-inflated Poisson distribution with applications. Computational Statistics & Data Analysis, 83(1): 200-222.
[12] Lewis CD. (1982) Industrial and Business Forecasting Methods. Butterworths Publishing, London.