| Graduate student: | 蘇靖雅 Su, Ching-Ya |
|---|---|
| Thesis title: | 微生物基因數據之零膨脹廣義卜瓦松雙分群模型 A Block Model of Zero-Inflated Generalized Poisson for Microbial Genetic Data |
| Advisor: | 馬瀰嘉 Ma, Mi-Chia |
| Degree: | Master |
| Department: | College of Management - Department of Statistics |
| Year of publication: | 2024 |
| Academic year of graduation: | 112 (ROC calendar) |
| Language: | Chinese |
| Pages: | 59 |
| Keywords: | Biclustering, Latent Block Model, Type I Multivariate Zero-Inflated Generalized Poisson Distribution |
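With the development of next-generation sequencing (NGS), large volumes of gene sequence data can now be obtained in a single run, and the size and complexity of these data make analysis harder. In biology, gene sequences are therefore aligned and classified so that similar sequences are assigned to the same Operational Taxonomic Unit (OTU). The advantage is that OTUs make composition and diversity convenient to analyze; the drawback is that different species with similar gene sequences may be assigned to the same OTU. Holmes et al. assumed that microbial communities follow a Dirichlet-multinomial distribution, but clustered only the patients. Because OTU data are sparse and discrete, and because clustering the OTUs as well reveals structural differences between microbial communities, this study uses the type I multivariate zero-inflated generalized Poisson distribution (MZIGP) to build a biclustering model. The biclustering model clusters subjects and OTUs simultaneously to find the structure matrix of the data: the data are partitioned into blocks, observations within a block are more similar to one another, and the latent structure hidden in the data is thereby uncovered.
In the simulation study and the empirical analysis, this study uses simulated data and a real data set, respectively, to evaluate how well patients and OTUs are clustered simultaneously. In addition to the common K-means algorithm, the Latent Block Model (LBM), which is also a biclustering method, is compared with the proposed model. In the simulations, we generate data with two groups of people and three groups of OTUs and observe, under different parameter settings, the accuracy of the three methods in clustering patients and in clustering OTUs. In the real-data example, we use the gut microbiome data from Holmes et al. to compare the clustering performance of the models and to analyze OTU composition and diversity across the groups of people.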
With next-generation sequencing (NGS), large amounts of genetic sequence data can be obtained at once, making analysis more complex. In biology, similar gene sequences are grouped into Operational Taxonomic Units (OTUs) to facilitate the analysis of composition and diversity.
Holmes et al. assumed that microbial communities follow a Dirichlet-multinomial distribution, but clustered only the individuals. Because OTU data are sparse and discrete, and because clustering the OTUs as well reveals structural differences between microbial communities, we base the proposed block model on the type I multivariate zero-inflated generalized Poisson distribution (MZIGP). The block model clusters individuals and OTUs simultaneously and identifies the structure matrix of the data by dividing it into blocks, each of which has higher internal similarity.
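As a rough illustration of the distribution named above, the sketch below evaluates the generalized Poisson probability mass function in the parameterization of Consul and Jain (1973), P(X = x) = θ(θ + λx)^(x−1) e^(−(θ + λx)) / x!, and then combines independent components with one shared zero-inflation indicator. The shared-indicator form is an assumption about what "type I" means here, and the function names `gp_logpmf` and `mzigp1_logpmf` are hypothetical; the thesis may parameterize the model differently.

```python
import math


def gp_logpmf(x, theta, lam):
    """Log-PMF of the generalized Poisson distribution.

    Assumes theta > 0 and 0 <= lam < 1 (Consul & Jain parameterization).
    """
    if x < 0:
        return -math.inf
    rate = theta + lam * x
    return (math.log(theta) + (x - 1) * math.log(rate)
            - rate - math.lgamma(x + 1))


def mzigp1_logpmf(y, phi, thetas, lams):
    """Log-PMF of a count vector y with one shared zero-inflation indicator.

    Assumed form: with probability phi the whole vector is zero; otherwise
    the components are independent generalized Poisson variables.
    Assumes 0 < phi < 1.
    """
    log_gp = sum(gp_logpmf(yj, t, l) for yj, t, l in zip(y, thetas, lams))
    if any(yj != 0 for yj in y):
        # A nonzero vector can only come from the generalized Poisson part.
        return math.log1p(-phi) + log_gp
    # The all-zero vector can come from either part of the mixture.
    return math.log(phi + (1.0 - phi) * math.exp(log_gp))
```

With λ = 0 the generalized Poisson reduces to the ordinary Poisson, so the same functions also cover a zero-inflated Poisson block as a special case.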
We use simulated and real-world data to evaluate how well the model clusters individuals and OTUs simultaneously, and we compare the proposed model with the commonly used K-means algorithm and the Latent Block Model (LBM). Numerical simulations with two groups of individuals and three groups of OTUs are conducted to observe clustering accuracy under various parameter settings. We also apply the proposed method to the gut microbiota data from Holmes et al. for an empirical comparison of the biclustering results.
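Clustering accuracy has to be computed up to a relabeling of the clusters, since a method may call the first true group "cluster 1" or "cluster 0". The sketch below shows one common way to do this by trying every permutation of the predicted labels, which is cheap for two row groups and three column groups. The abstract does not state the exact accuracy definition used in the thesis, so this is an assumption, and the function name `clustering_accuracy` and the example labels are hypothetical.

```python
from itertools import permutations


def clustering_accuracy(true_labels, pred_labels, n_clusters):
    """Best fraction of agreeing labels over all relabelings of the clusters.

    Labels are assumed to be integers in range(n_clusters).
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    best_hits = 0
    for perm in permutations(range(n_clusters)):
        # perm[p] is the true-group label assigned to predicted cluster p.
        hits = sum(1 for t, p in zip(true_labels, pred_labels) if perm[p] == t)
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


# Made-up labels: 6 individuals in 2 groups, 6 OTUs in 3 groups.
row_true, row_pred = [0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 1]
col_true, col_pred = [0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1]
row_acc = clustering_accuracy(row_true, row_pred, 2)
col_acc = clustering_accuracy(col_true, col_pred, 3)
print(row_acc, col_acc)
```

Row and column accuracy are reported separately here, which matches the abstract's description of evaluating the clustering of individuals and of OTUs. For many clusters, an assignment solver would replace the exhaustive permutation search.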
1. Aryal, S., Alimadadi, A., Manandhar, I., Joe, B., & Cheng, X. (2020). Machine learning strategy for gut microbiome-based diagnostic screening of cardiovascular disease. Hypertension, 76(5), 1555–1562. https://doi.org/10.1161/HYPERTENSIONAHA.120.15885
2. Aubert, J., Schbath, S., & Robin, S. (2021). Model-based biclustering for overdispersed count data with application in microbial ecology. Methods in Ecology and Evolution, 12(6), 1050–1061. https://doi.org/10.1111/2041-210X.13582
3. Camargo, A. P., Nayfach, S., Chen, I. A., Palaniappan, K., Ratner, A., Chu, K., Ritter, S. J., Reddy, T. B. K., Mukherjee, S., Schulz, F., Call, L., Neches, R. Y., Woyke, T., Ivanova, N. N., Eloe-Fadrosh, E. A., Kyrpides, N. C., & Roux, S. (2023). IMG/VR v4: An expanded database of uncultivated virus genomes within a framework of extensive functional, taxonomic, and ecological metadata. Nucleic Acids Research, 51(D1), D733–D743. https://doi.org/10.1093/nar/gkac1037
4. Consul, P. C., & Jain, G. C. (1973). A Generalization of the Poisson Distribution. Technometrics, 15(4), 791–799. https://doi.org/10.2307/1267389
5. Feio, M. J., Serra, S. R. Q., Mortágua, A., Bouchez, A., Rimet, F., Vasselon, V., & Almeida, S. F. P. (2020). A taxonomy-free approach based on machine learning to assess the quality of rivers with diatoms. Science of The Total Environment, 722, 137900. https://doi.org/10.1016/j.scitotenv.2020.137900
6. Govaert, G., & Nadif, M. (2003). Clustering with block mixture models. Pattern Recognition, 36(2), 463–473. https://doi.org/10.1016/S0031-3203(02)00074-2
7. Govaert, G., & Nadif, M. (2008). Block clustering with Bernoulli mixture models: Comparison of different approaches. Computational Statistics and Data Analysis, 52(6), 3233–3245. https://doi.org/10.1016/j.csda.2007.09.007
8. Ha, M. J., Kim, J., Galloway-Peña, J., Do, K.-A., & Peterson, C. B. (2020). Compositional zero-inflated network estimation for microbiome data. BMC Bioinformatics, 21(Suppl 21), 581. https://doi.org/10.1186/s12859-020-03911-w
9. Holmes, I., Harris, K., & Quince, C. (2012). Dirichlet multinomial mixtures: Generative models for microbial metagenomics. PLoS ONE, 7(2), e30126. https://doi.org/10.1371/journal.pone.0030126
10. Huang, X. F., Tian, G. L., Zhang, C., & Jiang, X. (2017). Type I multivariate zero-inflated generalized Poisson distribution with applications. Statistics and Its Interface, 10(2), 291–311. https://doi.org/10.4310/SII.2017.v10.n2.a12
11. Lewis, C. D. (1982). Industrial and business forecasting methods: a practical guide to exponential smoothing and curve fitting. London: Butterworth Scientific.
12. Ma, M. C., & Yang, C. H. (2023). Use type I multivariate zero-inflated Poisson model and microbial metagenomics data to group subjects. Journal of Taiwan Intelligent Technologies and Applied Statistics, 20(2), 1–21.
13. Nguyen, N.-P., Warnow, T., Pop, M., & White, B. (2016). A perspective on 16S rRNA operational taxonomic unit clustering using sequence similarity. npj Biofilms and Microbiomes, 2, 16004. https://doi.org/10.1038/npjbiofilms.2016.4
14. Ren, Z., Fan, Y., Li, A., Shen, Q., Wu, J., Ren, L., Lu, H., Ding, S., Ren, H., Liu, C., Liu, W., Gao, D., Wu, Z., Guo, S., Wu, G., Liu, Z., Yu, Z., & Li, L. (2020). Alterations of the Human Gut Microbiome in Chronic Kidney Disease. Advanced Science, 7(20), 2001936. https://doi.org/10.1002/advs.202001936
15. Roesch, L. F. W., Fulthorpe, R. R., Riva, A., Casella, G., Hadwin, A. K. M., Kent, A. D., Daroub, S. H., Camargo, F. A. O., Farmerie, W. G., & Triplett, E. W. (2007). Pyrosequencing enumerates and contrasts soil microbial diversity. ISME Journal, 1(4), 283–290. https://doi.org/10.1038/ismej.2007.53
16. Singer, D., Seppey, C. V. W., Lentendu, G., Dunthorn, M., Bass, D., Belbahri, L., Blandenier, Q., Debroas, D., de Groot, G. A., de Vargas, C., Domaizon, I., Duckert, C., Izaguirre, I., Koenig, I., Mataloni, G., Schiaffino, M. R., Mitchell, E. A. D., Geisen, S., & Lara, E. (2021). Protist taxonomic and functional diversity in soil, freshwater and marine ecosystems. Environment International, 146, 106262. https://doi.org/10.1016/j.envint.2020.106262
17. Turnbaugh, P. J., Quince, C., Faith, J. J., McHardy, A. C., Yatsunenko, T., Niazi, F., Affourtit, J., Egholm, M., Henrissat, B., Knight, R., & Gordon, J. I. (2010). Organismal, genetic, and transcriptional variation in the deeply sequenced gut microbiomes of identical twins. Proceedings of the National Academy of Sciences of the United States of America, 107(16), 7503–7508. https://doi.org/10.1073/pnas.1002355107
18. Wang, H., Altemus, J., Niazi, F., Green, H., Calhoun, B. C., Sturgis, C., Grobmyer, S. R., & Eng, C. (2017). Breast tissue, oral and urinary microbiomes in breast cancer. Oncotarget, 8(50), 88122–88138. https://doi.org/10.18632/oncotarget.21490
On-campus access: available from 2029-08-14