| Graduate student: | 張韶恩 Chang, Shao-En |
|---|---|
| Thesis title: | 對OTU資料分組和對病患分群之研究 A study for grouping of OTU data and patients |
| Advisor: | 馬瀰嘉 Ma, Mi-Chia |
| Degree: | Master |
| Department: | College of Management - Department of Statistics |
| Year of publication: | 2023 |
| Academic year of graduation: | 111 |
| Language: | Chinese |
| Number of pages: | 55 |
| Chinese keywords: | 廣義卜瓦松分配, EM演算法, 資料分類 |
| English keywords: | Generalized Poisson distribution, EM algorithm, Data classification |
Operational taxonomic units (OTUs) are used to classify populations of organisms. The genetic sequences of organisms are labeled, and similar sequence fragments are grouped together and treated as a single OTU. As gene sequencing technology has matured, OTUs have come into increasingly wide use, and many research fields now rely on them for analysis. Studies have linked colorectal cancer, oral cancer, and human papillomavirus infection to the human microbiota, so grouping microorganisms and identifying which patients carry OTUs related to these conditions is an important problem in modern medicine. Because genetic data are discrete and sparse, this study assumes that the OTU data follow a generalized Poisson model, which is suited to discrete data and can also describe the degree of dispersion in the data.
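As a rough illustration of the generalized Poisson model mentioned above, the sketch below evaluates its probability mass function in the form given by Consul and Jain (1973), P(X = x) = θ(θ + λx)^(x-1) e^(-θ-λx) / x!. The exact parameterization used in the thesis is not stated here, so the restriction to θ > 0 and 0 ≤ λ < 1, along with all function and parameter names, is an assumption made for this example.

```python
import math


def generalized_poisson_pmf(x: int, theta: float, lam: float) -> float:
    """P(X = x) under the Consul-Jain generalized Poisson distribution.

    Assumed parameter range for this sketch: theta > 0 and 0 <= lam < 1.
    With lam = 0 the formula reduces to the ordinary Poisson pmf.
    """
    if theta <= 0 or not (0 <= lam < 1):
        raise ValueError("this sketch assumes theta > 0 and 0 <= lam < 1")
    if x < 0:
        return 0.0
    # Work on the log scale so that large counts do not overflow.
    log_p = (
        math.log(theta)
        + (x - 1) * math.log(theta + lam * x)
        - (theta + lam * x)
        - math.lgamma(x + 1)
    )
    return math.exp(log_p)


def generalized_poisson_mean(theta: float, lam: float) -> float:
    """Mean of the distribution: theta / (1 - lam)."""
    return theta / (1 - lam)


def generalized_poisson_variance(theta: float, lam: float) -> float:
    """Variance of the distribution: theta / (1 - lam)^3."""
    return theta / (1 - lam) ** 3
```

In this form the variance-to-mean ratio is 1 / (1 - λ)², so any λ > 0 gives a variance larger than the mean, which is how the model can express the dispersion of count data.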
The objective of this study is to group genetic data. First, hierarchical clustering is used to group the OTUs, placing highly correlated OTUs in the same group. The subjects are then clustered with the generalized Poisson model and the correlated generalized Poisson model. Statistical simulations and real data are used to compare the clustering results of the latent block model, the K-means method, and the proposed generalized Poisson model.
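The first step, grouping highly correlated OTUs by hierarchical clustering, might look roughly like the sketch below. The linkage rule and distance used in the thesis are not given here, so average linkage on the distance 1 - (Pearson correlation) is an assumption, as are the function names and the toy counts, which are invented for illustration only.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equal-length count vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant vector is treated as uncorrelated
    return cov / math.sqrt(var_a * var_b)


def group_otus(counts, n_groups):
    """Agglomerative clustering of OTUs with average linkage (assumed).

    counts maps each OTU name to its list of counts across subjects.
    Returns n_groups groups, each a sorted list of OTU names.
    """
    names = list(counts)
    # Pairwise distance: highly correlated OTUs are close together.
    dist = {
        frozenset((a, b)): 1.0 - pearson(counts[a], counts[b])
        for a, b in combinations(names, 2)
    }

    def average_linkage(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [[name] for name in names]
    while len(clusters) > n_groups:
        # Merge the two clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Toy counts (rows: OTUs, columns: six hypothetical subjects).
toy_counts = {
    "otu_a": [0, 1, 5, 9, 0, 2],
    "otu_b": [0, 2, 10, 17, 1, 4],
    "otu_c": [8, 6, 0, 1, 7, 5],
    "otu_d": [15, 13, 1, 0, 16, 9],
}
groups = group_otus(toy_counts, n_groups=2)
# groups == [["otu_a", "otu_b"], ["otu_c", "otu_d"]]
```

On real data a library routine such as SciPy's hierarchical clustering would normally be preferred. The pure-Python version is used here only to keep the example self-contained.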
Holmes et al. (2012) assumed that microbial communities follow a Dirichlet multinomial mixture (DMM) distribution. In the real-data example, we use the data provided by Holmes et al. (2012) to compare the classification results of the generalized Poisson model, K-means, hierarchical clustering, the latent block model, the DMM method, and random forest. The results show that the generalized Poisson model classifies best when the OTUs are divided into 5 groups, and that dividing the OTUs into too many groups may reduce performance. A comparison of accuracy, sensitivity, specificity, positive predictive value, and negative predictive value shows that the generalized Poisson model performs particularly well at distinguishing underweight subjects. In addition, the correlated generalized Poisson model performs well in overall clustering.
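The five evaluation metrics named above can all be read off a two-by-two confusion matrix once one class is treated as the positive class. The sketch below shows the standard definitions. The function name, the labels, and the example predictions are hypothetical and are not results from the thesis.

```python
import math


def binary_metrics(y_true, y_pred, positive):
    """Accuracy, sensitivity, specificity, PPV and NPV for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else math.nan

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
    }


# Made-up labels for eight hypothetical subjects.
truth = ["underweight"] * 3 + ["other"] * 5
predicted = ["underweight", "underweight", "other",
             "other", "other", "other", "underweight", "other"]
metrics = binary_metrics(truth, predicted, positive="underweight")
# Here TP = 2, FN = 1, FP = 1, TN = 4, so accuracy is 6/8 = 0.75.
```

With more than two subject groups, the same function can be applied once per group in a one-versus-rest fashion, which is one way to report how well a method separates a single group such as underweight subjects.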
1. Sokal, R.R. and Sneath, The Principles and Practice of Numerical Taxonomy. Taxon, 1963. 12(5): p. 190-199.
2. Qu, K., Gao, F., Guo, F., & Zou, Q., Taxonomy dimension reduction for colorectal cancer prediction. Computational Biology and Chemistry, 2019. 83: p. 107160.
3. Koeppel, A.F. and M. Wu, Surprisingly extensive mixed phylogenetic and ecological signals among bacterial Operational Taxonomic Units. Nucleic Acids Research, 2013. 41(10): p. 5175-5188.
4. Moeseneder, M.M., C. Winter, and G.J. Herndl, Horizontal and vertical complexity of attached and free-living bacteria of the eastern Mediterranean Sea, determined by 16S rDNA and 16S rRNA fingerprints. Limnology and Oceanography, 2001. 46(1): p. 95-107.
5. Guerrero-Preston, R., Godoy-Vitorino, F., Jedlicka, A., Rodríguez-Hilario, A., González, H., Bondy, J., ... & Sidransky, D., 16S rRNA amplicon sequencing identifies microbiota associated with oral cancer, human papilloma virus infection and surgical treatment. Oncotarget, 2016. 7(32): p. 51320-51334.
6. Ortiz‐Estrada, Á. M., Gollas‐Galván, T., Martínez‐Córdova, L. R., & Martínez‐Porchas, M., Predictive functional profiles using metagenomic 16S rRNA data: a novel approach to understanding the microbial ecology of aquaculture systems. Reviews in Aquaculture, 2019. 11(1): p. 234-245.
7. Kollarcikova, M., Kubasova, T., Karasova, D., Crhanova, M., Cejkova, D., Sisak, F., & Rychlik, I., Use of 16S rRNA gene sequencing for prediction of new opportunistic pathogens in chicken ileal and cecal microbiota. Poultry Science, 2019. 98(6): p. 2347-2353.
8. Azaroual, S. E., Kasmi, Y., Aasfar, A., El Arroussi, H., Zeroual, Y., El Kadiri, Y., ... & Meftah Kadmiri, I., Investigation of bacterial diversity using 16S rRNA sequencing and prediction of its functionalities in Moroccan phosphate mine ecosystem. Scientific Reports, 2022. 12(1): p. 3741.
9. Govaert, G. and M. Nadif, Clustering with block mixture models. Pattern Recognition, 2003. 36(2): p. 463-473.
10. Aubert, J., S. Schbath, and S. Robin, Model-based biclustering for overdispersed count data with application in microbial ecology. Methods in Ecology and Evolution, 2021. 12(6): p. 1050-1061.
11. Consul, P.C. and G.C. Jain, A Generalization of the Poisson Distribution. Technometrics, 1973. 15(4): p. 791-799.
12. Holmes, I., K. Harris, and C. Quince, Dirichlet Multinomial Mixtures: Generative Models for Microbial Metagenomics. PLOS ONE, 2012. 7(2): p. e30126.
On campus: publicly available from 2026-01-18