Graduate student: Chang, Shao-En (張韶恩)
Thesis title: A study for grouping of OTU data and patients (對OTU資料分組和對病患分群之研究)
Advisor: Ma, Mi-Chia (馬瀰嘉)
Degree: Master
Department: College of Management - Department of Statistics
Year of publication: 2023
Academic year of graduation: 111 (ROC calendar)
Language: Chinese
Number of pages: 55
Keywords (Chinese): 廣義卜瓦松分配, EM演算法, 資料分類
Keywords (English): Generalized Poisson distribution, EM algorithm, Data classification
  • An Operational Taxonomic Unit (OTU) is a unit used to classify populations of organisms: gene sequences are labeled, and fragments with similar sequences are grouped together and treated as one OTU. As gene sequencing technology has matured, OTUs have come into increasingly wide use, and studies in many fields rely on them for analysis. Previous studies report that colorectal cancer, oral cancer, and human papillomavirus infection are all associated with the human microbiota, so grouping microorganisms to identify which patients carry OTUs related to these conditions is an important topic in modern medicine. Because genetic data are discrete and sparse, this study assumes that OTU data follow a generalized Poisson model, which is suited to discrete data and can also describe the degree of dispersion in the data.
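As a rough illustration of the distributional assumption, the sketch below evaluates the generalized Poisson probability mass function in the parameterization of Consul and Jain (1973), P(X = x) = θ(θ + λx)^(x-1) e^(-θ-λx) / x!, and checks numerically that its mean and variance are θ/(1-λ) and θ/(1-λ)^3. The function names, the restriction 0 ≤ λ < 1, and the parameter values are assumptions made for this example; the record does not state which parameterization the thesis uses.

```python
import math

def gp_logpmf(x, theta, lam):
    """Log-probability of count x under the generalized Poisson distribution.

    Parameterization of Consul and Jain (1973); this sketch assumes
    theta > 0 and 0 <= lam < 1 (lam = 0 gives the ordinary Poisson).
    """
    if x < 0:
        return -math.inf
    return (math.log(theta) + (x - 1) * math.log(theta + lam * x)
            - theta - lam * x - math.lgamma(x + 1))

def gp_pmf(x, theta, lam):
    return math.exp(gp_logpmf(x, theta, lam))

# Illustrative parameter values (not taken from the thesis).
theta, lam = 2.0, 0.3

# Truncate the infinite support; the tail beyond 400 is negligible here.
support = range(400)
probs = [gp_pmf(x, theta, lam) for x in support]

total = sum(probs)
mean = sum(x * p for x, p in zip(support, probs))
var = sum((x - mean) ** 2 * p for x, p in zip(support, probs))

print("total probability:", total)
print("mean:", mean, "theory:", theta / (1 - lam))
print("variance:", var, "theory:", theta / (1 - lam) ** 3)
```

With λ > 0 the variance exceeds the mean, which is how this model can represent overdispersed counts that an ordinary Poisson model cannot.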
    The objective of this study is to group genetic data. First, hierarchical clustering is used to group the OTUs, assigning highly correlated OTUs to the same group. The generalized Poisson model and the correlated generalized Poisson model are then used to cluster the subjects. Statistical simulations and real data are used to compare the clustering results of the latent block model, the K-means method, and the generalized Poisson model proposed in this study.
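The first step, grouping highly correlated OTUs by hierarchical clustering, might look roughly like the sketch below. It is a toy version under stated assumptions: the count table is made up, the distance is taken to be 1 minus the Pearson correlation, and average linkage is used. The record does not say which distance or linkage the thesis adopts, and `pearson` and `average_linkage` are hypothetical helper names.

```python
import math

def pearson(u, v):
    """Pearson correlation between two equal-length count vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def average_linkage(dist, n_groups):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest average pairwise distance until n_groups clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_groups:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(dist[a][b] for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

# Made-up counts: one row per OTU, one column per subject.
otu_counts = [
    [1, 5, 2, 8, 3, 9, 4, 7],
    [2, 11, 4, 17, 6, 19, 8, 15],
    [4, 8, 5, 12, 6, 12, 7, 10],
    [9, 9, 1, 1, 8, 8, 2, 2],
    [19, 18, 3, 3, 17, 16, 5, 4],
    [10, 9, 2, 1, 8, 9, 2, 3],
]

# Assumed dissimilarity: 1 - correlation, so highly correlated OTUs are close.
n_otus = len(otu_counts)
dist = [[1.0 - pearson(otu_counts[a], otu_counts[b]) for b in range(n_otus)]
        for a in range(n_otus)]

otu_groups = average_linkage(dist, n_groups=2)
print(otu_groups)
```

In this toy table the first three OTUs rise and fall together across subjects, as do the last three, so the two groups recovered are those two blocks.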
    Holmes et al. (2012) assumed that microbial communities follow a Dirichlet multinomial mixture (DMM) distribution. In the empirical example, we use the data provided by Holmes et al. (2012) to compare the classification results of the generalized Poisson model, K-means, hierarchical clustering, the latent block model, the DMM method, and random forest. The results show that the generalized Poisson model performs best when the OTUs are divided into 5 groups, and that using too many OTU groups may reduce its effectiveness. A comparison of accuracy, sensitivity, specificity, positive predictive value, and negative predictive value shows that the generalized Poisson model performs very well at identifying underweight subjects. In addition, the correlated generalized Poisson model performs well in the overall clustering results.
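For reference, the five evaluation metrics named above can be computed from a binary confusion table as in the following sketch. The labels are invented for illustration and are not results from the thesis; `binary_metrics` is a hypothetical helper, and the sketch assumes cluster labels have already been matched to the true classes.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, specificity, PPV and NPV for binary labels.

    Assumes every denominator is non-zero (both classes occur in y_true
    and in y_pred); no zero-division handling in this sketch.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }

# Invented example: 1 marks the class of interest, 0 marks everyone else.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting all five together matters when the classes are unbalanced, since accuracy alone can hide poor sensitivity for the smaller class.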

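    Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
      1.1 Research Background and Motivation
      1.2 Research Objectives and Methods
      1.3 Research Structure
    Chapter 2 Literature Review
      2.1 Latent Block Model
      2.2 Generalized Poisson Distribution
      2.3 Methods for Grouping Genetic Data
    Chapter 3 Research Methods
      3.1 Hierarchical Clustering
      3.2 Generalized Poisson Model
      3.3 Correlated Generalized Poisson Model
    Chapter 4 Statistical Analysis
      4.1 Known Number of OTU Groups with Equal Group Sizes
      4.2 Known Number of OTU Groups with Unequal Group Sizes
      4.3 Changing the Assumed Data Distribution
      4.4 Correlated Generalized Poisson Model
      4.5 Empirical Analysis
    Chapter 5 Conclusions and Future Outlook
      5.1 Conclusions
      5.2 Suggestions for Future Research
    Appendix
    References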

    1. Sokal, R.R. and Sneath, The Principles and Practice of Numerical Taxonomy. Taxon, 1963. 12(5): p. 190-199.
    2. Qu, K., F. Gao, F. Guo, and Q. Zou, Taxonomy dimension reduction for colorectal cancer prediction. Computational Biology and Chemistry, 2019. 83: p. 107160.
    3. Koeppel, A.F. and M. Wu, Surprisingly extensive mixed phylogenetic and ecological signals among bacterial Operational Taxonomic Units. Nucleic Acids Research, 2013. 41(10): p. 5175-5188.
    4. Moeseneder, M.M., C. Winter, and G.J. Herndl, Horizontal and vertical complexity of attached and free-living bacteria of the eastern Mediterranean Sea, determined by 16S rDNA and 16S rRNA fingerprints. Limnology and Oceanography, 2001. 46(1): p. 95-107.
    5. Guerrero-Preston, R., F. Godoy-Vitorino, A. Jedlicka, A. Rodríguez-Hilario, H. González, J. Bondy, ... and D. Sidransky, 16S rRNA amplicon sequencing identifies microbiota associated with oral cancer, human papilloma virus infection and surgical treatment. Oncotarget, 2016. 7(32): p. 51320-51334.
    6. Ortiz-Estrada, Á.M., T. Gollas-Galván, L.R. Martínez-Córdova, and M. Martínez-Porchas, Predictive functional profiles using metagenomic 16S rRNA data: a novel approach to understanding the microbial ecology of aquaculture systems. Reviews in Aquaculture, 2019. 11(1): p. 234-245.
    7. Kollarcikova, M., T. Kubasova, D. Karasova, M. Crhanova, D. Cejkova, F. Sisak, and I. Rychlik, Use of 16S rRNA gene sequencing for prediction of new opportunistic pathogens in chicken ileal and cecal microbiota. Poultry Science, 2019. 98(6): p. 2347-2353.
    8. Azaroual, S.E., Y. Kasmi, A. Aasfar, H. El Arroussi, Y. Zeroual, Y. El Kadiri, ... and I. Meftah Kadmiri, Investigation of bacterial diversity using 16S rRNA sequencing and prediction of its functionalities in Moroccan phosphate mine ecosystem. Scientific Reports, 2022. 12(1): p. 3741.
    9. Govaert, G. and M. Nadif, Clustering with block mixture models. Pattern Recognition, 2003. 36(2): p. 463-473.
    10. Aubert, J., S. Schbath, and S. Robin, Model-based biclustering for overdispersed count data with application in microbial ecology. Methods in Ecology and Evolution, 2021. 12(6): p. 1050-1061.
    11. Consul, P.C. and G.C. Jain, A Generalization of the Poisson Distribution. Technometrics, 1973. 15(4): p. 791-799.
    12. Holmes, I., K. Harris, and C. Quince, Dirichlet Multinomial Mixtures: Generative Models for Microbial Metagenomics. PLOS ONE, 2012. 7(2): p. e30126.

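    Full text availability: on campus, open from 2026-01-18; off campus, open from 2026-01-18.
    The electronic thesis has not yet been authorized for public release; for the print copy, please check the library catalog.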