| Graduate student: | 黃健坤 Wong, Jian-Kun |
|---|---|
| Thesis title: | 卷積神經網絡結合生物分類階層之疾病表現型預測模型 Using Taxonomy Rank for Convolutional Neural Networks to Predict Host Phenotypes |
| Advisor: | 馬瀰嘉 Ma, Mi-Chia |
| Degree: | Master |
| Department: | College of Management - Institute of Data Science |
| Publication year: | 2021 |
| Academic year of graduation: | 109 |
| Language: | Chinese |
| Number of pages: | 45 |
| Keywords (Chinese): | 生物分類階層, 卷積神經網絡, 表現型預測 |
| Keywords (English): | Taxonomy Rank, Convolutional Neural Network, Phenotype Prediction |
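The rapid development of next-generation sequencing (NGS) has reduced the time and cost of genetic testing. By using NGS to match 16S rRNA sequences to their corresponding operational taxonomic units (OTUs), the relative abundance of the OTUs of bacteria in the human body can be obtained. The distribution of bacteria is known in medicine to affect human health. Previous studies have predicted disease phenotypes with machine learning methods such as random forest (RF), support vector machine (SVM), or multilayer perceptron neural network (MLPNN), but the input data of these methods do not take the taxonomy rank of the bacteria into account.
Following Reiman et al. (2020), who convert a phylogenetic tree into a two-dimensional matrix and use a convolutional neural network (CNN) to predict disease phenotypes, this study converts the taxonomy rank of the bacteria into a sample phylogenetic tree structure and then converts the tree into a two-dimensional matrix, filling in each sample's relative abundance at every node. The first row holds the root node; in each following row, the nodes are arranged in the order of their parent nodes, and the gap between one node and the next equals the number of leaf nodes of the former minus 1, and so on, so that each parent node lies directly above its first child node. Unlike Reiman et al., when building the sample tree this study applies hierarchical clustering, based on the relative abundances of the samples in the case data, to child nodes that share the same parent and rearranges their positions in the tree, rather than using the PhyloT phylogenetic tree as the template for node positions. This yields the final sample abundance tree; the relative abundance of each OTU of every sample is then filled into the corresponding position of the template to obtain a two-dimensional matrix for each sample. Also following Reiman et al., the taxonomy rank matrices of the samples are used as the input of the CNN, which extracts features of the taxonomic hierarchy in the two-dimensional matrix to build a disease phenotype prediction model. Compared with the method of Reiman et al., the proposed matrix construction does not require a phylogenetic tree as a template and also retains information across the case samples. In the case-study comparison, the AUROC of the proposed method improved over that of Reiman et al. on every dataset, with a maximum gain of 0.03.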
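The matrix layout described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the `Node` class, the function names, and the toy abundances are all hypothetical, and the sketch assumes that a leaf counts as one leaf of itself, so that "leaf count minus 1" is the number of empty cells between a node and its next sibling.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: a taxon with one sample's relative abundance."""
    name: str
    abundance: float = 0.0
    children: list = field(default_factory=list)


def leaf_count(node):
    """Number of leaves under a node (a leaf counts as 1)."""
    if not node.children:
        return 1
    return sum(leaf_count(child) for child in node.children)


def depth(node):
    """Number of levels in the subtree rooted at node."""
    if not node.children:
        return 1
    return 1 + max(depth(child) for child in node.children)


def tree_to_matrix(root):
    """Lay the tree out row by row, each parent above its first child."""
    matrix = [[0.0] * leaf_count(root) for _ in range(depth(root))]

    def place(node, row, col):
        matrix[row][col] = node.abundance
        child_col = col  # the first child sits directly below its parent
        for child in node.children:
            place(child, row + 1, child_col)
            # Skipping leaf_count(child) columns leaves a gap of
            # leaf_count(child) - 1 empty cells before the next sibling.
            child_col += leaf_count(child)

    place(root, 0, 0)
    return matrix


# Toy tree (assumed values): root -> A (leaves a1, a2) and B (leaf b1).
root = Node("root", 1.0, [
    Node("A", 0.6, [Node("a1", 0.4), Node("a2", 0.2)]),
    Node("B", 0.4, [Node("b1", 0.4)]),
])
matrix = tree_to_matrix(root)
for row in matrix:
    print(row)
```

In this toy tree, node A has two leaves, so one empty cell separates A from B in the second row, and the matrix is as wide as the total number of leaves.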
Advances in next-generation sequencing (NGS) have reduced the time and cost of gene sequencing. NGS can identify the operational taxonomic units (OTUs) of the bacteria in the human body and their relative abundances. The distribution of bacteria is known to affect human health. Previous studies used machine learning methods to predict disease phenotypes but did not consider the taxonomy rank of the bacteria.
Our method follows Reiman et al. (2020), who convert the phylogenetic tree into a two-dimensional matrix and use a convolutional neural network (CNN) to predict the disease phenotype. First, we convert the taxonomy rank of the bacteria into a sample relative abundance tree and then convert the tree into a two-dimensional matrix. Next, we fill in the relative abundance of each sample at the corresponding position in the matrix. We also rearrange the positions of nodes that share the same parent by hierarchical clustering on the relative abundances of the samples. Finally, we compare different arrangements of nodes in the matrix and different sets of samples used in the hierarchical clustering. Compared with Reiman et al., our method constructs the sample relative abundance tree without using the PhyloT tool and incorporates information from the samples.
The best arrangement places the root node of the sample relative abundance tree in the first row and first column; the nodes in each following row are arranged in the order of their parent nodes, and the distance between two adjacent nodes is the number of leaf nodes of the first node minus 1, and so on, so that each parent node is above its first child node. In addition, using all samples in the hierarchical clustering performs better than using only the case samples. In the case studies, our method improved the AUROC by up to 0.03.
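The reordering of sibling nodes by hierarchical clustering can be sketched as follows. This is an illustrative sketch only: the abstract does not state the linkage criterion or distance measure, so average linkage with Euclidean distance is assumed here, and the function name, taxon names, and abundance values are hypothetical. In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would likely be used instead of this hand-written loop.

```python
from math import dist


def cluster_order(profiles):
    """Order sibling taxa so that similar abundance profiles end up adjacent.

    profiles maps a taxon name to its relative abundances across samples.
    Assumption: agglomerative clustering with average linkage and
    Euclidean distance; the source does not specify either choice.
    """
    if not profiles:
        return []
    clusters = [[name] for name in profiles]

    def average_linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        total = sum(dist(profiles[x], profiles[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs,
                   key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]


# Hypothetical siblings under one parent, with abundances in three samples.
sibling_profiles = {
    "taxon_A": [0.30, 0.28, 0.05],
    "taxon_B": [0.02, 0.03, 0.40],
    "taxon_C": [0.29, 0.27, 0.06],
    "taxon_D": [0.03, 0.02, 0.38],
}
order = cluster_order(sibling_profiles)
print(order)
```

Here taxon_A and taxon_C have similar profiles, as do taxon_B and taxon_D, so each pair ends up side by side in the resulting order, which is the property a convolution kernel sliding over adjacent columns can exploit.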
[1] Goodwin S, McPherson JD, McCombie WR: Coming of age: ten years of next-generation sequencing technologies. Nature Reviews Genetics 2016, 17(6):333.
[2] Schloss PD, Handelsman J: Toward a census of bacteria in soil. PLoS Comput Biol 2006, 2(7):e92.
[3] Ohshima C, Takahashi H, Insang S, Phraephaisarn C, Techaruvichit P, Khumthong R, Haraguchi H, Lopetcharat K, Keeratipibul S: Next-generation sequencing reveals predominant bacterial communities during fermentation of Thai fish sauce in large manufacturing plants. LWT 2019, 114:108375.
[4] Shin J, Lee S, Go M-J, Lee SY, Kim SC, Lee C-H, Cho B-K: Analysis of the mouse gut microbiome using full-length 16S rRNA amplicon sequencing. Scientific reports 2016, 6(1):1-10.
[5] Yu J, Feng Q, Wong SH, Zhang D, Liang QY, Qin Y, Tang L, Zhao H, Stenvang J, Li Y: Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut 2017, 66(1):70-78.
[6] Nielsen HB, Almeida M, Juncker AS, Rasmussen S, Li J, Sunagawa S, Plichta DR, Gautier L, Pedersen AG, Le Chatelier E: Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nature biotechnology 2014, 32(8):822-828.
[7] Zeller G, Tap J, Voigt AY, Sunagawa S, Kultima JR, Costea PI, Amiot A, Böhm J, Brunetti F, Habermann N: Potential of fecal microbiota for early‐stage detection of colorectal cancer. Molecular systems biology 2014, 10(11):766.
[8] Le Chatelier E, Nielsen T, Qin J, Prifti E, Hildebrand F, Falony G, Almeida M, Arumugam M, Batto J-M, Kennedy S: Richness of human gut microbiome correlates with metabolic markers. Nature 2013, 500(7464):541-546.
[9] Zhou Q, Zhang X, He R, Wang S, Jiao C, Huang R, He X, Zeng J, Zhao D: The composition and assembly of bacterial communities across the rhizosphere and phyllosphere compartments of Phragmites australis. Diversity 2019, 11(6):98.
[10] Chang H-X, Haudenshield JS, Bowen CR, Hartman GL: Metagenome-wide association study and machine learning prediction of bulk soil microbiome and crop productivity. Frontiers in Microbiology 2017, 8:519.
[11] Moitinho-Silva L, Steinert G, Nielsen S, Hardoim CCP, Wu Y-C, McCormack GP, López-Legentil S, Marchant R, Webster N, Thomas T et al: Predicting the HMA-LMA Status in Marine Sponges by Machine Learning. Frontiers in Microbiology 2017, 8:752.
[12] Pasolli E, Truong DT, Malik F, Waldron L, Segata N: Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLoS computational biology 2016, 12(7):e1004977.
[13] Mulenga M, Kareem SA, Sabri AQM, Seera M, Govind S, Samudi C, Mohamad SB: Feature Extension of Gut Microbiome Data for Deep Neural Network-Based Colorectal Cancer Classification. IEEE Access 2021, 9:23565-23578.
[14] Fioravanti D, Giarratano Y, Maggio V, Agostinelli C, Chierici M, Jurman G, Furlanello C: Phylogenetic convolutional neural networks in metagenomics. BMC bioinformatics 2018, 19(2):1-13.
[15] Reiman D, Metwally AA, Sun J, Dai Y: PopPhy-CNN: a phylogenetic tree embedded architecture for convolutional neural networks to predict host phenotype from metagenomic data. IEEE journal of biomedical and health informatics 2020, 24(10):2993-3001.
[16] McNeill J, Barrie F, Buck W, Demoulin V, Greuter W, Hawksworth D, Herendeen P, Knapp S, Marhold K, Prado J: International Code of Nomenclature for algae, fungi and plants (Melbourne Code), vol. 154: Koeltz Scientific Books Königstein; 2012.
[17] PhyloT: A phylogenetic tree generator, based on NCBI or GTDB taxonomy [https://phylot.biobyte.de/]
[18] Qin J, Li Y, Cai Z, Li S, Zhu J, Zhang F, Liang S, Zhang W, Guan Y, Shen D: A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 2012, 490(7418):55-60.
[19] Pasolli E, Schiffer L, Manghi P, Renson A, Obenchain V, Truong DT, Beghini F, Malik F, Ramos M, Dowd JB: Accessible, curated metagenomic data through ExperimentHub. Nature methods 2017, 14(11):1023.
[20] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 2011, 12:2825-2830.
[21] Linnaeus C: Systema naturae, vol. 1: Stockholm Laurentii Salvii; 1758.
[22] DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, Huber T, Dalevi D, Hu P, Andersen GL: Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Applied and environmental microbiology 2006, 72(7):5069-5072.
[23] Yoon S-H, Ha S-M, Kwon S, Lim J, Kim Y, Seo H, Chun J: Introducing EzBioCloud: a taxonomically united database of 16S rRNA gene sequences and whole-genome assemblies. International journal of systematic and evolutionary microbiology 2017, 67(5):1613.
[24] Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen A, McGarrell DM, Marsh T, Garrity GM: The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic acids research 2009, 37(suppl_1):D141-D145.
[25] Li T, Lai X-L, Zhong Y: Introduction to the methods of constructing phylogenetic trees with DNA sequences. Yi Chuan= Hereditas 2004, 26(2):205-210.
[26] Entringer R: Distance in graphs: trees. Journal of Combinatorial Mathematics and Combinatorial Computing 1997, 24:65-84.
[27] Stuessy TF, König C: Patrocladistic classification. Taxon 2008, 57(2):594-601.
[28] Cox MA, Cox TF: Multidimensional scaling. In: Handbook of data visualization. Springer; 2008: 315-347.
[29] Rosenblatt F: The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review 1958, 65(6):386.
[30] LeCun Y, Bottou L, Bengio Y, Haffner P: Gradient-based learning applied to document recognition. Proceedings of the IEEE 1998, 86(11):2278-2324.
[31] Gu J, Wang Z, Kuen J, Ma L, Shahroudy A, Shuai B, Liu T, Wang X, Wang G, Cai J: Recent advances in convolutional neural networks. Pattern Recognition 2018, 77:354-377.
[32] Bottou L: Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes 1991, 91(8):12.
[33] Kingma DP, Ba J: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 2014.
[34] Han J, Moraga C: The influence of the sigmoid function parameters on the speed of backpropagation learning. In: International Workshop on Artificial Neural Networks: 1995. Springer: 195-201.
[35] Agarap AF: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 2018.
[36] Simonyan K, Zisserman A: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 2014.
[37] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M: Imagenet large scale visual recognition challenge. International journal of computer vision 2015, 115(3):211-252.
[38] Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A: Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition: 2015. 1-9.
[39] Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov RR: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 2012.
[40] Breiman L: Random forests. Machine learning 2001, 45(1):5-32.
[41] Breiman L: Bagging predictors. Machine learning 1996, 24(2):123-140.
[42] Ho TK: The random subspace method for constructing decision forests. IEEE transactions on pattern analysis and machine intelligence 1998, 20(8):832-844.
[43] Cortes C, Vapnik V: Support-vector networks. Machine learning 1995, 20(3):273-297.
[44] Boser BE, Guyon IM, Vapnik VN: A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on Computational learning theory: 1992. 144-152.
[45] Ward Jr JH: Hierarchical grouping to optimize an objective function. Journal of the American statistical association 1963, 58(301):236-244.
[46] Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M: Tensorflow: A system for large-scale machine learning. In: 12th USENIX symposium on operating systems design and implementation (OSDI 16): 2016. 265-283.
[47] Sokol H, Leducq V, Aschard H, Pham H-P, Jegou S, Landman C, Cohen D, Liguori G, Bourrier A, Nion-Larmurier I: Fungal microbiota dysbiosis in IBD. Gut 2017, 66(6):1039-1048.