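Author: 張凱鈞 (Jhang, Kai-Jyun)
Thesis title: 應用機器學習方法從基因體預測真菌在生態上扮演的角色
Applying machine learning model to predict ecological role of fungi on the basis of genomic profile
Advisor: 黃兆立 (Huang, Chao-Li)
Degree: Master
Department: College of Bioscience and Biotechnology, Institute of Tropical Plant Sciences
Year of publication: 2023
Academic year of graduation: 111 (ROC calendar, 2022-2023)
Language: English
Number of pages: 65
Chinese keywords: 真菌 (fungi), 營養 (trophic), 機器學習 (machine learning)
English keywords: Fungi, Trophic, Machine Learning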
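  • As genome sequencing has become faster and cheaper, genomics research and sequencing efforts have kept growing in scale, and applied studies have multiplied accordingly. The EnsemblFungi genome database currently provides genome sequences, annotations, variation data, and transcriptome data for a wide range of fungal strains. In addition, FUNGuild, an open database of fungal functional annotation, labels many genomes with biological functions (for example pathogenic, saprotrophic, and symbiotic), yet many sequenced genomes still have no ecological role assigned. This problem led us to develop machine learning models that predict the ecological function of unannotated organisms from their genomic composition. We used the Pfam protein database together with the EnsemblFungi and FUNGuild databases to identify the protein domains associated with each biological function, extracted features from combinations of multiple protein functional domains, introduced a jigsaw puzzle method to enlarge the input space, and refined feature selection algorithms based on information theory and statistical tests so that training data were retained efficiently, and on this basis built an ensemble learning prediction framework. We evaluated the predictive performance of each algorithm. For most algorithms (Random Forest and Extremely Randomized Trees in particular), the Matthews correlation coefficient (MCC) was significantly higher than random expectation, which indicates that the jigsaw puzzle method and heuristic feature selection were applicable; taken together, the evaluation metrics also show that ensemble learning outperformed all other algorithms. Validation on metagenome assemblies showed that, compared with FUNGuild's taxonomy-based inference, the machine learning model could predict the trophic mode of more unknown fungi. The analysis also shows that some fungal orders may have diverse ecological roles, which underscores the importance of inferring function from genomic information.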

    Advances in genome sequencing technology have produced vast amounts of genomic data for fungi, and the EnsemblFungi database facilitates comparative analysis of these data. Meanwhile, the FUNGuild database assigns trophic modes (pathotrophic, saprotrophic, and symbiotrophic) to known taxa to help researchers understand the ecological roles of different fungi. However, many fungi still lack sufficient data for their ecological roles to be determined. This study aimed to develop a machine learning model that predicts the ecological roles of unannotated fungi from their genomic sequences. To this end, the Pfam protein database was used to predict protein domains, which were then linked to the trophic functions assigned by the FUNGuild database. Features associated with saprotrophs were extracted from different combinations of protein domains. A jigsaw puzzle methodology was employed to enlarge the instance space, and a heuristic ensemble feature selection method based on statistics and information theory was developed. The Random Forest and Extremely Randomized Trees algorithms both achieved Matthews correlation coefficient (MCC) scores significantly higher than random expectation, indicating that data augmentation and heuristic feature selection were effective. The Ensemble Model outperformed all other algorithms in terms of MCC. Furthermore, evaluation on metagenome-assembled genomes showed that the machine learning model outperformed FUNGuild in predicting the trophic mode of fungi from different environments. The findings suggest that certain fungal orders may exhibit diverse ecological roles.
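    As an illustration of the evaluation metric named in the abstract, the sketch below computes a binary Matthews correlation coefficient from a confusion matrix, where 0 corresponds to random expectation and 1 to perfect agreement. It is a minimal sketch, not the thesis's code: the function name `mcc_binary` and the example labels are hypothetical, and the thesis may have used a library implementation instead.

```python
import math


def mcc_binary(y_true, y_pred):
    """Binary Matthews correlation coefficient for 0/1 labels.

    Hypothetical helper for illustration; not taken from the thesis.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a row or column of the
    # confusion matrix is empty and the ratio is undefined.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Made-up labels for illustration: 1 = saprotroph, 0 = not saprotroph.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
score = mcc_binary(y_true, y_pred)
print(f"MCC = {score:.2f}")
```

    Unlike accuracy, MCC uses all four cells of the confusion matrix, so a classifier that always predicts the majority trophic mode scores 0 rather than appearing to perform well on imbalanced data.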

    INTRODUCTION 1
        Functional Taxonomy Database of Fungi 3
        Next-generation sequencing technology and fungal genome progress 5
        Relative machine learning tools of fungi 7
        Research motivation and purpose 9
    MATERIAL AND METHOD 11
        Data Preprocessing 11
        Feature extraction 11
        Feature Selection 12
        Feature Selection - Singleton Removal Method 14
        Feature Selection - Permutation and Discrimination Method 14
        Feature Selection - Informative Method 15
        Feature Selection - Inner Independence method 15
        Feature Selection - Ensemble Feature Selection 15
        Ensemble Model 16
    RESULTS 17
    DISCUSSION 22
        Protein Domain Recombination for Trophic Mode Prediction: Challenges and Opportunities 22
        Exploring the Role of Fungal Orders in Ecosystems 23
    CONCLUSION 27
    REFERENCE 28
    Fig. 1 Flowchart of the Machine Learning Pipeline. 35
    Fig. 2 Venn plot diagrams comparing agreement between FUNGuild database and EnsemblFungi database. 36
    Fig. 3 Number of trophic modes in integrated data. 37
    Fig. 4 Relationship between protein domain frequency and trophic mode and complexity of dimension. 38
    Fig. 5 Data Augmentation. 39
    Fig. 6 Concept dataset corresponding to feature selection. 39
    Fig. 7 Ensemble Feature Selection 40
    Fig. 8 Flowchart of Ensemble Machine Learning. 41
    Fig. 9 Model performance evaluated by F-measurement (F1) 42
    Fig. 10 Model performance evaluated by Matthews correlation coefficient (MCC). 43
    Fig. 11 Principal Component Analysis plot showed clustering of trophic mode in existence or not. 45
    Fig. 12 The ratio of consistency and disagreement in different trophic mode amongst different orders. 46
    Table 1. Contingency table about feature. 47
    Table 2. Hyperparameter of machine learning algorithms. 48
    Table 3. Prediction performance evaluated by F1 and MCC. 50
    Table 4. P value of Ensemble Model performance compared to single algorithms by Wilcoxon signed-rank test 52
    Table 5. A comprehensive evaluation of the machine learning performance 53
    Table 6. Pearson and Spearman correlation coefficients along with the corresponding p-values for different evaluation metrics. 55
    Table 7. P value of Parametric and Non-parametric Multiple Comparison Test amongst different distance in trophic mode. 56
    Table 8. Number of consistency and disagreement on different orders of fungi. 57

    Full-text availability: on campus, public from 2025-04-30; off campus, public from 2025-04-30