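| Graduate student: | 巫子軒 Wu, Tzu-Hsuan |
|---|---|
| Thesis title: | 基於蛋白質結構能量的機器學習模型預測單胺基酸變異之致病性 A Machine Learning Approach based on Structural Energies to Predict the Functional Consequence of Missense Variants |
| Advisor: | 謝孫源 Hsieh, Sun-Yuan |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Institute of Medical Informatics |
| Year of publication: | 2020 |
| Academic year of graduation: | 108 |
| Language: | English |
| Number of pages: | 53 |
| Chinese keywords: | 機器學習 (machine learning), 致病性預測 (pathogenicity prediction), 蛋白質結構能量 (protein structure energy), 單胺基酸變異 (single amino acid variants), 單核甘酸多形性 (single nucleotide polymorphism) |
| English keywords: | Machine learning, pathogenicity prediction, protein structure energy, single amino acid variants, SNP |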
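Most current tools for predicting the pathogenicity of single amino acid variants (SAVs) are built on sequence information. SAVs may alter protein structure and function, yet few tools base their predictions on protein structural energies such as van der Waals forces or disulfide bonds. This thesis uses the Rosetta energy function to compute protein structural energies and combines them with machine learning to predict the pathogenicity of SAVs. The proposed model reaches an accuracy of 0.76 on the test data, outperforming six other prediction tools. Further analysis shows that the differences in reference energy, attractive energy, and solvation free energy between polar atoms play an important role in distinguishing benign from pathogenic variants; these features reflect the physicochemical properties of amino acids and can only be observed in the 3D structure. Finally, replacing the original structural features of the Rhapsody tool with the energy features proposed here improves Rhapsody's performance, which indicates that these energy features represent the pathogenicity of SAVs more suitably and in more detail.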
The most popular tools for predicting the pathogenicity of single amino acid variants (SAVs) are sequence-based. SAVs may change protein structure and function; however, no method directly predicts the impact of mutations on the energies of the protein structure, such as van der Waals forces and disulfide bridges. In this study, we combined machine learning methods with energy scores of protein structures calculated by the Rosetta energy function to predict SAV pathogenicity. Our model achieved an accuracy of 0.76, better than six other prediction tools.
Further analyses revealed that the differences in reference energies, attractive energies, and solvation of polar atoms between wild-type and mutant side chains played essential roles in distinguishing benign from pathogenic variants. These features reflect the physicochemical properties of amino acids, which can be observed in 3D structures but not in sequences. Finally, we improved the performance of Rhapsody, the prediction tool whose dataset we used, by appending sixteen features from our method. The results indicate that these energy scores represent the pathogenicity of SAVs more appropriately and in more detail.
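The idea of energy-difference features can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `energy_deltas`, the choice of five terms, the mutant-minus-wild-type sign convention, and all numeric values are assumptions made for illustration. The keys are score-term names from the Rosetta REF2015 energy function (`ref` for reference energy, `fa_atr` for attraction, `lk_ball_wtd` for polar solvation, `dslf_fa13` for disulfide geometry); in practice the per-term scores would come from Rosetta rather than being typed in by hand.

```python
from typing import Dict, List

# REF2015 score-term names, used here only as feature labels (illustrative subset).
ENERGY_TERMS: List[str] = ["ref", "fa_atr", "fa_rep", "lk_ball_wtd", "dslf_fa13"]


def energy_deltas(wild_type: Dict[str, float],
                  mutant: Dict[str, float],
                  terms: List[str] = ENERGY_TERMS) -> List[float]:
    """Return one feature per energy term: mutant score minus wild-type score."""
    return [mutant[term] - wild_type[term] for term in terms]


# Placeholder scores (NOT real Rosetta output), chosen only to show the arithmetic.
wt_scores = {"ref": 1.5, "fa_atr": -8.0, "fa_rep": 2.0,
             "lk_ball_wtd": -0.5, "dslf_fa13": 0.0}
mut_scores = {"ref": 2.5, "fa_atr": -6.0, "fa_rep": 2.25,
              "lk_ball_wtd": 0.25, "dslf_fa13": 0.0}

features = energy_deltas(wt_scores, mut_scores)
print(dict(zip(ENERGY_TERMS, features)))
```

Each variant then contributes one such feature vector, which could be passed with its benign or pathogenic label to a classifier of the reader's choice.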