| Graduate student: | 林佳明 Lin, Chia-Ming |
|---|---|
| Thesis title: | 網絡推論Python 套件和蛋白質功能預測資料庫開發 Construction of a Python package for network inference and a database for prediction of protein function |
| Advisor: | 吳馬丁 Torbjörn Nordling |
| Degree: | Master |
| Department: | 工學院 - 機械工程學系 College of Engineering, Department of Mechanical Engineering |
| Year of publication: | 2019 |
| Academic year of graduation: | 107 (ROC calendar) |
| Language: | English |
| Number of pages: | 82 |
| Keywords (Chinese): | 網絡推論 、蛋白質功能 、基因本體 、基因調控系統 、Python 、GeneSPIDER 、ProtFunAI 、JATNIpy 、UniProt 、Swiss-Prot 、FFANEprot |
| Keywords (English): | network inference, protein function, gene ontology, gene regulatory network, Python, GeneSPIDER, ProtFunAI, JATNIpy, UniProt, Swiss-Prot, FFANEprot |
Software tools play an important role in bioinformatics by helping people store, retrieve, manipulate, and analyze biological data. In this thesis, we present two tools: JATNIpy and ProtFunAI. JATNIpy is a Python package for network inference that implements the same functionality as the Matlab package GeneSPIDER by Tjärnberg et al. (2017). ProtFunAI is a web service for predicting protein function from sequence alone, based on the convolutional neural network FFANEprot by Liou et al. (2018).
Network Inference
Introduction: Network inference (NI) is used to identify relationships between variables based on observations of changes following perturbations. Many different NI methods have been developed over the past two decades and, in particular, applied to infer gene regulatory networks. What accuracy to expect from a method on a given dataset remains an open question. Tjärnberg et al. (2017) created the Matlab package GeneSPIDER for tuning, running, and evaluating inference algorithms on observed and synthetic data. GeneSPIDER also contains methods for generating synthetic networks and data with desired properties that enable benchmarking of inference algorithms.
Objective: We aim to make network inference methods available to a broader group of researchers by developing a Python package called JATNIpy, starting with a reimplementation of the functionality of GeneSPIDER. Python is a popular open-source language widely used for machine learning and bioinformatics.
Method: JATNIpy, like GeneSPIDER, assumes that the system can be described as a linear mapping by the network from perturbations to responses, which results in an errors-in-variables problem. The algorithms follow common Python conventions so that the interface is familiar to Scikit-learn users.
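The linear-mapping assumption can be illustrated with a small NumPy sketch. It assumes the steady-state model Y = -A^{-1} P, where A is the network matrix, P holds the perturbations, and Y holds the responses; this form is an assumption taken from the GeneSPIDER literature rather than stated in this abstract. The function names `simulate_responses` and `least_squares_network` and the three-node network are hypothetical and are not part of the JATNIpy API. Plain least squares is used only as the simplest possible estimator; it ignores noise in P and therefore does not address the errors-in-variables aspect.

```python
import numpy as np


def simulate_responses(A, P, noise_std=0.0, rng=None):
    """Steady-state responses Y = -A^{-1} P, plus optional Gaussian measurement noise."""
    rng = np.random.default_rng() if rng is None else rng
    Y = -np.linalg.solve(A, P)  # solves A X = P, so Y = -X
    return Y + noise_std * rng.standard_normal(Y.shape)


def least_squares_network(Y, P):
    """Least-squares estimate of A from A Y = -P, i.e. A = -P Y^+ (pseudoinverse)."""
    return -P @ np.linalg.pinv(Y)


# Hypothetical 3-node network (rows: regulated node, columns: regulator).
A_true = np.array([[-1.0, 0.5, 0.0],
                   [0.0, -1.0, 0.3],
                   [0.4, 0.0, -1.0]])
P = np.eye(3)  # one experiment per node, each perturbing a single node

Y = simulate_responses(A_true, P)      # noise-free responses
A_est = least_squares_network(Y, P)    # recovers A_true up to rounding
```

In the noise-free case the estimate matches the true network up to floating-point error. Setting `noise_std` above zero makes the estimate deviate from `A_true`, which is the situation that motivates the regularised and benchmarked inference methods described above.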
Results: We reimplemented the functionality of GeneSPIDER in JATNIpy and demonstrate it by reproducing the results of examples previously used to demonstrate GeneSPIDER. Currently, only four NI algorithms are implemented in JATNIpy. The source code is available at jatnipy.nordlinglab.org and the package has been added to PyPI.
Prediction of Protein Function
Introduction: Knowledge about the function of a protein is essential for understanding its role in both healthy and pathological conditions. Various computational methods have been applied to the challenging problem of predicting protein function from sequence alone, and a handful of these are currently available as web services. Recently, Liou et al. (2018) created FFANEprot, a deep convolutional neural network trained on a dataset of 81,267 proteins and 1,169 Gene Ontology (GO) terms of molecular function (MF) from the Swiss-Prot database. This AI model for prediction of protein function from sequence alone achieved training and test Matthews correlation coefficients (accuracies) of 0.52 (98.84%) and 0.49 (98.67%), respectively.
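A small worked example may help explain why a Matthews correlation coefficient near 0.5 can coexist with an accuracy near 99%. The sketch below computes both metrics from confusion-matrix counts. The counts are hypothetical, chosen only to mimic a setting where few protein-term pairs are positive; they are not FFANEprot results, and the function names are illustrative.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts: 10,000 protein-term pairs, of which 100 are true positives
# in the ground truth. Half of them are found, with 50 false alarms.
tp, fn = 50, 50
fp, tn = 50, 9850

acc = accuracy(tp, tn, fp, fn)           # 0.99
mcc = matthews_corrcoef(tp, tn, fp, fn)  # about 0.495
```

Because true negatives dominate the counts, accuracy stays high even though only half of the positives are recovered, whereas the MCC reflects performance on both classes.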
Objective: To enable biologists and medical researchers to use FFANEprot without having to know programming, we created the ProtFunAI web service for prediction of protein function.
Method: ProtFunAI consists of a frontend and a backend. The frontend provides the graphical user interface for entering the query and visualizing the result in the user's web browser. All data are retrieved from our database through our application programming interface (API). The backend provides data storage in a PostgreSQL database and predicts protein function using FFANEprot.
Results: The ProtFunAI web service consists of a database of MF predictions for 20,405 reviewed human proteins and a prediction service that can predict the MF of any supplied protein sequence within roughly a minute. All predictions are made by FFANEprot. Our user interface also shows the MF of each protein from UniProt (www.uniprot.org), with links to additional information available through a single click. To the best of our knowledge, ProtFunAI has the highest accuracy among the methods that can predict multiple MF terms per protein from sequence alone. By contrast, some machine learning methods can predict only one function or function category at a time. The ProtFunAI web service is available at http://protfunai.nordlinglab.org/.
Conclusions: We have implemented two new tools that we hope will contribute to medical discoveries and treatments.
References
Aibar, S., González-Blas, C. B., Moerman, T., Imrichova, H., Hulselmans, G., Rambow, F., Marine, J.-C., Geurts, P., Aerts, J., van den Oord, J., et al. (2017). Scenic: single-cell regulatory network inference and clustering. Nature methods, 14(11):1083.
Apweiler, R., Biswas, M., Fleischmann, W., Kanapin, A., Karavidopoulou, Y., Kersey, P., Kriventseva, E. V., Mittard, V., Mulder, N., Phan, I., et al. (2001). Proteome analysis database: online application of interpro and clustr for the functional classification of proteins in whole genomes. Nucleic acids research, 29(1):44–48.
Attwood, T. K., Bradley, P., Flower, D. R., Gaulton, A., Maudling, N., Mitchell, A. L., Moulton, G., Nordle, A., Paine, K., Taylor, P., et al. (2003). Prints and its automatic supplement, preprints. Nucleic acids research, 31(1):400–402.
Bairoch, A., Apweiler, R., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., et al. (2005). The Universal Protein Resource (UniProt). Nucleic acids research, 33(suppl_1):D154–D159.
Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E. L., et al. (2004). The pfam protein families database. Nucleic acids research, 32(suppl_1):D138–D141.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., and Bourne, P. E. (2000). The Protein Data Bank. Nucleic acids research, 28(1):235–242.
Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.-C., Estreicher, A., Gasteiger, E., Martin, M. J., Michoud, K., O’Donovan, C., Phan, I., et al. (2003). The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic acids research, 31(1):365–370.
Bru, C., Courcelle, E., Carrère, S., Beausse, Y., Dalmar, S., and Kahn, D. (2005). The prodom database of protein domain families: more emphasis on 3d. Nucleic acids research, 33(suppl_1):D212–D215.
Cock, P. J., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., et al. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422–1423.
Delgado, F. M. and Gómez-Vela, F. (2018). Computational methods for gene regulatory networks reconstruction and analysis: A review. Artificial intelligence in medicine.
Diniz, W. and Canduri, F. (2017). Bioinformatics: an overview and its applications. Genet. Mol. Res, 16:1–21.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1.
Grant, M. A. (2011). Integrating computational protein function prediction into drug discovery initiatives. Drug development research, 72(1):4–16.
Hawkins, T., Chitale, M., Luban, S., and Kihara, D. (2009). Pfp: Automated prediction of gene ontology functional annotations with confidence scores using protein sequence data. Proteins: Structure, Function, and Bioinformatics, 74(3):566–582.
Huang, L., Liao, L., and Wu, C. H. (2016). Inference of protein-protein interaction networks from multiple heterogeneous data. EURASIP Journal on Bioinformatics and Systems Biology, 2016(1):8.
Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L., De Castro, E., Langendijk-Genevaux, P. S., Pagni, M., and Sigrist, C. J. (2006). The prosite database. Nucleic acids research, 34(suppl_1):D227–D230.
Hunter, S., Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty, L., Duquenne, L., et al. (2008). Interpro: the integrative protein signature database. Nucleic acids research, 37(suppl_1):D211–D215.
Jensen, L. J., Gupta, R., Staerfeldt, H.-H., and Brunak, S. (2003). Prediction of human protein function according to gene ontology categories. Bioinformatics, 19(5):635–642.
Jiang, R., Zhang, X., and Zhang, M. Q. (2016). Basics of Bioinformatics. Springer.
Klopfenstein, D., Zhang, L., Pedersen, B. S., Ramírez, F., Vesztrocy, A. W., Naldi, A., Mungall, C. J., Yunes, J. M., Botvinnik, O., Weigel, M., et al. (2018). Goatools: A python library for gene ontology analyses. Scientific reports, 8(1):10872.
Lecca, P. and Priami, C. (2013). Biological network inference for drug discovery. Drug discovery today, 18(5-6):256–264.
Liou, Y.-F., Tsai, P.-J., Huang, Z.-Y., Chiou, P.-C., Chu, H.-W., Ciou, L.-P., and Nordling, T. E. M. (2018). FFANEprot: Predicting Protein Functions using a Weight-sharing Multitask Neural Network Optimized by a Firefly Algorithm with Natural Enemy Strategy. In 17th International Conference on Bioinformatics (INCoB-2018), New Delhi, India. Asia Pacific Bioinformatics Network (APBioNet).
Lobley, A. E., Nugent, T., Orengo, C. A., and Jones, D. T. (2008). Ffpred: an integrated feature-based function prediction server for vertebrate proteomes. Nucleic acids research, 36(suppl_2):W297–W302.
Luscombe, N. M., Greenbaum, D., and Gerstein, M. (2001). What is bioinformatics? A proposed definition and overview of the field. Methods of information in medicine, 40(04):346–358.
Martin, D. M., Berriman, M., and Barton, G. J. (2004). Gotcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC bioinformatics, 5(1):178.
Masson, P., Pedruzzi, I., Pfeiffenberger, E., Porras, P., Raghunath, A., Roechert, B., Orchard, S., and Hermjakob, H. (2012). The intact molecular interaction database. Nucleic Acids Res, 40.
MATLAB (2014). 8.3.0.532 (r2014a).
Matsumoto, H., Kiryu, H., Furusawa, C., Ko, M. S., Ko, S. B., Gouda, N., Hayashi, T., and Nikaido, I. (2017). Scode: an efficient regulatory network inference algorithm from single-cell rna-seq during differentiation. Bioinformatics, 33(15):2314–2321.
Meyer, P. E., Kontos, K., Lafitte, F., and Bontempi, G. (2007). Information-theoretic inference of large transcriptional regulatory networks. EURASIP journal on bioinformatics and systems biology, 2007:8–8.
Mizuguchi, K., Deane, C. M., Blundell, T. L., and Overington, J. P. (1998). HOMSTRAD: a database of protein structure alignments for homologous families. Protein science, 7(11):2469–2471.
Montes, R. A. C., Coello, G., González-Aguilera, K. L., Marsch-Martínez, N., de Folter, S., and Alvarez-Buylla, E. R. (2014). ARACNE-based inference, using curated microarray data, of Arabidopsis thaliana root transcriptional regulatory networks. BMC plant biology, 14(1):97.
Nordling, T. E. and Jacobsen, E. (2013). Robust inference of gene regulatory networks. PhD thesis, KTH School of Electrical Engineering, Automatic Control Lab.
Paroni, A., Graudenzi, A., Caravagna, G., Damiani, C., Mauri, G., and Antoniotti, M. (2016). Cabernet: a cytoscape app for augmented boolean models of gene regulatory networks. BMC bioinformatics, 17(1):64.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct):2825–2830.
Pineda, A. L. and Gopalakrishnan, V. (2015). Novel application of junction trees to the interpretation of epigenetic differences among lung cancer subtypes. AMIA Summits on Translational Science Proceedings, 2015:31.
Powers, D. M. (2011). Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation.
Roy, A., Kucukural, A., and Zhang, Y. (2010). I-tasser: a unified platform for automated protein structure and function prediction. Nature protocols, 5(4):725.
Sahraeian, S. M., Luo, K. R., and Brenner, S. E. (2015). Sifter search: a web server for accurate phylogeny-based protein function prediction. Nucleic acids research, 43(W1):W141–W147.
Shahdoust, M., Pezeshk, H., Mahjub, H., and Sadeghi, M. (2017). F-map: A bayesian approach to infer the gene regulatory network using external hints. PloS one, 12(9):e0184795.
Shen, H.-B. and Chou, K.-C. (2007). Ezypred: a top–down approach for predicting enzyme functional classes and subclasses. Biochemical and biophysical research communications, 364(1):53–59.
Tjärnberg, A., Morgan, D. C., Studham, M., Nordling, T. E., and Sonnhammer, E. L. (2017). GeneSPIDER – gene regulatory network inference benchmarking with controlled network and data properties. Molecular BioSystems, 13(7):1304–1312.
Tjärnberg, A., Nordling, T. E., Studham, M., Nelander, S., and Sonnhammer, E. L. (2015). Avoiding pitfalls in L1-regularised inference of gene networks. Molecular BioSystems, 11(1):287–296.
Tjärnberg, A., Nordling, T. E., Studham, M., and Sonnhammer, E. L. (2013). Optimal sparsity criteria for network inference. Journal of Computational Biology, 20(5):398–408.
Wang, Z., Zhao, C., Wang, Y., Sun, Z., and Wang, N. (2018). Panda: Protein function prediction using domain architecture and affinity propagation. Scientific reports, 8(1):3484.
Wass, M. N., Barton, G., and Sternberg, M. J. (2012). Combfunc: predicting protein function using heterogeneous data sources. Nucleic acids research, 40(W1):W466–W470.
Wass, M. N. and Sternberg, M. J. (2008). Confunc—functional annotation in the twilight zone. Bioinformatics, 24(6):798–806.
Wikipedia contributors (2019). Confusion matrix — Wikipedia, the free encyclopedia. [Online; accessed 24-May-2019].
Wu, C. H., Yeh, L.-S. L., Huang, H., Arminski, L., Castro-Alvear, J., Chen, Y., Hu, Z., Kourtesis, P., Ledley, R. S., Suzek, B. E., et al. (2003). The protein information resource. Nucleic acids research, 31(1):345–347.
Xiong, J. (2006). Essential bioinformatics. Cambridge University Press.
Xu, Y. and Luo, X.-C. (2018). Pypathway: Python package for biological network analysis and visualization. Journal of Computational Biology, 25(5):499–504.
Yachdav, G., Kloppmann, E., Kajan, L., Hecht, M., Goldberg, T., Hamp, T., Hönigschmid, P., Schafferhans, A., Roos, M., Bernhofer, M., et al. (2014). Predictprotein—an open resource for online prediction of protein structural and functional features. Nucleic acids research, 42(W1):W337–W343.