Graduate student: | 方沛涵 Fang, Pei-Han |
---|---|
Thesis title: | 結合隱性特徵與明確特徵預測非激酶特異性磷酸化位點 Combining implicit features and explicit features to predict non-kinase-specific phosphorylation sites |
Advisor: | 張天豪 Chang, Tien-Hao |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
Year of publication: | 2018 |
Academic year of graduation: | 106 |
Language: | Chinese |
Number of pages: | 28 |
Keywords (Chinese): | 蛋白質磷酸化 、位置特異性得分矩陣 、XGBoost 、深度學習 |
Keywords (English): | protein phosphorylation, position-specific scoring matrix, XGBoost, deep learning |
Phosphorylation is one of the most important post-translational modifications in eukaryotes and plays a vital role in many cellular processes. Studying kinases and their substrates is important for understanding cellular signaling networks and can help in developing new treatments for diseases such as cancer. Because the relevant experiments are costly in time and labor, computational prediction of phosphorylation sites has become important, and many studies and tools have been developed over the years. They fall broadly into two categories: kinase-specific and non-kinase-specific. Kinase-specific prediction takes both a sequence and its kinase as input and then predicts whether each serine (S), threonine (T), and tyrosine (Y) in the sequence is a phosphorylation site; non-kinase-specific prediction needs only the sequence. With the development of sequencing technology, many sequences have no identified kinase, and some kinases have too few known substrate sequences for kinase-specific methods to work well, so non-kinase-specific prediction is becoming increasingly important.
In this study, we use two classes of features, explicit and implicit, to predict non-kinase-specific phosphorylation sites. The explicit features are Shannon entropy, relative entropy, predicted protein secondary structure, predicted protein disorder, solvent-accessible area, overlapping properties, average cumulative hydrophobicity, k-nearest neighbors (KNN), and the position-specific scoring matrix. The implicit features are generated by a convolutional neural network and a recurrent neural network. Both classes of features are then fed into XGBoost for prediction. On the test dataset, the method achieves AUC values of 0.8598 / 0.7547 / 0.6842 for S / T / Y phosphorylation sites, respectively, which we report as better than other current methods.
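As a rough illustration of two of the explicit features named in the abstract, the sketch below extracts a fixed-width window around each S/T/Y candidate and computes Shannon entropy and relative entropy for one column of aligned residues. It is not the thesis's implementation: the function names, the window half-width, the `-` padding character, and the choice to estimate the distribution from raw residue counts in a single alignment column are all assumptions made for this example.

```python
import math
from collections import Counter


def candidate_windows(sequence, half_width=3, targets="STY", pad="-"):
    """Yield (index, residue, window) for every S/T/Y in the sequence.

    The window is centred on the candidate residue; positions that fall
    outside the sequence are filled with `pad` (an assumed convention).
    """
    padded = pad * half_width + sequence + pad * half_width
    for i, residue in enumerate(sequence):
        if residue in targets:
            yield i, residue, padded[i:i + 2 * half_width + 1]


def shannon_entropy(column):
    """Shannon entropy, in bits, of the residue distribution in `column`."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def relative_entropy(column, background):
    """Relative entropy (KL divergence), in bits, of the residue distribution
    in `column` with respect to a `background` dict of residue frequencies."""
    counts = Counter(column)
    total = sum(counts.values())
    return sum(
        (n / total) * math.log2((n / total) / background[res])
        for res, n in counts.items()
    )


if __name__ == "__main__":
    for index, residue, window in candidate_windows("MASTKY", half_width=2):
        print(index, residue, window)
    # A toy alignment column and a toy background distribution.
    print(shannon_entropy("SSTT"))
    print(relative_entropy("SSTT", {"S": 0.25, "T": 0.25}))
```

In a full pipeline, per-site values like these would be concatenated with the other explicit features and the network-derived implicit features before being passed to the classifier; how the thesis does that is not specified in this record.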