Graduate student: | 黃璿宇 Huang, Hsuan-Yu |
---|---|
Thesis title: | 修改位置加權矩陣以提升蛋白質溶劑可接觸性之預測 Improving Prediction of Protein Solvent Accessibility with Modified Position Specific Scoring Matrix |
Advisor: | 張天豪 Chang, Tien-Hao (Darby) |
Degree: | Master |
Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
Year of publication: | 2008 |
Academic year of graduation: | 96 (ROC calendar) |
Language: | Chinese |
Pages: | 50 |
Keywords (Chinese): | 溶劑可接觸面積 (accessible surface area), 溶劑可接觸性 (solvent accessibility), 支援向量迴歸 (support vector regression) |
Keywords (English): | solvent accessibility, accessible surface area (ASA), support vector regression (SVR) |
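Predicting a protein's tertiary structure directly from its primary sequence remains a major challenge in the life sciences. Protein folding is driven mainly by the hydrophobic effect (solvent aversion) of the core residues, which are hydrophobic. Accurate prediction of residue solvent accessibility is therefore of great help in predicting tertiary structure. Traditionally, solvent accessibility prediction has been treated as a two-state ("exposed" or "buried") or three-state ("exposed", "intermediate", or "buried") classification problem. Real protein structures, however, have no such discrete solvent-contact states, so some recent studies have begun to use regression techniques to predict the accessible surface area (ASA) directly.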
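Most ASA prediction methods first encode protein residues as feature vectors and then analyze them with a general-purpose regression tool. Recently, the position specific scoring matrix (PSSM) has been shown to help ASA prediction and is widely used in residue encoding. Continuing earlier work, this thesis proposes an encoding method that improves on the PSSM in order to raise ASA prediction performance. The method generates new features by combining the PSSM values of similar residues. During feature generation, an iterative feature selection algorithm ensures that the combined residues have similar physicochemical properties and similar solvent-contact propensities.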
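In addition, we pair the proposed encoding method with support vector regression (SVR) to implement an ASA predictor and compare it with five existing ASA predictors in order to evaluate the encoding method. In the experiments, the proposed predictor achieved a mean absolute error (MAE) of 14.2% to 14.8%, better than the 14.9% to 19.0% MAE of the other predictors. These results indicate that the features produced by the encoding method help protein ASA prediction.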
Predicting protein tertiary structures directly from one-dimensional sequences remains a challenging problem in life science. The process of protein folding is driven by the solvent aversion of some of the residues. Therefore, prediction of protein solvent accessibility is an important step toward tertiary structure prediction. Traditionally, predicting solvent accessibility has been regarded as either a two-state ("exposed" or "buried") or three-state ("exposed", "intermediate", or "buried") classification problem. However, the states of solvent accessibility are not well defined in real protein structures. Thus, recent studies have started to predict the accessible surface area (ASA) directly, using various regression techniques.
Most ASA predictors encode residues into feature vectors, which are then passed to general regression tools for ASA prediction. Recently, the position specific scoring matrix (PSSM) has been shown to be helpful for ASA prediction and is widely used in the encoding process. In this study, we propose a systematic method to enhance the PSSM-based encoding scheme for ASA prediction. This method accumulates the PSSM values of similar residues to generate novel features. An iterative feature selection procedure is designed to ensure that the grouped residues have similar physicochemical properties and similar ASA propensities.
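The accumulation step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the column order `AMINO_ACIDS`, the function name `grouped_pssm_features`, and the two residue groups are assumptions chosen for illustration, and the iterative feature selection that decides which residues to group is not shown.

```python
# Assumed column order of the 20 PSSM scores (the order PSI-BLAST commonly prints).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def grouped_pssm_features(pssm, groups):
    """Append one summed PSSM score per residue group to every position.

    pssm   -- list of rows, one per sequence position, each with 20 scores
    groups -- iterable of strings; each string lists the residues in one group
    """
    column = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    features = []
    for row in pssm:
        if len(row) != len(AMINO_ACIDS):
            raise ValueError("each PSSM row must contain 20 scores")
        # One new feature per group: the sum of the scores of its residues.
        extra = [sum(row[column[aa]] for aa in group) for group in groups]
        features.append(list(row) + extra)
    return features


# Hypothetical groups: aliphatic hydrophobic (I, L, V) and acidic (D, E).
example_groups = ["ILV", "DE"]

# Toy two-position PSSM; real matrices come from a profile search.
toy_pssm = [[0] * 20, list(range(20))]
feats = grouped_pssm_features(toy_pssm, example_groups)
```

Each position keeps its original 20 scores and gains one extra feature per group, so a regression tool sees both the raw profile and the pooled evidence for each residue class.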
In addition, we combine the proposed encoding scheme with support vector regression (SVR) to construct an ASA predictor. The performance of our predictor is evaluated by comparison with five existing predictors. Experimental results show that the proposed predictor achieves a mean absolute error (MAE) of 14.2% to 14.8%, which is better than the 14.9% to 19.0% MAE of the other predictors. These results demonstrate that the features generated by the proposed encoding scheme are informative for protein ASA prediction.
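For reference, the MAE reported above is the average absolute difference between predicted and observed values. The sketch below assumes both are per-residue relative accessibilities expressed in percent, which the abstract does not state explicitly; the function name and the sample numbers are illustrative only.

```python
def mean_absolute_error(predicted, observed):
    """Return the mean of |predicted - observed| over all residues."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - o) for p, o in zip(predicted, observed))
    return total / len(predicted)


# Made-up relative accessibilities (percent) for four residues.
observed = [10.0, 55.0, 80.0, 0.0]
predicted = [20.0, 50.0, 70.0, 5.0]
mae = mean_absolute_error(predicted, observed)  # (10 + 5 + 10 + 5) / 4
```

Because every residue contributes its error linearly, MAE is less dominated by a few badly predicted residues than a squared-error measure would be.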