Author: Chen, Wei-Jhih (陳韋志)
Thesis title: Hidden Markov Model Based DNA-binding Proteins Prediction by Mining on Sequence and Structure Information
Advisor: Kao, Hung-Yu (高宏宇)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Medical Informatics
Year of publication: 2008
Academic year of graduation: 96
Language: English
Number of pages: 63
Keywords: Estrogen response elements, DNA-binding proteins, hidden Markov model, machine learning
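  • In the post-genomic era, protein domain structures are being published at a rapid pace. To understand how biological systems operate inside the cell, bioinformaticians have in recent years been eager to study the interactions between proteins and DNA sequences, yet these interactions are still not fully understood. Several machine learning based methods have been applied to related problems. So far, few studies have succeeded in converting the characteristics of protein tertiary structure into features that existing machine learning methods can use to predict DNA-binding proteins. This study proposes a machine learning scheme that combines sequence and structure information to predict DNA-binding proteins. The scheme uses hidden Markov models to capture the characteristics of a protein in both its amino acid sequence and its tertiary structure. It also takes into account other information useful for predicting DNA-binding proteins, such as amino acid composition, the composition of structural fragments, and the distribution of amino acids exposed on the protein surface. In this study, a support vector machine classifier is presented for predicting DNA-binding proteins; it reaches an overall accuracy of 88.45% in cross-validation. Furthermore, we developed six response element specific classifiers and applied them to predict DNA-binding proteins for specific response elements. In these experiments, the classifiers achieved an average precision of 96.57% and an average recall of 88.83%. Finally, we use this high-accuracy response element specific classifier to predict proteins in a breast cancer cell line that may bind to estrogen response elements.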

    In the post-genome period, protein domain structures have been published rapidly. To understand cell function, the mechanism of protein-DNA interaction has become an important subject in recent bioinformatics research, yet it has not been comprehensively studied. Several machine learning based methods have been applied to this problem. Until recently, few studies have succeeded in translating the tertiary structure characteristics of proteins into features suitable for a learning mechanism that predicts DNA-binding proteins. This work presents a novel machine learning approach that uses HMMs (hidden Markov models) to express the characteristics of DNA-binding proteins in terms of both amino acid sequence and tertiary structure. The proposed method also uses several other helpful features of DNA-binding proteins, such as residue composition, structure pattern composition, and the accessible surface area of residues. We develop an SVM (support vector machine) based classifier to predict general DNA-binding proteins and obtain an accuracy of 88.45% under 5-fold cross-validation. Furthermore, a response element specific classifier is constructed for predicting response element specific DNA-binding proteins; it achieves a precision of 96.57% and a recall of 88.83% on average. Finally, this high-accuracy classifier is employed to predict DNA-binding proteins from MCF-7 that are likely to bind to estrogen response elements.
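
The two kinds of features named in the abstract (composition statistics and HMM-based scores) can be illustrated with a small sketch. This is not the thesis implementation: the function names, the toy two-state HMM with uniform emissions, and the idea of appending one HMM log-likelihood to a 20-dimensional residue composition vector are all assumptions made for illustration. The sketch computes the residue composition of a sequence and scores the same sequence under a discrete HMM using the scaled forward algorithm.

```python
import math

# The 20 standard amino acids, in a fixed order that defines feature positions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    total = len(residues)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / total for a in AMINO_ACIDS]


def hmm_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of a symbol-index sequence under a discrete HMM.

    Uses the forward algorithm, rescaling the forward variables at every
    step so that long sequences do not underflow.
    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability that state s emits symbol o
    """
    if not obs:
        return 0.0
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            o = obs[t]
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


if __name__ == "__main__":
    seq = "MKRAACDE"  # hypothetical example sequence
    # Toy model (assumption): 2 states, uniform emissions over 20 residues.
    start = [0.5, 0.5]
    trans = [[0.9, 0.1], [0.2, 0.8]]
    emit = [[1.0 / 20] * 20, [1.0 / 20] * 20]
    obs = [AMINO_ACIDS.index(r) for r in seq if r in AMINO_ACIDS]
    features = residue_composition(seq) + [hmm_log_likelihood(obs, start, trans, emit)]
    print(len(features))  # 20 composition values plus 1 HMM score
```

A vector built this way could be handed to an SVM library for training and cross-validation. In practice the HMM parameters would be estimated from known DNA-binding proteins rather than fixed by hand as in this toy example.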

    CHINESE ABSTRACT
    ABSTRACT
    TABLE LISTING
    FIGURE LISTING
    1. INTRODUCTION
        1.1 MOTIVATION
        1.2 METHOD
    2. RELATED WORK
        2.1 DATA RESOURCE
            2.1.1 PDB
            2.1.2 Swiss-Prot
            2.1.3 DBD
            2.1.4 Pfam
            2.1.5 TRANSFAC
            2.1.6 SADB
        2.2 RELATED RESEARCH
    3. METHOD
        3.1 OVERVIEW
        3.2 SQ-HMM AND ST-HMM
        3.3 FEATURE SELECTION
            3.3.1 Featurestat
            3.3.2 FeatureHMM
        3.4 DATA ACCESS
            3.4.1 Training set
            3.4.2 SPDBP-set
            3.4.3 MCF-set
        3.5 ACCURACY MEASURE
    4. EXPERIMENTS
        4.1 RESIDUE COMPOSITION
        4.2 STRUCTURE ALPHABET COMPOSITION
        4.3 HYDROPHOBICITY
        4.4 RESIDUE AND ALPHABET PATTERN
        4.5 HMM
        4.6 GROUP OF HMM
        4.7 ROC CURVES
        4.8 COMPARE WITH PSSM METHOD
        4.9 ACTUALLY DBP DETECTED BY ST-HMM
        4.10 EVALUATE THE SIGNIFICANCE OF EACH TYPE OF FEATURES
        4.11 RESPONSE ELEMENT SPECIFIC DNA-BINDING PROTEINS PREDICTION
        4.12 PREDICT DNA-BINDING PROTEINS FROM MCF-7
    5. CONCLUSION AND FUTURE WORK
    6. APPENDIX
    7. REFERENCES

    Full text, on campus: publicly available from 2009-08-25
    Full text, off campus: publicly available from 2009-08-25