Graduate student: Hsu, Chen-yi (徐振議)
Thesis title: Protein Sequence Clustering Based on Phylogenetic Analysis (以演化分析為基礎之蛋白質超家族序列分群演算法之建立)
Advisor: Wang, Hei-chia (王惠嘉)
Degree: Master
Department: College of Management - Institute of Information Management
Year of publication: 2007
Academic year of graduation: 95
Language: Chinese
Number of pages: 44
Keywords: Sequence clustering, Phylogenetic tree, Protein family
  • With the rapid growth of bioinformatics, biologists routinely analyze protein sequences, for example to predict the function of unknown sequences or to annotate them. The most common preprocessing step for such work is sequence clustering: once sequences are clustered, those with the same function fall into the same cluster, and the characteristics of each cluster can support follow-up research. Most current methods cluster sequences by their pairwise similarity, but some researchers have argued that similarity alone is insufficient and introduces a certain degree of error. In addition, some biologists want to know not only the members of each cluster or the number of clusters, but also the relationships within and between clusters.

    This study focuses on sequence clustering for protein superfamilies, a problem that few researchers have addressed. It differs from general sequence clustering because a superfamily contains several subfamilies, and different subfamilies may have different characteristics. Clustering therefore needs to rest on evolutionary relationships, and a single tool or method cannot solve the problem completely. This study accordingly clusters the sequences using evolutionary concepts. First, the phylogeny package Phylip, which is widely used by biologists, builds the phylogenetic tree. The tree then passes through a series of steps: distance parsing, threshold selection, splitting, merging, and re-merging. Finally, the tree is transformed into a number of sub-trees, each of which represents one cluster. Because the clusters are derived from a phylogenetic tree, they carry more evolutionary meaning than clusters produced by methods based on sequence similarity alone, and the clustering results are better.
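    The splitting step named in the abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: the record does not describe how distances are parsed, how the threshold is chosen, or how sub-trees are merged and re-merged. The sketch assumes the tree is already available as a list of branches with lengths, uses a hypothetical fixed threshold, and simply removes every branch longer than that threshold; the function name `cut_tree` and the toy tree are illustrative only.

```python
from collections import defaultdict


def cut_tree(edges, leaves, threshold):
    """Split a tree into clusters by dropping branches longer than `threshold`.

    edges: iterable of (node_a, node_b, branch_length) tuples
    leaves: set of leaf names (the sequences)
    Returns a sorted list of sorted leaf-name lists, one per sub-tree.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving; unseen nodes start as their own root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, length in edges:
        root_a, root_b = find(a), find(b)
        if length <= threshold:
            # Short branch: both ends stay in the same sub-tree.
            parent[root_a] = root_b

    groups = defaultdict(list)
    for leaf in leaves:
        groups[find(leaf)].append(leaf)
    return sorted(sorted(group) for group in groups.values())


# Toy tree (hypothetical data): two tight pairs joined by long internal branches.
EDGES = [
    ("root", "n1", 0.9), ("root", "n2", 0.8),
    ("n1", "A", 0.1), ("n1", "B", 0.2),
    ("n2", "C", 0.1), ("n2", "D", 0.15),
]
LEAVES = {"A", "B", "C", "D"}

print(cut_tree(EDGES, LEAVES, 0.5))  # [['A', 'B'], ['C', 'D']]
```

    Raising the threshold above the longest branch keeps the tree in one piece, and lowering it below the shortest branch isolates every sequence, which is why the choice of threshold is treated as a separate step in the abstract.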

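    1. Introduction
    1.1. Research Background
    1.2. Research Motivation
    1.3. Research Objectives
    1.4. Thesis Outline
    2. Literature Review
    2.1. Computing Distance and Similarity
    2.1.1. General Distance Measures
    2.1.2. Sequence Similarity and Evolutionary Distance
    2.2. Phylogenetic Tree Construction
    2.2.1. Introduction to Phylogenetic Trees
    2.2.2. Tree Construction Methods
    2.2.2.1. Distance Methods
    2.2.2.2. Maximum Likelihood
    2.2.2.3. Maximum Parsimony
    2.3. Similarity-Based Sequence Clustering Algorithms
    2.3.1. Graph-Based Clustering Algorithms
    2.3.2. Partitioning Clustering Algorithms
    2.3.3. Hierarchical Clustering Algorithms
    2.3.4. Model-Based Clustering Algorithms
    2.3.5. Density-Based and Grid-Based Clustering Algorithms
    3. Research Method
    3.1. Cut-Point Search Module
    3.1.1. Original Sequence Conversion Method
    3.1.2. Similar Sequence Extraction and Conversion Method
    3.2. Construction and Pruning Module
    3.2.1. Phylogenetic Tree Construction
    3.2.2. Phylogenetic Tree Splitting and Adjustment
    3.3. Family Building Module
    3.3.1. Computing Cluster Similarity
    3.3.2. Building Protein Families
    4. Implementation and Validation
    4.1. System Construction
    4.1.1. Preprocessing
    4.2. Experimental Method
    4.2.1. Data Sources and Comparison Baselines
    4.2.2. Evaluation Metrics and Experimental Method
    4.3. Experimental Results and Analysis
    4.3.1. Results of the Proposed Method
    4.3.2. Results of Blastclust
    4.3.3. Results of BAG
    4.3.4. Analysis of the Experiments
    5. Conclusions and Future Work
    5.1. Results and Contributions
    5.2. Future Work
    References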

    Full text available on campus from 2010-06-29
    Full text available off campus from 2010-06-29