簡易檢索 / 詳目顯示

研究生: 曾朝鴻
Tseng, Chao-Hung
論文名稱: 結合比對過濾機制之蛋白質序列分群方法
指導教授: 王惠嘉
Wang, Hei-Chia
學位類別: 碩士
Master
系所名稱: 管理學院 - 資訊管理研究所
Institute of Information Management
論文出版年: 2004
畢業學年度: 92
語文別: 中文
論文頁數: 67
中文關鍵詞: 序列分群、multi-domain蛋白質、比對過濾、suffix array
外文關鍵詞: sequence clustering、multi-domain protein、align
相關次數: 點閱:54下載:1
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  •   分群技術的使用可以將相似的物件群集在一起,利用同一個群集內物件具有相似屬性的特性,可推得物件間尚未完全發現的特徵。近年來生物科技的進步,使得蛋白質資料庫的資料量呈現爆炸性的成長,將分群技術運用在蛋白質序列上,使相似程度高的蛋白質序列群集在一起,來協助生物學家來推論未知蛋白質序列的功能性和結構性,是現在重要的研究議題之一。
      但現行分群技術應用於生物序列時,仍有些許問題存在,第一個是一般分群技術的問題,許多分群方法大部分的時間都是秏費在序列間的兩兩相似比對上,當序列數量很龐大時,秏費在序列比對上的時間將會非常的可觀。為了能夠有效的改進序列分群的效率,本研究在分群方法的前置步驟提出以suffix array為基礎的序列比對過濾機制,來過濾掉一些不必要的序列比對,使得此分群方法能夠在不影響分群品質下,節省序列分群的時間。
    另一個問題是蛋白質序列才會引發的問題,蛋白質序列通常是利用遞移性 (transitivity)的關係來分群,適當的利用遞移性關係,可以有效的找出存在於模糊區域 (twilight zone)的同源蛋白質,但因為有些蛋白質序列具有multi-domain的特性,因此這種遞移關係的使用,會造成不同群集間的不當連結,而造成錯誤的分群結果。為了要解決這個問題,本研究參考幾位學者的研究,提出了一個能夠適用於有multi-domain的蛋白質序列分群方法,利用Single-Linkage的方式先將一般蛋白質序列分群,再藉由多功能領域蛋白質序列對應表 (multi-domain sequence mapping table)的內容, 來將具有multi-domain的蛋白質序列分入到適當的群集中。相較於其它的方法,本研究所提出的方法能夠更有效率、更精準的將蛋白質序列分群。

      In recent years, the advancement of biotechnology makes the enormous growth of public sequence databases. Applying the clustering technology to protein sequences, we can group protein sequences according to similarity information, and this will help biologists to predict the function and structure of unknown proteins.
      However, when clustering technology is applied to biologic sequences, there are still some problems existing. First, most contemporary sequence clustering methods are based on some kind of all-against-all comparison, resulting in a quadratic time complexity. When the amount of sequence is huge, the time spent on all-against-all alignment will be considerable. Thus, in order to promote the sequence clustering efficiency, we propose a alignment filtering mechanism based on suffix array to filter out unnecessary sequence alignments and reduce the needed clustering time.
      Another problem occurs when clustering protein sequences. We often use the concept of transitivity to identify remote homologues in twilight zone. However, some proteins are multi-domain proteins, and the usage of transitivity will create artificial links between unrelated proteins through intermediates that contain several domains. Thus, the multi-domain proteins may constitute a problem in the use of transitivity and cause some clustering errors. In order to solve the problem, we propose a novel multi-domain protein clustering method. It uses Single-Linkage to get initial clusters and adds multi-domain proteins into more than one cluster according to the multi-domain sequence mapping table. Comparing with other methods, our method can cluster protein sequences more efficiently and preciously.

    英文摘要 I 中文摘要 II 誌謝 III 目錄 IV 圖目錄 VI 表目錄 VII 第一章 緒論 1 第一節 研究背景 1 第二節 研究動機 2 第三節 研究目的 4 第四節 論文大綱 6 第二章 文獻探討 7 第一節 資料分群 7 2.1.1 距離計算 7 2.1.2 各種分群方法 8 2.1.2.1 分割式分群演算法 8 2.1.2.2 階層式分群演算法 8 2.1.2.3 密度式分群演算法 11 2.1.2.4 網格式分群演算法 11 2.1.2.5 模型式分群演算法 11 第二節 生物序列資料的分群 12 2.2.1 序列距離函式 12 2.2.2 序列分群方法 12 第三節 蛋白質序列分群 14 2.3.1 蛋白質序列特性 14 2.3.2 利用遞移分群 14 2.3.3 各種multi-domain蛋白質序列的分群方法 15 第四節 資訊過濾 18 2.4.1 序列過濾 18 2.4.2 比對過濾 18 第五節 Suffix tree 介紹 19 2.3.1 Suffix tree 歷史 20 2.3.2 Suffix tree在生物資訊上的應用 20 第三章 研究方法 22 第一節 現存蛋白質序列分群方法的問題 22 第二節 分群方法流程 26 第三節 分群前的前置處理 27 3.3.1 序列過濾 27 3.3.2 比對過濾 28 第四節 序列相似分數計算 30 3.4.1 對稱性相似分數 31 3.4.2 非對稱性相似分數 31 第五節 蛋白質序列分群 33 3.5.1 相似矩陣 33 3.5.2 相似矩陣對稱化 33 3.5.3 找出長度較短之multi-domain序列 35 3.5.4 以Single-Linkage 方法分群 36 3.5.5 Multi-domain蛋白質序列的處理 37 第六節 產出分群的結果 38 第七節 小結 38 3.7.1 實例 38 第四章 實作驗證 43 第一節 系統建構及資料來源 43 4.1.1 系統架構 43 4.1.2 實作環境 43 4.1.3 使用套件、模組 43 4.1.4 參數設定 44 4.1.5 資料來源 44 第二節 實驗方法與比較項目 45 4.2.1 實驗方法 45 4.2.2 評估項目 46 第三節 實驗結果分析 47 4.3.1 儲存空間分析 47 4.3.2 比對過濾效能分析 49 4.3.3 正確性分析 52 4.3.4 與其它方法比較結果分析 56 第四節 討論 57 第五章 結論及未來研究方向 60 第一節 研究成果 60 第二節 未來研究方向 61 參考文獻 63

    Abascal, F., & Valencia, A. (2002). Clustering of proximal sequence space for the identification of protein families. Bioinformatics, 18, 908-921.
    Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (1998). Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. Proceedings of the 1998 ACM SIGMOD international conference on Management of data, Seattle, Washington, United States, 94-105.
    Ankerst, M., Breunig, M.M., Kriegel, H.P., & Sander, J. (1999). OPTICS: ordering points to identify the clustering structure. Proceedings of the 1999 ACM SIGMOD international conference on Management of data, Philadephia, Pennsylvania, United States, 49-60.
    Altschul, S.F., Gish, W., Miller, W., Myers, E.W., & Lipman, D.J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215, 403-410.
    Bairoch, A., & Apweiler, R. (1997). The Swiss-Prot protein sequence data bank and its supplement TREMBL. Nucleic Acids Research, 25, 31-36.
    Bolten, E., Schliep, A., Schneckener, S., Schomburg, D. & Schrader, R. (2001). Clustering protein sequences-structure prediction by transitive homology. Bioinformatics, 17, 935-941.
    Burke, J., Davison, D., & Hide, W. (1999). d2_cluster: A Validated Method for Clustering EST and Full-Length cDNA Sequence. Genome Research, 9, 1135-1142.
    Claverie, J.M., & States, D.J. (1993). Information enhancement methods for large scale sequence analysis. Computers and Chemistry, 17, 191-201.
    Enright, A.J., & Ouzounis, C.O. (2000). GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics, 16, 451-457.
    Ester, M., Kriegel, H.P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Orgon, 226-231.
    Etzold, T., Ulyanov, A., & Argos, P. (1996). SRS: Information Retrieval System for Molecular Biology Data Banks. Methods in Enzymolog, 226, 114-128.
    Farach, M. (1997). Optimal suffix tree construction with large alphabets. Proceedings of the 38th Annual Symposium on the Foundations of Computer Science, New York, United States, 137-143.
    Fisher, D. (1987). Improving Inference through Conceptual Clustering. Proceedings of 1987 AAAI Conferences, Seattle, Washington, United States, 461-465.
    Gusfield, D. (1997). Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. New York: Cambridge University Press.
    Henikoff, S., & Henikoff, J. (1993). Performance evaluation of amino acid substitution matrices. Proteins, 17, 49-61.
    Hou, J., & Zhang, Y. (2003). Utilizing Hyperlink Transitivity to Improve Web Page Clustering. Proceedings of the 14th Australasian Database Conference, Adelaide, Australia, 49-57.
    Kaufman, L., & Rousseeuw, P. J. (1990). Finding groups in data: an Introduction to cluster analysis. New York: John Wiley & Sons.
    Kohonen,T. (1990). The self-organizing map. Proceedings of the IEEE, 1464-1480.
    Letunic, I., Goodstadt, L., Dickens, N.J., Doerks, T., Schultz, J., Mott, R., Ciccarelli, F., Copley, R.R., Ponting, C.P., & Bork, P. (2002). Recent improvements to the SMART domain-based sequence annotation resource. Nucleic Acids Research, 30, 242-244.
    Li, W., Jaroszewski, L., & Godzik, A. (2001). Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282-283.
    Li, W., Jaroszewski, L., & Godzik, A. (2002). Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 18, 77-82.
    Malde, K., Coward, E., & Jonassen, I. (2003). Fast sequence clustering using a suffix array algorithm. Bioinformatics, 19, 1221-1226.
    McCreight, E.M. (1976). A space-economical suffix tree construction algorithm based system and its evaluation. Journal of the ACM, 262-272.
    McQueen, J.B. (1967). Some Methods of Classification and Analysis of Multivariate Observations. Proceedings of the 5th Berkeley Symposium on Mathematical
    Statistics and Probability, Univ. of California, Berkeley, United States, 281-297.
    Pearson, W.R. (1996). Effective protein sequence comparison. Methods Enzymol, 266, 228-258.
    Pearson, W.R., & Lipman, D.J. (1988). Improved tools for biological sequence comparison. Proceedings of the National Academy of Science, United States, 2444-2448.
    Pedretti, K. (2001). Accurate, parallel clustering of EST (gene) sequences, Master’s Thesis, University of Iowa, Department of Electrical and Computer Engineering.
    Pipenbacher, P., Schliep, A., Schneckener, S., Schonhuth, A., Schomburg, D., & Schrader, R. (2002). ProClust: improved clustering of protein sequences with an extended graph-based approach. Bioinformatics, 18, 181-191.
    Promponas, V.J., Enright, A.J., Tsoka, S., Kreil, D.P., Leroy, C., Hamodrakas, S., Sander, C., & Ouzounis, C.A. (2000). CAST: an iterative algorithm for the complexity analysis of sequence tracts. Bioinformatics, 16, 915-922.
    Sasson, O., Linial, N., & Linial, M. (2002). The metric space of proteins-comparative study of clustering algorithms. Bioinformatics, 18, 14-21.
    Sheikholeslami, G., Chatterjee, S., & Zhang, A. (1998). WaveCluster: A multi-resolution clustering approach for very large spatial databases. Proceedings of 24th International Conference on Very Large Data Bases , New York, United States, 428-439.
    Shibuya, T. (2000). Generalization of a Suffix Tree for RNA Structural Pattern Matching. SWAT 2000, Bergen, Norway, 393-406.
    Shi, J., & Malik, J. (1997). Normalized Cuts and Image Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, 731-737.
    Smith, T.F., & Waterman, M.F. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147, 195-197.
    Spang, R., & Vingron, M. (2001). Limits of homology detection by pairwise sequence comparison. Bioinformatics, 17, 338-342.
    Thompson, J.D., Higgs, D.G., & Gibson, T.J. (1994). CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties, and weight matrix choice. Nucleic Acids Research, 22, 4673-4680.
    Trelles, O., Andrade, M.A., Valencia, A., Zapata, E.L., & Carazo, J.M. (1998). Computational space reduction and parallelization of a new clustering approach for large groups of sequence. Bioinformatics, 14, 439-451.
    Ukkonen, E. (1995). Online construction of suffix trees. Algorithmica, 14, 249-260.
    Wang, W., Jiong, Y., & Richard, M. (1997). STING: a statistical information grid approach to spatial data mining. Proceedings of the 23rd VLDB Conference, Athens, Greece, 186-195.
    Wootton, J.C., & Federhen, S. (1993). Statistics of local complexity in amino-acid-sequences and sequence databases. Computers and Chemistry, 17, 149-163.
    Wootton, J.C. (1994). Sequences with ‘unusual’ amino acid compositions. Current Opinion in Structural Biology, 4, 413-421.
    Wootton, J.C., & Federhen, S. (1996). Analysis of compositionally biased regions in sequence databases. Methods Enzymol, 266, 554-571.

    下載圖示 校內:2005-06-30公開
    校外:2005-06-30公開
    QR CODE