簡易檢索 / 詳目顯示

研究生: 吳宗翰
Wu, Tsung-Han
論文名稱: 運用圖形探勘方法在蛋白質三維結構裡找尋頻繁子圖
Finding Frequent Subgraphs in Protein 3-D Structures Using Graph Mining Approach
指導教授: 謝孫源
Hsieh, Sun-Yuan
學位類別: 碩士
Master
系所名稱: 電機資訊學院 - 醫學資訊研究所
Institute of Medical Informatics
論文出版年: 2012
畢業學年度: 100
語文別: 英文
論文頁數: 55
中文關鍵詞: 頻繁子圖搜尋蛋白質結構基序蛋白質三維結構原子位移參數
外文關鍵詞: Frequent subgraph mining, amino acid, protein three-dimensional(3D) structure, B-factor
相關次數: 點閱:132下載:3
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 搜尋頻繁子圖問題是一個基本問題存在於資訊勘測的研究當中。在這份論文裡,我們利用子圖搜尋的方法應用到蛋白質結構並找尋可能的頻繁子圖。這些頻繁子圖稱為經常性蛋白質結構基序且用來表現蛋白質結構裡的特徵。利用簡單連接圖來代表每個蛋白質的三維結構,圖形裡的每一個點用來表示二十種蛋白質的胺基酸,當兩點間的距離小於設定的門檻值時兩點間有邊相連。更進一步的,我們利用原子位移參數有效降低圖形裡點和邊的複雜程度並加快程式執行速度。
    在實驗的部分,我們藉由使用實際的蛋白質資料庫作為評估,我們可以顯示出,我們的方法表現比早先已知的方法優越。而且,我們的方法是一個簡單、彈性和容易地被實施的方法。

    The problem of finding frequent subgraphs is a fundamental problem in the data mining research. In this paper, we apply mining approach to find frequent subgraphs in protein structure, and they are also known as recurring protein structure motif and the characteristic of protein structure. Using the simple connected graph to express the protein three-dimensional (3D) structure. Each protein structure in dataset, node represents the twenty types of amino acid, and edge represents a contact distance between two amino acids. Furthermore, we found the graph representation based on B-factor(atomic displacement parameters) can significantly reduce the number of nodes and edges in the graph representation and hence present computational advantage. By using a set of graphs representing a real-world dataset, we demonstrate that the performance of our method is superior to previously known methods. Moreover, our method is a simple, flexible, and easily implemented one for the problem.

    Contents v List of Figures vii List of Tables x 1 Introduction 1 1.1 Motivation 1 1.2 Results 2 1.3 Organization 3 2 Bioinoformatics Background 5 2.1 Amino acids and Protein 5 2.2 Protein Structure 7 2.3 Four types of protein structure 8 2.4 NP-hard and NP-complete 12 2.5 Basics of Graph Theory 14 2.6 Graph Isomorphism Problems 17 2.6.1 Notations 17 2.6.2 Graph Isomorphism 18 2.6.3 Graph Isomorphism Problems 19 2.6.4 Subgraph Isomorphism Problems 20 3 Related Methods 23 3.1 Search strategy (BFS) 23 3.2 Search strategy (DFS) 24 4 Preliminaries 25 5 Method 29 5.1 Building protein graphs 29 5.2 Weighted Adjacency Matrix of Labeled Graphs 31 5.3 Algorithm Detail 32 6 Experimental Results 40 6.1 Chemical Compound Datasets 40 6.2 The protein data bank Dataset 41 6.3 The meaning of frequent subgraph 46 7 Conclusion 51 Bibliography 52

    [1] R. Agrawal, R. Srikant, "Fast Algorithms for Mining Association Rules," in Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487-499, September 12-15, 1994.
    [2] P. Aloy, E. Querol, F. X. Aviles, and M. J. E. Sternberg, "Automates Structure-based Prediction of Functional Sites in Proteins: Applications to Assessing the Validity of Inheriting Protein Function from Homology in Genome Annotation and to Protein Docking," Journal of Molecular Biology, vol. 311, pp. 395-408, 2001.
    [3] Z. Aung and K. L. Tan, "Automatic Protein Structure Classification through Structural Fingerprinting," the Fourth IEEE Symposium on Bioinformatics and Bioengineering, pp. 508-515, 2004.
    [4] D. Banyopadhyay, J. Huan, J. Liu, J. Prins, J. Snoeyink, A. Tropsha, and W. Wang, "Using Fast Subgraph Isomorphism Checking for Protein Functional Annotation Using SCOP and Gene Ontology," UNC CS Technical Report, 2005.
    [5] D. Bandyopadhyay, J. Huan, J. Liu, J. Prins, J. Snoeyink, W. Wang, and A. Tropsha, "Functional Neighbors: Inferring Relationships between Nonhomologous Protein Families Using Family-Specific Packing Motifs," IEEE Transactions on Information Technology in Biomedicine, vol. 14, no. 5, pp. 1137-1143, 2010.
    [6] H. M. Berman, "The Protein Data Bank: a historical perspective," Acta Crystallographica Section A: Foundations of Crystallography, A64, pp. 88-95, 2008.
    [7] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov and P. E. Bourne, "The Protein Data Bank," Nucleic Acids Res, vol. 28, pp. 235-242, 2000.
    [8] S. Chakravarthy and S. Pradhan, "DB-FSG: An SQL-Based Approach for Frequent Subgraph Mining," Lecture Notes in Computer Science, vol. 5181, pp. 684-692, 2008.
    [9] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms (Second Eition), MIT Press, Cambridge, 2001.
    [10] L. B. Holder, D. J. Cook, and S. Djoko, "Substructure Discovery in the SUBDUE System," in Proceedings of the AAAI Workshop on Knowledge Discovery in Database, pp. 169-180, 1994.
    [11] J. Huan, D. Bandyopadhyay, W. Wang, J. Snoeyink, J. Prins, and A. Tropsha, "Comparing Graph Representations of Protein Structure for Mining Family-Specific Residue-Based Packing Motifs," Journal of Computational Biology, vol. 12, pp. 657-671, 2005.
    [12] J. Huan, W. Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha, "Mining Protein Family Specific Residue Packing Patterns From Protein Structure Graphs," Proceedings of the eighth annual international conference on Research in computational molecular biology, pp. 308-315, March 27-31, 2004.
    [13] J. Huan, W. Wang, and J. Prins, "Efficient Mining of Frequent Subgraph in the Presence of Isomorphism," in Proceedings of the Third IEEE International Conference on Data Mining, pp. 549-552, November 19-22, 2003.
    [14] A. Inokuchi, T. Washio, and H. Motoda, "Derivation of the topology structure from massive graph data," Discovery Science: Proceedings of the Second International Conference, DS'99, pp. 330-332, 1999.
    [15] A. Inokuchi, T. Washio, and H. Motoda, "An Apriori-based Algorithm for Mining Frequent Substructures from Graph Data," the 4th European Conference on Principles and Practice of Knowledge Discovery in Data Mining (PKDD2000), pp. 13-23, 2000.
    [16] V. Krishna, N. N. R. R. Suri, and G. Athithan, "A comparative survey of algorithms for frequent subgraph discovery," Current Science, vol. 100, no. 25, 2011.
    [17] M. Kuramochi and G. Karypis, "Frequent Subgraph Discovery," IEEE Conference on Data Mining}, pp. 313-320, 2001.
    [18] W. W. M. Lam, and K. C. C. Chan, "A graph mining algorithm for classifying chemical compounds," in BIBM '08: Proceedings of the 2008 IEEE International Conference on Bioinformatics and Biomedicine, (Washington, DC, USA), pp. 321-324, IEEE Computer Society, 2008.
    [19] S. Nijssen and J. N. Kok, "A Quickstart in Frequent Structure Mining can make a Difference," Proc. 10th ACM SIGKDD, pp. 647-652, 2004.
    [20] S. Nijssen and J. N. Kok, "Frequent Graph Mining and its Application to Molecular Databases," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, vol. 5, pp. 4571-4577, 2004.
    [21] S. Nijssen and J. N. Kok, "The Gaston Tool for Frequent Subgraph Mining," Proc. Int'l Workshop on Graph-Based Tools (Grabats), Electronic Notes in Theoretical Computer Science, pp. 77-87, 2004.
    [22] S. Parthasarathy and M. R. N. Murthy, "Protein thermal stability: insights from atomic displacement parameters (B-values)," Protein Engineering, vol. 13, no. 1, pp. 9-13, 2000.
    [23] A. Smalter, J. Huan, Y. Jia, and G. Lushington, "GPD: A Graph Pattern Diffusion Kernel for Accurate Graph Classification with Applications in Cheminformatics," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 2, pp. 197-207, 2010.
    [24] S. Vishveshwara, K. V. Brinda, and N. Kannan, "Protein Structure: Insights From Graph Theory," Journal of Theoretical and Computational Chemistr}, vol. 1, no. 1, pp. 187-211, 2002.
    [25] B. Wackersreuther, P. Wackersreuther, and A. Oswald, "Frequent Subgraph Discovery in Dynamic Networks," in MLG '10 Washington, DC USA, 2010.
    [26] N. Weskamp, D. Kuhn, E. Hüllermeier, and G. Klebe, "Efficient similarity search in protein structure databases by k -clique hashing," Bioinformatics, vol. 20, pp. 1522-1526, 2005.
    [27] D. W. Williams, J. Huan, and W. Wang, "Graph Database Indexing Using Structured Graph Decomposition," In ICDE, 2007.
    [28] X. Yan, and J. Han, "gSpan: Graph-Based Substructure Pattern Mining," IEEE International Conference, 2002.
    [29] X. Yan and J. Han, "CloseGraph: Mining Close Frequent Graph Patterns," in Proceeding of the 2003 ACM SIGKDD international conference on knowledge discovery and data mining (KDD' 03), Washington, DC, pp. 286-295, 2003.
    [30] Z. Yuan, T. L. Bailey, and R. D. Teasdale, "Prediction of protein B-factor profiles," Proteins, vol. 58, pp. 905-912, 2005.
    [31] Protein data obtained from Protein Data Bank. http://www.rcsb.org/pdb/home/home.do

    下載圖示 校內:2017-09-12公開
    校外:2017-09-12公開
    QR CODE