簡易檢索 / 詳目顯示

研究生: 李文勛
Lee, Wun-Shiun
論文名稱: 在去氧核醣核酸序列當中找尋之共同特徵結構訊號
Finding Motif Signals in DNA Sequences
指導教授: 謝孫源
Hsieh, Sun-Yuan
學位類別: 碩士
Master
系所名稱: 電機資訊學院 - 醫學資訊研究所
Institute of Medical Informatics
論文出版年: 2008
畢業學年度: 96
語文別: 英文
論文頁數: 55
中文關鍵詞: 啟發式演算法被植入共同特徵結構訊號問題(l,d)-訊號轉錄因子結合位置整數線性規劃
外文關鍵詞: planted motif search problem, heuristics, transcription factor binding sites, (l,d)-signal, integer linear programming (ILP)
相關次數: 點閱:199下載:4
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 在分子生物學的領域裡,了解如何調控基因表現的機制是一項非常具有挑戰且重要的任務。而在此項挑戰當中如何辨識出調控元素更是一門重要的工作,特別是在去氧核糖核酸序列當中對於轉錄因子結合區的辨識。被植入共同特徵結構訊號問題(planted motif searching problem) 提供了一套以數學化的概念模型來描述此尋找辨識的工作。在本篇論文當中,我們試著以圖形為基礎並分別結合整數線性規劃(integer linear programming)和啟發式(heuristics)方法來解決在去氧核糖核酸序列上找尋被保留下來的區間。除此之外,我們也修正了被植入共同特徵結構訊號問題,使得此模型能更加準確的描述出這樣的找尋辨識工作。最後,我們把我們的啟發式方法與其他目前最受歡迎使用的工具做一個比較。在模擬的資料當中針對不同的序列長度,我們的啟發式方法擁有比其他工具還要好的準確度。而在於真實的生物資料上也有不錯的效果。

    Revealing the mechanisms that regulate gene expression is a major challenge in molecular biology. An important task in this challenge is to identify regulatory elements, especially the binding sites in deoxyribonucleic acid (DNA) for transcription factors. The planted (l,d)-motif searching problem is a mathematical abstraction of the DNA functional site discovery task. In this paper, we propose a heuristic algorithm for the planted (l,d)-motif searching problem which also discovers conserved regions in DNA sequences. By testing on simulated data sets for evaluation, we demonstrate that the performance of our algorithm is better than previously widely-used motif finding algorithms. Moreover, the experimental results on real biological data sets are also provided.

    Abstract(in Chinese) i Abstract(in English) ii Acknowledgement(in Chinese) iii Contents v List of Figures viii List of Tables x 1 Introduction 1 1.1 DNA 1 1.2 Genes and genomes 3 1.3 Transcription and translation 3 1.4 Sequence motif 4 1.5 Transcription factor 5 1.6 Transcription factor binding sites/response elements 6 1.7 Transcription factors and human disease 6 1.8 Abstraction of the DNA functional site discovery 7 1.9 Probabilistic approach algorithms 8 1.10 Combinatorial approach algorithms 9 1.11 Motivation 9 1.12 Preview of this thesis 10 2 Preliminaries 12 2.1 Basic definition of graph theory 12 2.2 The planted (l,d)-motif problem 14 2.3 Sketch of the WINNOWER algorithm 16 3 Method 18 3.1 Phase 1: Pruning weak vertices 19 3.2 Phase 2: Removing spurious edges 21 3.3 Phase 3: Finding motif signals 24 3.3.1 Integer linear programming approach 24 3.3.2 Heuristic approach - Vine 25 3.4 Space complexity and time complexity 29 3.4.1 Space complexity 29 3.4.2 Time complexity 30 4 Experimental Results 32 4.1 Synthetic data generator 32 4.2 Synthetic data performance evaluation 32 4.3 Real biological data 36 4.4 Real biological data performance evaluation 36 5 Discussion and Conclusion 40 6 Availability and Supplemental Material 41 6.1 The web server 41 6.2 FASTA format 42 6.3 The input page 43 6.4 Documentation 44 6.4.1 Name 44 6.4.2 Description 44 6.4.3 Essential options 44 6.4.4 Advanced options 45 6.4.5 Developers 45 6.5 The result page 46 Acknowledgement 48 Bibliography 49

    [1] M. M. Babu, N. M. Luscombe, L. Aravind, M. Gerstein, and S. A. Teichmann, "Structure and Evolution of Transcriptional Regulatory Networks," Current Opinion in Structural Biology, vol. 14, p. 283V291, 2004.
    [2] T. Bailey and C. Elkan, "Fitting a Mixture Model by Expectation Maximization to Discover Motifs in Biopolymers," Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28--36, 1994.
    [3] M. Blanchette, "Algorithms for Phylogenetic Footprinting," Proceedings of the Fifth Annual International Conference on Research in Computational Molecular Biology, pp. 49--58, 2001.
    [4] A. H. Brivanlou and J. E. Darnell, "Signal Transduction and the Control of Gene Expression," Science, vol. 295, p. 813--818, 2002.
    [5] J. Buhler and M. Tompa, "Finding Motifs Using Random Projections," Journal of Computational Biology, vol. 9, pp. 225--242, 2002.
    [6] A. M. Carvalho, A. T., Freitas, A. L. Oliveira, and M.-F. Sagot, "A Highly Scalable Algorithm for the Extraction of CIS-Regulatory Regions," Proceedings of the 3rd Asia-Pacific Bioinformatics Conference, vol. 1, pp. 273--282, 2005.
    [7] F. Y. L. Chin and H. C. M. Leung, "Voting Algorithms for Discovering Long Motifs," Proceedings of the 3rd Asia-Pacific Bioinformatics Conference, pp. 261--271, 2005.
    [8] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. Cambridge, MA: MIT Press, 2002.
    [9] G. E. Crooks, G. Hon, J. M. Chandonia, and S. E. Brenner, "WebLogo: A Sequence Logo Generator," Genome Research, vol. 14, pp. 1188--1190, 2004.
    [10] J. Davila, S. Balla, and S. Rajasekaran, "Space and Time Eficient Algorithms for Planted Motif Search," Proceedings of the Second International Workshop Bioinformatics Research and Applications, 2006.
    [11] --, "Fast and Practical Algorithms for Planted (l,d) Motif Search," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, pp. 544--552, 2007.
    [12] C. Debouck and P. N. Goodfellow, "DNA Microarrays in Drug Discovery and Development," Nature Genetics, vol. 21, pp. 48--50, 1999.
    [13] E. Eskin and P. A. Pevzner, "Finding Composite Regulatory Patterns in DNA Sequences," Bioinformatics, vol. 18, pp. 354--363, 2002.
    [14] P. A. Evans and A. Smith, "Toward Optimal Motif Enumeratio," Proceedings of the Eighth International Workshop Algorithms and Data Structures, vol. 2751, pp. 47--58, 2003.
    [15] P. A. Evans, A. Smith, and H. T. Wareham, "On the Complexity of Finding Common Approximate Substrings," Theoretical Computer Science, vol. 306, pp. 407--430, 2003.
    [16] B. Ewan, S. A. John, and D. Anindya, "Identification and Analysis of Functional Elements in 1% of the Human Genome by the ENCODE Pilot Project," Nature, vol. 447, pp. 799--816, 2007.
    [17] S. Geman and D. Geman, "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images," Readings in Uncertai Reasoning, pp. 452--472, 1990.
    [18] T. Gregory, "The C-Value Enigma in Plants and Animals: A Review of Parallels and an Appeal for Partnership," Annals of Botany, vol. 95, pp. 133--146, 2005.
    [19] P. Harrison, H. Hegyi, S. Balasubramanian, N. Luscombe, P. Bertone, N. Echols, T. Johnson, and M. Gerstein, "Molecular Fossils in the Human Genome: Identification and Analysis of the Pseudogenes in Chromosomes 21 and 22," Genome Research, vol. 12, pp. 272--280, 2002.
    [20] P. M. Harrison and M. Gerstein, "Studying Genomes Through the Aeons: Protein Families, Pseudogenes and Proteome Evolution," Journal of Molecular Biology, vol. 318, pp. 1155--1174, 2002.
    [21] H. O. Hartley, "Maximum Likelihood Estimation from Incomplete Data," Biometrics, vol. 14, pp. 174--194, 1958.
    [22] M. Hase, T. Yokomizo, T. Shimizu, and M. Nakamura, "Characterization of an Orphan G Protein-Coupled Receptor, GPR20, That Constitutively Activates Gi Proteins," The Journal of Biological Chemistry, vol. 283, pp. 12747--12755, 2008.
    [23] G. Z. Hertz and G. D. Stormo, "Identifying DNA and Protein Patterns with Statistically Significant Alignments of Multiple Sequences," Bioinformatics, vol. 15, pp. 563--577, 1999.
    [24] H. J. Huttunen and H. Rauvala, "Amphoterin as an Extracellular Regulator of Cell Motility: From Discovery to Disease," Journal of Internal Medicine, vol. 255, pp. 351--366, 2004.
    [25] M. Karin, "Too Many Transcription Factors: Positive and Negative Interactions," The New biologist, vol. 2, pp. 126--131, 1990.
    [26] U. Keich and P. A. Pevzner, "Finding Motifs in the Twilight Zone," Bioinformatics, vol. 18, pp. 374--381, 2002.
    [27] K. F. Koehler, L. A. Helguero, L.-A. Haldosen, M. Warner, and J.-A. Gustafsson, "Reflections on the Discovery and Significance of Estrogen Recepter B," The Endocrine Society, vol. 26, pp. 465--478, 2005.
    [28] D. S. Latchman, "Transcription Factors: An Overview," International Journal of Biochemistry and Cell Biology, vol. 29, pp. 1305--1312, 1997.
    [29] C. Lawrence, S. Altschul, M. Boguski, J. Liu, A. Neuwald, and J. Wootton, "Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment," Science, vol. 262, pp. 208--214, 1993.
    [30] T. I. Lee and R. A. Young, "Transcription of Eukaryotic Protein-Coding Genes," Annual Reviews in Genetics, vol. 34, pp. 77--137, 2000.
    [31] M. Li, B. Ma, and L. Wang, "On the Closest String and Substring Problems," Journal of the ACM, vol. 49, pp. 157--171, 2002.
    [32] X. Liu, J. S. Liu, and D. L. Brutlag, "BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes," Pacific Symposium on Biocomputing, pp. 127--138, 2001.
    [33] L. Marsan and M.-F. Sagot, "Extracting Structured Motifs Using a Su±x TreeX-Algorithms and Application to Promoter Consensus Identi¯cation," Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, 2000.
    [34] T. Nagahata, T. Sato, A. Tomura, M. Onda, K. Nishikawa, and M. Emi, "Identification of RAI3 as a Therapeutic Target for Breast Cancer," Endocrine-Related Cancer, vol. 12, pp. 65--73, 2005.
    [35] D. B. Nikolov and S. K. Burley, "RNA Polymerase II Transcription Initiation: A Structural View," Proceedings of the National Academy of Sciences, vol. 94, pp. 15--22, 1997.
    [36] E. V. Nimwegen, "Scaling Laws in the Functional Content of Genomes," Trends in Genetics, vol. 19, pp. 479--484, 2003.
    [37] C. Nugent and L. Lundblad, "The Telomerase Reverse Transcriptase: Components and Regulation," Genes and Development, vol. 12, pp. 1073--1085, 1998.
    [38] P. A. Pevzner and S.-H. Sze, "Combinatorial Approaches to Finding Subtle Signals in DNA Sequences," Intelligent Systems for Molecular Biology, pp. 269--278, 2000.
    [39] A. Pidoux and R. Allshire, "The Role of Heterochromatin in Centromere Function," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 360, pp. 569--579, 2005.
    [40] N. Pisanti, A. Carvalho, L. Marsan, and M.-F. Sagot, "RISOTTO: Fast Extraction of Motifs with Mismatches," Proceedings of the 7th Latin American Theoretical Symposium, pp. 757--768, 2006.
    [41] A. Price, S. Ramabhadran, and P. A. Pevzner, "Finding Subtle Motifs by Branching from Sample Strings," Bioinformatics, vol. 19, pp. 149--155, 2003.
    [42] S. Rajasekaran, S. Balla, and C.-H. Huang, "Exact Algorithms for Planted Motif Problems," Journal of Computational Biology, vol. 8, pp. 1117--1128, 2005.
    [43] J. C. Reed, "Apoptosis-Regulating Proteins as Targets for Drug Discovery," TRENDS in Molecular Medicine, vol. 7, pp. 314--319, 2001.
    [44] I. Rigoutsos and A. Floratos, "Combinatorial Pattern Discovery in Biological Sequences: The TEIRESIAS Algorithm," Bioinformatics, vol. 14, pp. 56--57, 1998.
    [45] R. G. Roeder, "The Role of General Initiation Factors in Transcription by RNA Polymerase II," Trends in Biochemical Sciences, vol. 21, pp. 327--335, 1996.
    [46] Y. C. Tseng, T. Y. Juang, and C. J. Cnang, "Congestion-Free, Dilation-2 Embedding of Complete Binary Tree in Star Graphs," Networks, vol. 33, no. 3, pp. 221--231, 1999.
    [47] R. Siddharthan, E. D. Siggia, and E. van Nimwegen, "Phylogibbs: A Gibbs Sampler Incorporating Phylogenetic Information," PLoS Computational Biology, vol. 1, pp. 534--556, 2005.
    [48] S. Sivashankari and P. Shanmughavel, "Comparative Genomics - A Perspective," Bioinformation, vol. 1, pp. 376--378, 2007.
    [49] G. D. Stormo, "DNA Binding Sites: Representation and Discovery," Bioinformatics, vol. 16, pp. 16--23, 2000.
    [50] G. D. Stormo and G. W. Hartzell, "Identifying Protein-Binding Sites from Unaligned DNA Fragments," Proceedings of the National Academy of Sciences of the United States of America, vol. 86, pp. 1183--1187, 1989.
    [51] Y. Suh and J. Vijg, "SNP Discovery in Associating Genetic Variation with Human Disease Phenotypes," Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, vol. 573, pp. 41--53, 2005.
    [52] S. M. Tareeq, S. Saha, T. Islam, and R. Quazi, "ANT: A Novel Heuristic Algorithm for Finding Motif," Information Technology Journal, vol. 6, pp. 189--195, 2007.
    [53] M. Thanbichler, S. C. Wang, and L. Shapiro, "The Bacterial Nucleoid: A Highly Organized and Dynamic Structure," J Cell Biochem, vol. 96, pp. 506--521, 2005.
    [54] J. W. Thomas, J. W. Touchman, R. W. Blakesley, G. G. Bouffard, S. M. Beckstrom-Sternberg, and E. H. Margulies, "Comparative Analyses of Multi-Species Sequences from Targeted Genomic Regions," Nature, vol. 424, pp. 788--793, 2003.
    [55] J. C. Venter, M. D. Adams, E. W. Myers, P. W. Li, and R. J. Mural, "The Sequence of the Human Genome," Science, vol. 291, pp. 1304--1351, 2001.
    [56] D. B. West, Introduction to Graph Theory, 2nd ed. Upper Saddle River, NJ: Prentice Hall, 2001.
    [57] E. Wingender, P. Dietze, H. Karas, and R. Knuppel, "TRANSFAC: A Database on Transcription Factors and Their DNA Binding Sites," Nucleic Acids Research, vol. 24, pp. 238--241, 1996.
    [58] T.Wolfsberg, J. McEntyre, and G. Schuler, "Guide to the Draft Human Genome," Nature, vol. 409, pp. 824--826, 2001.
    [59] C. T. Workman and G. D. Stormo, "ANN-SPEC: A Method for Discovering Transcription Factor Binding Sites with Improved Speci¯city," Pacific Symposium of Biocomputing, vol. 5, pp. 464--475, 2000.
    [60] F. Y. Xie, M. C. Woodle, and P. Y. Lu, "Harnessing in vivo siRNA Delivery for Drug Discovery and Therapeutic Development," Drug Discovery Today, vol. 11, pp. 67--73, 2006.
    [61] E. Zaslavsky and M. Singh, "A Combinatorial Optimization Approach for Diverse Motif Finding Applications," Algorithms for Molecular, vol. 1, p. 13, 2006.

    下載圖示 校內:2013-08-05公開
    校外:2013-08-05公開
    QR CODE