| Graduate student: | 郭昱伶 Guo, Yu-Ling |
|---|---|
| Thesis title: | 可視化轉錄因子結合位點與甲基化資訊之實現 Computational pipeline for visualizing transcription factor binding site with methylation information |
| Advisor: | 賀保羅 Horton, Paul |
| Degree: | Master |
| Department: | 電機資訊學院 - 醫學資訊研究所 Institute of Medical Informatics |
| Year of publication: | 2023 |
| Academic year of graduation: | 111 (ROC calendar) |
| Language: | English |
| Number of pages: | 33 |
| Keywords (Chinese): | 甲基化序列標誌, 轉錄因子, DNA 甲基化 |
| Keywords (English): | MethylSeqLogo, Transcription Factor, DNA methylation |
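Transcription factor (TF) binding and DNA methylation near genes play important roles in gene regulation. Hsu and Horton recently developed MethylSeqLogo, a "methylation smart" extension of sequence logos that simultaneously displays the DNA sequence and methylation information of a set of transcription factor binding sites (relative to some background distribution). However, their MethylSeqLogo design has two limitations. The first is that they did not fully automate the process of generating MethylSeqLogo images from raw data. This is a major problem because TF binding and DNA methylation are tissue specific, so each user may be interested in a different MethylSeqLogo image. Hsu and Horton provide open source software to generate MethylSeqLogo images, but require the user to supply the transcription factor binding site sequences and methylation information shown in the image. Unfortunately, extracting this information from the standard primary or secondary data files of the relevant experiments (ChIP-seq for transcription factor binding sites and bisulfite sequencing for DNA methylation) takes several steps. The second limitation is that they provide only two background distributions (whole genome and a set of promoter regions) against which to compare the DNA methylation level of transcription factor binding sites, but the binding site distribution of some TFs may match neither of these background models.
This study describes original software we developed to support and extend MethylSeqLogo. Our goals are: 1) to create an automated computational pipeline that generates MethylSeqLogo images from experimental data in a standard data file format (BED file); and 2) to extend the MethylSeqLogo software to support TF-specific background models based on the regions flanking TF binding sites (flanking region).
After describing the implementation of our method, we show several MethylSeqLogo images generated with our pipeline. When using the same (whole genome) background model, we confirm that these MethylSeqLogo images are consistent with the images published by Hsu and Horton. In addition, we compare MethylSeqLogo images generated with the flanking region background model against those generated with the promoter region background. We find that in some cases where the promoter background model seems to indicate that DNA methylation plays an important role, most of the correlation between DNA methylation and binding sites disappears when the flanking region background model is used. This observation underscores the need to use an appropriate background when interpreting statistical trends in data.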
Both transcription factor (TF) binding and DNA methylation on or near genes are known to play important roles in gene regulation. Recently Hsu & Horton developed MethylSeqLogo, a "methylation smart" extension to sequence logos which simultaneously shows the DNA sequence and methylation patterns of a collection of TF binding sites (vis-à-vis some background distribution).
Their MethylSeqLogo design appears useful, but their work has two limitations. The first limitation is that they did not fully automate the process of producing MethylSeqLogo images from primary data. This is problematic because both TF binding and DNA methylation are tissue specific, so every user is potentially interested in seeing a different MethylSeqLogo image. Hsu & Horton do provide open source software to produce MethylSeqLogo images, but require the user to provide the TF binding site sequence and methylation information shown in the images. Unfortunately, several steps are needed to extract that information from the standard primary or secondary data files widely available from the relevant experiments (ChIP-seq for TF binding and bisulfite sequencing for DNA methylation). The second limitation is that they provide only two background distributions (whole genome and a set of promoter regions) against which to compare the DNA methylation level of TF binding sites, but the binding site distribution of some TFs may not fit either of those background models well.
This work describes our development of original software to support and extend MethylSeqLogo. We 1) create an automated pipeline that produces MethylSeqLogos from experimental data in standard data file formats, and 2) extend the MethylSeqLogo software to support TF-specific background models based on the regions flanking their binding sites.
After describing our implementation, we show several MethylSeqLogo images generated using our pipeline. We confirm that those MethylSeqLogo images are consistent with the images published by Hsu & Horton when using the same (whole genome) background model. Furthermore, we contrast the MethylSeqLogo images produced with a flanking region background model versus their promoter region background. We find that in some cases for which the promoter background model seems to indicate an important role for DNA methylation, much of the apparent correlation between DNA methylation and binding disappears when using the flanking region background model. This observation underscores the need for using a suitable background when interpreting statistical trends in data.
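As a rough illustration of the flanking region idea, the sketch below reads binding site intervals in BED format and derives the intervals immediately to the left and right of each site. It is not the thesis pipeline: the function names `parse_bed` and `flanking_regions` and the default flank width of 50 bp are hypothetical choices made for this example.

```python
from typing import Iterable, List, Tuple

# (chromosome, start, end) using BED conventions: 0-based, half-open.
Interval = Tuple[str, int, int]


def parse_bed(lines: Iterable[str]) -> List[Interval]:
    """Read the first three BED columns, skipping blank and header lines."""
    sites: List[Interval] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        sites.append((chrom, int(start), int(end)))
    return sites


def flanking_regions(sites: Iterable[Interval], flank: int = 50) -> List[Interval]:
    """Return the intervals directly left and right of each binding site.

    `flank` is a hypothetical width in base pairs. The left flank is clipped
    at position 0; the right flank is NOT clipped at the chromosome end,
    because chromosome lengths are not available in this sketch.
    """
    flanks: List[Interval] = []
    for chrom, start, end in sites:
        left_start = max(0, start - flank)
        if left_start < start:  # no left flank for a site starting at 0
            flanks.append((chrom, left_start, start))
        flanks.append((chrom, end, end + flank))
    return flanks
```

A call such as `flanking_regions(parse_bed(open("sites.bed")))`, with `sites.bed` a placeholder file name, would yield the intervals from which a TF-specific background could then be estimated.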
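The effect of the background choice can be illustrated with a small calculation. The sketch below computes the relative entropy (Kullback-Leibler divergence) between the methylation probability at a binding site position and that of a background. This is a generic two-outcome formula chosen for illustration, not necessarily the exact quantity plotted by MethylSeqLogo, and the function name and all probabilities are hypothetical.

```python
import math


def methylation_relative_entropy(p_site: float, q_background: float) -> float:
    """Relative entropy in bits between two methylated/unmethylated distributions.

    p_site:       probability that the cytosine is methylated in the binding sites
    q_background: probability that it is methylated in the background model
    """
    if not 0.0 <= p_site <= 1.0:
        raise ValueError("p_site must lie in [0, 1]")
    if not 0.0 < q_background < 1.0:
        raise ValueError("q_background must lie strictly between 0 and 1")
    total = 0.0
    for p, q in ((p_site, q_background), (1.0 - p_site, 1.0 - q_background)):
        if p > 0.0:  # a term with p == 0 contributes nothing
            total += p * math.log2(p / q)
    return total


# Hypothetical numbers: the same site looks strongly methylation-specific
# against a lowly methylated background, but not against a background whose
# methylation level resembles that of the site.
vs_low_background = methylation_relative_entropy(0.8, 0.1)
vs_similar_background = methylation_relative_entropy(0.8, 0.75)
```

With these made-up values, the first quantity is much larger than the second, which mirrors the observation that an apparent methylation signal can shrink once a better matched background is used.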
[1] Samuel A Lambert et al. “The human transcription factors”. Cell 172.4 (2018), pp. 650–665.
[2] Ryan Lister and Joseph R Ecker. “Finding the fifth base: genome-wide sequencing of cytosine methylation”. Genome research 19.6 (2009), pp. 959–966.
[3] Coby Viner et al. “Modeling methyl-sensitive transcription factor motifs with an expanded epigenetic alphabet”. bioRxiv (2016), p. 043794.
[4] Peter A Jones. “Functions of DNA methylation: islands, start sites, gene bodies and beyond”. Nature reviews genetics 13.7 (2012), pp. 484–492.
[5] Keith D Robertson. “DNA methylation and human disease”. Nat Rev Genet 6 (2005), pp. 597–610.
[6] Andrew P Feinberg and Bert Vogelstein. “Hypomethylation distinguishes genes of some human cancers from their normal counterparts”. Nature 301.5895 (1983), pp. 89–92.
[7] Serge Saxonov, Paul Berg, and Douglas L Brutlag. “A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters”. Proceedings of the National Academy of Sciences 103.5 (2006), pp. 1412–1417.
[8] Aimée M Deaton and Adrian Bird. “CpG islands and the regulation of transcription”. Genes & development 25.10 (2011), pp. 1010–1022.
[9] Thomas D Schneider and R Michael Stephens. “Sequence logos: a new way to display consensus sequences”. Nucleic acids research 18.20 (1990), pp. 6097–6100.
[10] MC Thomsen and M Nielsen. “Seq2Logo: a method for construction and visualization of amino acid binding motifs and sequence profiles including sequence weighting, pseudo counts and two-sided representation of amino acid enrichment and depletion”. Nucleic Acids Research 40 (May 2012).
[11] KK Dey, D Xie, and M Stephens. “A new sequence logo plot to highlight enrichment and depletion”. BMC bioinformatics 19.473 (2018).
[12] András Micsonai et al. “BeStSel: a web server for accurate protein secondary structure prediction and fold recognition from the circular dichroism spectra”. Nucleic acids research 46.W1 (2018), W315–W322.
[13] M Siebert and J Söding. “Bayesian Markov models consistently outperform PWMs at predicting motifs in nucleotide sequences”. Nucleic Acids Res 44.13 (2016), pp. 6055–69.
[14] Paul Horton and Fei-Man Hsu. “MethylSeqLogo: DNA methylation smart sequence logos”. bioRxiv (2022), pp. 2022–11.
[15] Juan M Vaquerizas et al. “A census of human transcription factors: function, expression and evolution”. Nature Reviews Genetics 10.4 (2009), pp. 252–263.
[16] Aaron R Quinlan and Ira M Hall. “BEDTools: a flexible suite of utilities for comparing genomic features”. Bioinformatics 26.6 (2010), pp. 841–842.
[17] Wes McKinney et al. “Data structures for statistical computing in python”. Proceedings of the 9th Python in Science Conference. Vol. 445. Austin, TX. 2010, pp. 51–56.
[18] Michael L. Waskom. “seaborn: statistical data visualization”. Journal of Open Source Software 6.60 (2021), p. 3021. DOI: 10.21105/joss.03021. URL: https://doi.org/10.21105/joss.03021.
[19] Solomon Kullback and Richard A Leibler. “On information and sufficiency”. The annals of mathematical statistics 22.1 (1951), pp. 79–86.
[20] Ryan K Dale, Brent S Pedersen, and Aaron R Quinlan. “Pybedtools: a flexible Python library for manipulating genomic datasets and annotations”. Bioinformatics 27.24 (2011), pp. 3423–3424.
[21] Thomas D Schneider et al. “Information content of binding sites on nucleotide sequences”. Journal of molecular biology 188.3 (1986), pp. 415–431.
[22] Jaime A Castro-Mondragon et al. “JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles”. Nucleic acids research 50.D1 (2022), pp. D165–D173.
[23] Fayrouz Hammal et al. “ReMap 2022: a database of Human, Mouse, Drosophila and Arabidopsis regulatory regions from an integrative analysis of DNA-binding sequencing experiments”. Nucleic acids research 50.D1 (2022), pp. D316–D325.
[24] ENCODE Project Consortium et al. “An integrated encyclopedia of DNA elements in the human genome”. Nature 489.7414 (2012), p. 57.
[25] Michael Allevato et al. “Sequence-specific DNA binding by MYC/MAX to low-affinity non-E-box motifs”. PloS one 12.7 (2017), e0180147.
[26] Ximei Luo et al. “Effects of DNA methylation on TFs in human embryonic stem cells”. Frontiers in genetics 12 (2021), p. 639461.
[27] Peter F Johnson. “Identification of C/EBP basic region residues involved in DNA sequence recognition and half-site spacing preference”. Molecular and cellular biology 13.11 (1993), pp. 6919–6930.