| Graduate student: | Ni, Kuo-Hsun (倪國勛) |
|---|---|
| Thesis title: | Extending the EAGLE Genome Variant Calling Method for Use in Inferring Cytosine Methylation from Bisulfite Sequencing Data |
| Advisor: | Paul Horton (賀保羅) |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2021 |
| Academic year of graduation: | 109 |
| Language: | English |
| Number of pages: | 44 |
| Keywords (Chinese): | 甲基化 (methylation), 機率模型 (probability model) |
| Keywords (English): | Methylation, Probability Model |
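DNA methylation is a prominent topic in epigenetics. Methylation of DNA reduces transcription of the affected gene regions and thereby suppresses production of the corresponding proteins, a phenomenon known as gene silencing, and it plays a central role in development and cancer. In vertebrates, DNA methylation generally occurs at CpG sites (cytosine-phosphate-guanine sites, that is, positions in the DNA sequence where a cytosine is immediately followed by a guanine), where DNA methyltransferases catalyze the conversion of cytosine to 5-methylcytosine. In human genes, about 70%-80% of CpG sites (CG sites) are methylated, but certain regions, such as the cytosine- and guanine-rich CpG islands, remain unmethylated. CpG islands are associated with the promoters of 56% of mammalian genes, including all ubiquitously expressed genes. The degree of methylation therefore has a substantial bearing on disease and physiological traits in mammals, which makes it an important subject of study.
EAGLE (Explicit Alternative Genome Likelihood Evaluator) [1] is a project that scores how strongly an individual's sequencing data support candidate variants. Given a reference genome, variant data (a Variant Call Format file), and the individual's sequence data (a Binary Alignment/Map file), it builds a probability model and scores each variant. A higher score means that more of the individual's reads support the occurrence of that variant.
We therefore set out to carry EAGLE's probabilistic evaluation of variants at genomic positions over to methylation, so as to evaluate how likely each genomic position is to be methylated.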
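The best-known method for measuring DNA methylation today is bisulfite sequencing. Bisulfite treatment of DNA converts cytosine residues (C) to uracil (U) but leaves methylated cytosine residues unaffected. In the subsequent polymerase chain reaction (PCR), the uracil (U) derived from cytosine is replaced by thymine (T), while methylated cytosine residues are again unaffected. After bisulfite treatment followed by PCR, we therefore obtain bisulfite-sequencing reads in which unmethylated cytosines (C) appear as thymines (T).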
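The conversion described above can be illustrated with a minimal Python sketch. It assumes complete conversion, considers the forward strand only, and uses a hypothetical helper name (`bisulfite_convert`) and a made-up example sequence; none of this is taken from the thesis.

```python
from typing import Set


def bisulfite_convert(seq: str, methylated: Set[int]) -> str:
    """Return the forward-strand sequence as read after bisulfite treatment and PCR.

    Assumption: conversion is complete. An unmethylated C is read as T;
    a C whose 0-based index is in `methylated` stays C. Other bases are unchanged.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


original = "ACGTCCGA"
# Hypothetical example: only the C at index 1 (in the first CpG) is methylated.
converted = bisulfite_convert(original, methylated={1})
print(converted)
```

Comparing `converted` with `original` position by position shows which cytosines were protected, which is the comparison that methylation inference relies on.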
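Combining Meth-EAGLE's variant evaluation with these properties of bisulfite-sequenced reads, the design in our thesis can be divided into three parts.
In the first part, we evaluate the likelihood scores of the variant data with respect to the individual's sequence data, select the highest-scoring variant combination, and modify the reference genome according to that combination. This yields a reference genome adjusted to the individual's variants (hereafter called the "corrected reference genome").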
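As a rough illustration of building a corrected reference genome, the sketch below applies an already chosen variant set to a reference string. The tuple layout, the 0-based coordinates, the function name `apply_variants`, and the requirement that variants do not overlap are all assumptions made for this example; how Meth-EAGLE performs this step internally is not specified here.

```python
from typing import List, Tuple

# (0-based position, reference allele, alternative allele); layout is hypothetical.
Variant = Tuple[int, str, str]


def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Apply non-overlapping variants to a reference sequence."""
    seq = reference
    # Work from right to left so that indels do not shift pending coordinates.
    for pos, ref, alt in sorted(variants, reverse=True):
        if seq[pos:pos + len(ref)] != ref:
            raise ValueError(f"reference mismatch at {pos}: expected {ref}")
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq


reference = "ACGTACGT"
chosen = [(2, "G", "A"), (5, "C", "CTT")]  # one substitution, one insertion
corrected = apply_variants(reference, chosen)
print(corrected)
```

Note that the substitution at position 2 removes a CpG site from the sequence, which is one reason to correct the reference before looking for candidate methylation sites.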
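For this genome we make two hypotheses and produce two modified reference genome files, which handle the forward-strand and reverse-strand cases of bisulfite sequencing, respectively (explained in detail in Section 2-6 of the thesis).
In the second part, we generate a set of variant data from the corrected reference genome. Because most methylation occurs at CpG sites, we treat the conversion of every CpG cytosine (C) in the corrected reference genome to thymine (T) as custom variant data, and likewise the conversion of every CpG guanine (G) to adenine (A). Together, these two sets form the custom variant data of the second part (hereafter called the "deamination sets").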
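The construction of the deamination sets can be sketched as follows. The record layout (1-based position, reference base, alternative base) mimics a VCF-style substitution but is an assumption of this example, as is the function name `deamination_sets`; the thesis software may represent these sets differently.

```python
from typing import List, Tuple

# (1-based position, reference base, alternative base); layout is hypothetical.
Record = Tuple[int, str, str]


def deamination_sets(corrected_ref: str) -> Tuple[List[Record], List[Record]]:
    """List candidate deaminations at every CpG of the corrected reference.

    forward: C -> T at the C of each CpG (forward-strand cytosine)
    reverse: G -> A at the G of each CpG (reverse-strand cytosine, seen on the forward strand)
    """
    seq = corrected_ref.upper()
    forward: List[Record] = []
    reverse: List[Record] = []
    for i in range(len(seq) - 1):
        if seq[i] == "C" and seq[i + 1] == "G":
            forward.append((i + 1, "C", "T"))
            reverse.append((i + 2, "G", "A"))
    return forward, reverse


fwd, rev = deamination_sets("ACGTTCGCGA")
print(fwd)
print(rev)
```

Each record can then be scored like an ordinary substitution variant: strong support for the substitution suggests the site is unmethylated.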
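In the final part, the corrected reference genome from the first part, the deamination sets from the second part, and the individual's original sequence data are fed into the Meth-EAGLE probability model. The result is the support for each deamination, evaluated against a reference genome that has already been corrected for the individual's likely variants. The higher the likelihood of deamination at a given position in the individual's sequence data, the more that position tends to be unmethylated.
With this design, users are not limited to using EAGLE to infer the likelihood of an individual's variants. With Meth-EAGLE, they can also supply sequence data obtained by bisulfite treatment followed by PCR and, while inferring the likelihood of the individual's variants, obtain an estimate of the likelihood of methylation as well.
In our Meth-EAGLE experiments, the reference genome was almost always correctly replaced by a corrected reference genome matching the variants in the individual's data. With the deamination sets we designed, Meth-EAGLE's analysis of methylation sites also agreed with the actual methylation status of the data. Detailed descriptions of these results can be found in the experimental sections of the thesis.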
Background: DNA methylation is a prominent topic in epigenetics. DNA methylation typically acts to repress gene transcription and thereby suppress biosynthesis of the corresponding protein, which is called gene silencing. In mammals, DNA methylation is essential for normal development and is associated with a number of key processes including genomic imprinting, repression of transposable elements, aging, and carcinogenesis. In humans, around 70%-80% of CpG dinucleotides are methylated. The study of DNA methylation has therefore attracted considerable attention.
Bisulfite sequencing is an experimental technique in which genomic DNA is first treated with bisulfite and then sequenced. Treatment of DNA with bisulfite converts unmethylated cytosine residues to uracil, which (after amplification by PCR) is read as thymine, but leaves 5-methylcytosine residues unaffected. Thus, in principle, the methylation state of individual genomic cytosines can be inferred by comparing the bisulfite-treated DNA sequences to the human genome sequence. In practice, though, this inference must overcome complications including ambiguous read mapping to repetitive regions of the genome and differences between individual genomes and the reference genome.
EAGLE (Explicit Alternative Genome Likelihood Evaluator) [1] is a method for evaluating the degree to which sequencing data support a given candidate genome variant. It employs an explicit probability model to handle read mapping uncertainty in a well-principled way. Given DNA sequencing data R and a hypothetical genome sequence G, EAGLE can compute the likelihood of G given R.
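One simple way to write down such a likelihood, offered here only as a sketch, is shown below. The notation (reads r_1, ..., r_n in R, hypotheses G_0 and G_1) and the assumption that reads are conditionally independent given G are ours and are not taken from the thesis or from the EAGLE paper, whose model additionally accounts for read mapping uncertainty.

```latex
\[
  \mathcal{L}(G \mid R) \;=\; P(R \mid G) \;=\; \prod_{i=1}^{n} P(r_i \mid G)
\]
\[
  \log \frac{P(R \mid G_1)}{P(R \mid G_0)}
  \;=\; \sum_{i=1}^{n} \bigl[\, \log P(r_i \mid G_1) - \log P(r_i \mid G_0) \,\bigr]
\]
```

Under these assumptions, a positive log ratio means the reads are better explained by the hypothesis G_1 than by G_0.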
Method: Here, we extend the EAGLE method to include the deamination of unmethylated cytosines when modeling the sequencing data R, and to include information on which cytosines are methylated in the genome sequence hypothesis G. As before, EAGLE can output the likelihood of G given R.
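To make the idea of modeling deamination concrete, here is a toy Python sketch for a single forward-strand cytosine. It is not EAGLE's actual model: it ignores read mapping uncertainty, assumes complete bisulfite conversion, and spreads sequencing error uniformly over the three other bases. All function names and the example pileup are hypothetical.

```python
import math
from typing import List, Tuple


def phred_to_error(q: int) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def base_prob(observed: str, expected: str, q: int) -> float:
    """P(observed base | expected base), with error spread over the 3 other bases."""
    e = phred_to_error(q)
    return 1 - e if observed == expected else e / 3


def log_likelihood(pileup: List[Tuple[str, int]], methylated: bool) -> float:
    """Log-likelihood of the read bases at one cytosine under a methylation hypothesis.

    Assumption: a methylated C is read as C, an unmethylated C is read as T.
    """
    expected = "C" if methylated else "T"
    return sum(math.log(base_prob(b, expected, q)) for b, q in pileup)


# Hypothetical pileup of (read base, Phred quality) pairs covering one CpG cytosine.
pileup = [("T", 30), ("T", 30), ("C", 20), ("T", 35)]
ll_meth = log_likelihood(pileup, methylated=True)
ll_unmeth = log_likelihood(pileup, methylated=False)
log_ratio = ll_unmeth - ll_meth
print(log_ratio)
```

A positive `log_ratio` favors the unmethylated hypothesis for this site, and a negative one favors the methylated hypothesis.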
We modified the EAGLE software to implement this model. Our design can be divided into three parts.
In the first part, we choose the highest-likelihood variant set from EAGLE and use it to modify the reference sequence. For convenience, we call the modified reference sequence the corrected reference sequence. We then form two hypotheses from the corrected reference sequence to handle the forward-strand and reverse-strand cases of bisulfite sequencing. In the second part, because almost all methylation occurs at CpG sites, we generate two kinds of sets from the corrected reference sequence by recording all forward-strand cytosines in CpGs and then all reverse-strand cytosines (i.e., the G) of CpGs. We call these sets of potentially deaminated nucleotides deamination sets. In the final part, we use the corrected reference sequence, the deamination sets, and the original Binary Alignment/Map file as the three inputs to EAGLE. After evaluation, we obtain the methylation probability of each candidate nucleobase without the bias that variants might otherwise cause, because that bias has already been removed in the first part of our design.
Results: With Meth-EAGLE, users can obtain not only the degree to which sequencing data support a given candidate genome variant from ordinary sequencing data, but also, by supplying bisulfite-treated sequencing data, the degree of methylation of each potentially methylated nucleobase.
We conducted simulation-based experiments to get an idea of how well this method should work. In our experiments, the inferred corrected reference sequence was almost always correct. With the deamination sets, the inference of methylated versus unmethylated cytosines is also in line with the methylation call file produced by the simulator.
We conclude that the method is a promising way to simultaneously call methylation and genome variants from bisulfite sequencing data. As future work, more extensive benchmarking will be needed to ascertain whether our implementation is ready to contribute as a practical tool.
[1] Kuo, T., Frith, M. C., Sese, J., & Horton, P. (2018). EAGLE: Explicit Alternative Genome Likelihood Evaluator. BMC Medical Genomics, 11, 28. https://doi.org/10.1186/s12920-018-0342-1
[2] Adrian Bird, The essentials of DNA methylation, Cell, Volume 70, Issue 1, 1992, Pages 5-8, ISSN 0092-8674, https://doi.org/10.1016/0092-8674(92)90526-I.
[3] Robertson KD. DNA methylation and human disease. Nat Rev Genet. 2005 Aug;6(8):597-610. doi: 10.1038/nrg1655. PMID: 16136652.
[4] Jin Z, Liu Y. DNA methylation in human diseases. Genes Dis. 2018;5(1):1-8. Published 2018 Jan 31. doi:10.1016/j.gendis.2018.01.002
[5] Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics. 2011 Jun 1;27(11):1571-2. doi: 10.1093/bioinformatics/btr167. Epub 2011 Apr 14. PMID: 21493656; PMCID: PMC3102221.
[6] Lynch J. PCR Technology: Principles and Applications for DNA Amplification. J Med Genet. 1990;27(8):536.
[7] Chatterjee A, Stockwell PA, Rodger EJ, Morison IM. Comparison of alignment software for genome-wide bisulphite sequence data. Nucleic Acids Res. 2012 May;40(10):e79. doi: 10.1093/nar/gks150. Epub 2012 Feb 16. PMID: 22344695; PMCID: PMC3378906.
[8] Fraga MF, Esteller M. DNA methylation: a profile of methods and applications. Biotechniques. 2002 Sep;33(3):632, 634, 636-49. doi: 10.2144/02333rv01. PMID: 12238773.
[9] Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010 Apr;38(6):1767-71. doi: 10.1093/nar/gkp1137. Epub 2009 Dec 16. PMID: 20015970; PMCID: PMC2847217.
[10] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PMID: 19505943; PMCID: PMC2723002.
[11] Jabbari K, Bernardi G. Cytosine methylation and CpG, TpG (CpA) and TpA frequencies. Gene. 2004 May 26;333:143-9. doi: 10.1016/j.gene.2004.02.043. PMID: 15177689.
[12] Pedersen, B.S., Eyring, K.R., De, S., Yang, I., & Schwartz, D. (2014). Fast and accurate alignment of long bisulfite-seq reads. arXiv: Genomics.
[13] G. Piaggeschi, N. Licheri, G. Romano, S. Pernice, L. Follia and G. Ferrero, "MethylFASTQ: A Tool Simulating Bisulfite Sequencing Data," 2019 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2019, pp. 334-339, doi: 10.1109/EMPDP.2019.8671567.
[14] Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv: Genomics.