| Graduate student: | 張恆霖 Chang, Heng-Lin |
|---|---|
| Thesis title: | 搜尋最適合亞硫酸氫鹽測序數據之基因組變異與胞嘧啶甲基化狀態的組合 —基於EAGLE 變異點評估器程式碼庫之案例研究 Searching for combinations of genome variants and cytosine methylation state which best fit bisulfite sequencing data — a case study using the EAGLE variant evaluator program code base |
| Advisor: | 賀保羅 Horton, Paul |
| Degree: | 碩士 Master |
| Department: | 電機資訊學院 - 資訊工程學系 Department of Computer Science and Information Engineering |
| Year of publication: | 2022 |
| Academic year of graduation: | 110 |
| Language: | English |
| Number of pages: | 33 |
| Chinese keywords: | 基因組、甲基化、亞硫酸鹽定序、機率模型 |
| Foreign-language keywords: | Genomics, Methylation, Bisulfite Sequencing, Probability Model |
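In recent years, DNA methylation has become an important topic in epigenetics research. DNA methylation is a form of chemical modification of DNA that alters the genetic expression a sequence would otherwise produce without changing the sequence itself; this is the so-called epigenetic code. DNA methylation lowers the transcription rate of the affected region of a gene, which may inhibit production of the associated proteins. This phenomenon, known as gene silencing, can render the gene non-functional and plays a key role in development and cancer. In human cells, about 1% of bases are methylated. In mature cells, DNA methylation usually occurs at CpG dinucleotides; about 80% of the CpG sites in human genes may be methylated, but certain regions with dense cytosine and guanine content, the so-called CpG islands, remain unmethylated. These islands are associated with the promoters of 50% of mammalian genes, including all widely expressed genes. Studying the degree of methylation in mammals is therefore highly relevant to investigating disease and specific physiological traits.

In our study, following the approach of EAGLE, we extend the generative probability model that EAGLE uses to evaluate genome variants so that it assesses the likelihood that cytosine sites in an organism's genome sequence are methylated. Imitating EAGLE, we posit two hypothetical sequences: a hypothetical reference genome sequence (Reference sequence) and a hypothetical alternative genome sequence (Alternative sequence). First, we score the likelihood of each variant in the variant data and use the sites with a higher variant likelihood to modify the reference genome sequence, yielding a reference genome in which the probable variant sites have been substituted. Next, we take the sequence in which all CpG sites are assumed to be methylated as the reference genome, add the sequences in which each site is individually assumed to be methylated to the generative probability model as alternative genome sequences, and use a greedy algorithm to compute the highest likelihood score of methylation for each CpG site. Because our data come from bisulfite sequencing, the calculation must account for whether the sequence information originates from the forward or the reverse strand. In the output, the higher the likelihood of the deamination combination at a site in the individual's sequence, the more strongly that site tends to be methylated in that individual.

In the EAGLE-METH design, users provide data from an individual after bisulfite sequencing and polymerase chain reaction in order to infer the likelihood of methylation at CpG sites. In our experiments, we used our generative probability model to select the highest likelihood score of methylation for each CpG site and compared the results with simulated methylation data. We could infer most of the simulated methylated CpG sites, and the likelihood scores that EAGLE-METH assigned to methylated CpG sites were also largely consistent with the actual data.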
DNA methylation is a form of chemical modification of DNA that affects gene expression without changing the sequence. It is part of the so-called epigenetic code and is one of the epigenetic mechanisms. DNA methylation attaches a methyl group to the 5-carbon position of the cytosine ring, and methylation at this position of cytosine is found in all vertebrates.
DNA methylation lowers the transcription rate of the affected region of a gene, which may inhibit production of the associated proteins. This phenomenon, known as gene silencing, can render the gene non-functional and plays a key role in development and cancer. In human cells, about 1% of the bases are methylated. In mature cells, DNA methylation usually occurs at CpG dinucleotides, and about 80% of the CpG sites in human genes may be methylated. Some regions are not methylated, however, such as regions dense in cytosine and guanine, which are called CpG islands. CpG islands are associated with the promoters of 56% of mammalian genes, including all widely expressed genes. The study of methylation in mammals is therefore highly relevant to the study of diseases and specific physiological phenotypes.
In our study, we extended the EAGLE (Explicit Alternative Genome Likelihood Evaluator) model, which calculates the probability of variation in an organism, to evaluate the probability of methylation at cytosine sites in the genome sequence of an organism. Our approach emulates EAGLE by positing two hypothetical sequences: a hypothetical Reference sequence and a hypothetical Alternative sequence. First, we score the likelihood of each variant in the variant data and use the sites with a higher likelihood of variation to modify the reference genome sequence. This yields a reference genome in which the probable variant sites have been substituted, which we call the modified reference. Next, we take the sequence that is assumed to be methylated at all CpG sites as the reference genome, add the sequences that are assumed to be methylated at each individual site to the generative probability model as alternative genome sequences, and calculate the highest probability score of methylation at each CpG site with a greedy algorithm. Because our data come from bisulfite sequencing, the calculation must take into account whether the sequence information originates from the forward or the reverse strand. In the output, the higher the probability of the deamination combination at a site of the sequence, the higher the degree of methylation of the individual's sequence at that site. In the EAGLE-METH design framework, users can infer the likelihood of CpG site methylation by providing data from an individual after bisulfite sequencing and polymerase chain reaction.
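The search described above can be illustrated with a minimal Python sketch. It is not the EAGLE-METH implementation: the abstract does not specify the exact likelihood model or greedy procedure, so the sketch assumes a simple independent per-base error model, reads given in forward-strand coordinates with a known start and strand, symmetric methylation at CpG sites only, and a hill-climbing search that starts from the all-methylated state and flips one site at a time. All function names (`cpg_sites`, `bisulfite_convert`, `log_likelihood`, `greedy_methylation`) and the `error` parameter are hypothetical.

```python
import math


def cpg_sites(seq):
    """0-based positions of the C in every CpG dinucleotide of seq."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def bisulfite_convert(seq, methylated, strand="+"):
    """Expected read template after bisulfite treatment and PCR.

    seq is the forward-strand sequence; methylated is a set of CpG
    positions (as returned by cpg_sites) assumed to be methylated.
    Forward-strand reads show unmethylated C as T. Reverse-strand
    reads, written in forward coordinates, show G as A unless the G
    belongs to a methylated CpG.
    """
    out = []
    for i, base in enumerate(seq):
        if strand == "+" and base == "C":
            out.append("C" if i in methylated else "T")
        elif strand == "-" and base == "G":
            out.append("G" if (i - 1) in methylated else "A")
        else:
            out.append(base)
    return "".join(out)


def log_likelihood(seq, methylated, reads, error=0.01):
    """Log probability of the reads under an assumed per-base error model."""
    templates = {s: bisulfite_convert(seq, methylated, s) for s in "+-"}
    total = 0.0
    for start, bases, strand in reads:
        template = templates[strand]
        for offset, base in enumerate(bases):
            match = template[start + offset] == base
            total += math.log(1 - error if match else error / 3)
    return total


def greedy_methylation(seq, reads, error=0.01):
    """Flip the single best CpG site until the likelihood stops improving."""
    sites = cpg_sites(seq)
    state = set(sites)  # start from the all-methylated hypothesis
    best = log_likelihood(seq, state, reads, error)
    while sites:
        score, site = max(
            (log_likelihood(seq, state ^ {site}, reads, error), site)
            for site in sites
        )
        if score <= best:
            break
        state ^= {site}
        best = score
    return state, best


# Toy data: the CpG at position 1 is methylated, the CpG at position 5 is not.
seq = "ACGTTCGA"
reads = [(0, "ACGTTTGA", "+"), (0, "ACGTTCAA", "-")]
state, score = greedy_methylation(seq, reads)
print(sorted(state))  # → [1]
```

Each accepted flip strictly increases the likelihood over a finite set of states, so the loop terminates, but like any greedy search it may stop at a local optimum. A realistic model would also account for base qualities, alignment uncertainty, and incomplete bisulfite conversion.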
In our experiments, we used our probability model to select the highest probability score for each CpG site and compared the results with simulated methylation data. We could predict most of the simulated methylated CpG sites, and the probability scores that EAGLE-METH gave to methylated CpG sites were mostly consistent with the actual data.