
Graduate student: 廖鈺茹
Liao, Yu-Ru
Thesis title: DNA 拷貝數及RNA 表現量間的關係建模-以大腸直腸癌患者為例
Modeling Association between DNA Copy Number and RNA Expressions on Colon Adenocarcinoma Patients
Advisor: 鄭順林
Jeng, Shuen-Lin
Degree: Master
Department: College of Management - Department of Statistics
Year of publication: 2019
Academic year of graduation: 107 (ROC calendar)
Language: English
Number of pages: 105
Keywords: COAD, MIC, dCor, MARS, DNA copy number, RNA, association

    The association between DNA copy number (CN) and RNA expression is an important issue in cancer studies. Critical genes with a strong DNA-RNA association may serve as therapeutic targets. In this study, we explore several methods for identifying genes with a strong DNA-RNA association in specific groups of cancer patients. We analyze data from patients with colon adenocarcinoma (COAD) downloaded from the Genomic Data Commons (GDC). We use the toolset 'Bedtools' and the annotation 'non-overlap GRCh38' to establish a new method for calculating read counts, which addresses the problem that the exons of different transcripts may overlap. After obtaining the read counts, we use the maximal information coefficient (MIC), distance correlation (dCor), and multivariate adaptive regression splines (MARS) to find the two-dimensional association between DNA copy number and RNA expression. The innovative algorithm proposed in this study, A2dMIC, is able to calculate the three-dimensional association between DNA copy number and RNA expression across genes.
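As a rough illustration of why a dependence measure such as dCor is used alongside linear correlation, the sketch below computes the standard sample distance correlation (Székely et al., 2007) in plain Python. It is not the thesis code: the function names, the synthetic `copy_number` values, and the quadratic `expression` relationship are all assumptions made for this example.

```python
import math


def _centered_distances(values):
    """Double-centred matrix of pairwise absolute distances for a 1-D sample."""
    n = len(values)
    d = [[abs(u - v) for v in values] for u in values]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [[d[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def _mean_product(a, b):
    """Mean of the element-wise product of two n x n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation between two equal-length 1-D samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = _mean_product(a, b)  # squared sample distance covariance
    denom = math.sqrt(_mean_product(a, a) * _mean_product(b, b))
    if denom == 0:
        return 0.0  # at least one sample is constant
    return math.sqrt(max(dcov2, 0.0) / denom)


def pearson_correlation(x, y):
    """Ordinary Pearson correlation, for comparison."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: a centred copy-number signal and an expression level
# that depends on it quadratically (an assumption, not data from the thesis).
copy_number = [i / 50 - 1 for i in range(101)]
expression = [c * c for c in copy_number]

print("Pearson:", pearson_correlation(copy_number, expression))
print("dCor:   ", distance_correlation(copy_number, expression))
```

On this synthetic example the Pearson correlation is essentially zero because the relationship is symmetric, while the distance correlation stays clearly positive. That kind of nonlinear dependence is what measures such as MIC and dCor are meant to pick up.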

    摘要 (Chinese Abstract)
    Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1. Introduction
    1.1. Background
    1.2. Motivated
    1.2.1. Motivated
    1.2.2. Motivated Dataset
    1.2.3. Flow Chart
    Chapter 2. Literature Review
    2.1. Tools for Read Counts Calculation
    2.2. Association
    2.3. DNA copy number
    Chapter 3. Methods of Association Calculation
    3.1. Maximum Information Coefficient
    3.2. Distance Correlation
    3.3. Multivariate Adaptive Regression Spline
    3.4. Adaptive-MIC
    3.4.1. The Algorithm
    3.5. Methods for High Dimensional Association
    3.5.1. MIC
    3.5.2. dCor
    3.5.3. MARS
    3.5.4. Adaptive-MIC
    Chapter 4. Proposed Method
    4.1. A2dMIC
    4.1.1. Case Study 1
    4.1.2. Generality, Equitability and Effectiveness of A2dMIC
    4.1.3. Case Study 2
    4.1.4. Case Study 3
    4.2. Splitting Overlapped Exons
    4.2.1. Original GRCh38
    4.2.2. Splitted GRCh38
    4.2.3. Calculating read counts
    4.3. Estimating RNA-Seq Data
    4.3.1. Counts Per Million for RNA-Seq data
    4.3.2. Model Building and Prediction
    Chapter 5. COAD Patients Analysis
    5.1. Raw Data Format: bam and bai
    5.2. Finding DNA Copy Number and RNA Expression
    5.2.1. Finding DNA Copy Number
    5.2.2. Finding RNA Expression
    5.3. Identifying Associations
    5.3.1. Two Dimensional Association
    5.3.2. High Dimension Association
    Chapter 6. Summary and Future Research
    6.1. Summary
    6.2. Future Research
    References


    Full-text availability: on campus, public from 2024-08-01; off campus, public from 2024-08-01