
Graduate student: 陳佳芬 (Chen, Chia-Fen)
Thesis title: 基因序列長度之估計 (The estimation of the DNA sequence length)
Advisor: 馬瀰嘉 (Ma, Mi-Chia)
Degree: Master
Department: College of Management - Department of Statistics
Year of publication: 2012
Academic year of graduation: 100 (ROC calendar)
Language: English
Number of pages: 75
Keywords: sample coverage (樣本涵蓋率), read mapping (序列比對)
  • Because a genome contains a huge number of nucleotides, current DNA sequencing technology cannot read a whole genome in one pass. Instead, a sequencing machine breaks the DNA into millions of random fragments called “reads”, and base calling assigns a base to each position of each read. A computer program then pieces the overlapping reads together and reconstructs the original sequence. In recent years bioinformatics researchers have sequenced and assembled the genomes of a growing number of species, and many authors have discussed how to estimate genome length and have developed assembly software. This thesis estimates the length of the original DNA sequence from overlapping-read data by applying the sample coverage method from ecological statistics. To check whether the proposed method recovers the true length, we use species whose sequence length is known and obtain the overlapping-read data by read mapping. The accuracy of the proposed method is assessed through real examples. We further investigate how many reads are required for the estimate to reach the expected number of base pairs with a pre-specified accuracy.
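    The record does not spell out the thesis's estimator, so the following is only a minimal sketch of the general sample coverage idea, not the author's method. It assumes the Good-Turing coverage estimate and the coverage-based estimator of the number of classes in the style of Chao and Lee (1992). The function names and the input format (one count per observed class, for example how many mapped reads hit each observed position) are hypothetical.

```python
from collections import Counter


def sample_coverage(counts):
    """Good-Turing estimate of sample coverage: 1 - f1 / n.

    counts: one positive integer per observed class (assumed input format).
    f1 is the number of classes seen exactly once, n the total number of
    observations.
    """
    n = sum(counts)
    f1 = sum(1 for c in counts if c == 1)
    return 1.0 - f1 / n


def coverage_based_estimate(counts):
    """Chao-Lee-style estimate of the total number of classes.

    N_hat = D / C_hat + n * (1 - C_hat) / C_hat * gamma2, where D is the
    number of observed classes and gamma2 is the estimated squared
    coefficient of variation of the class probabilities.
    """
    counts = [c for c in counts if c > 0]
    n = sum(counts)
    d = len(counts)
    if n < 2:
        raise ValueError("need at least two observations")
    c_hat = sample_coverage(counts)
    if c_hat <= 0.0:
        raise ValueError("every class is a singleton; coverage estimate is 0")
    n0 = d / c_hat  # estimate under equal class probabilities
    freq = Counter(counts)  # freq[k] = number of classes observed k times
    s = sum(k * (k - 1) * fk for k, fk in freq.items())
    gamma2 = max(n0 * s / (n * (n - 1)) - 1.0, 0.0)
    return n0 + n * (1.0 - c_hat) / c_hat * gamma2
```

    When no class is observed exactly once, the estimated coverage is 1 and the estimate reduces to the number of observed classes. More singletons lower the coverage and push the estimate above the observed count.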

    Chapter 1 Introduction
    Chapter 2 Literature Review
      2.1 Read Mapping
      2.2 Sample Coverage
    Chapter 3 Proposed Method
      3.1 Data preprocessing
      3.2 The estimation of DNA sequence length
    Chapter 4 Simulation Study
      4.1 Real example study
        4.1.1 The number of the reads for mapping
        4.1.2 Goodness of fit
        4.1.3 Estimation for the length of DNA sequence
      4.2 Simulation study
        4.2.1 Simulation design
        4.2.2 The performance of the estimators of the variance
        4.2.3 The performance of the estimator of the variance (independent assumption)
    Chapter 5 Conclusion and Discussion
    Reference
    Appendix A
    Appendix B


    Full text available on campus from 2013-07-19 and off campus from 2014-07-19.