
Graduate student: Chan, Soa-Yu (陳秀瑜)
Thesis title: Estimation of degree of dissimilarity between DNA sequence Using Hellinger distance (用Hellinger距離估計基因序列間的非相似性)
Advisor: Wu, Tiee-Jian (吳鐵肩)
Degree: Master
Department: College of Management - Department of Statistics
Year of publication: 2006
Academic year of graduation: 94
Language: English
Number of pages: 58
Keywords (translated from Chinese): gene sequence, dissimilarity measure, mutation
Keywords (English): Dissimilarity measures, DNA sequences, symmetric Kullback-Leibler discrepancy, Hellinger distance, Mutation
  • In molecular biology, quantifying the similarity between two biological sequences is an important problem, and several measures of DNA sequence dissimilarity have been proposed in the past. The purpose of this thesis is twofold:

    (1) to compare, by large-scale computer simulation, the performance of the Hellinger distance (HD) and the symmetric Kullback-Leibler discrepancy (SK-LD);
    (2) to compare, on real examples, the actual CPU time and memory space required on a PC to compute HD and SK-LD.

    The simulation study and the real data analysis show that:

    (1) the sensitivity of HD to DNA sequence similarity is almost as good as that of SK-LD;
    (2) the computational efficiency of HD, in terms of CPU time and memory space required, is clearly better than that of SK-LD.
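The two measures named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the common textbook definitions applied to relative frequencies of overlapping k-letter words, namely HD = sqrt(0.5 * sum (sqrt p - sqrt q)^2) and SK-LD = sum (p - q) * log(p / q). The function names, the `pseudocount` parameter (added here so that SK-LD stays finite when a word is absent from one sequence), and the example sequences are all assumptions made for illustration; the thesis's window sizes, word sizes, and normalization may differ.

```python
from collections import Counter
from itertools import product
from math import log, sqrt


def word_frequencies(seq, k, pseudocount=0.0):
    """Relative frequencies of all 4**k overlapping k-letter words in seq.

    pseudocount is a hypothetical smoothing parameter, not from the thesis.
    Words containing letters other than A, C, G, T are ignored.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(w) for w in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return {w: (counts[w] + pseudocount) / total for w in words}


def hellinger_distance(p, q):
    """Hellinger distance between two word-frequency dicts (range 0 to 1)."""
    return sqrt(0.5 * sum((sqrt(p[w]) - sqrt(q[w])) ** 2 for w in p))


def symmetric_kl(p, q):
    """KL(p||q) + KL(q||p); every frequency must be strictly positive."""
    return sum((p[w] - q[w]) * log(p[w] / q[w]) for w in p)


if __name__ == "__main__":
    # Two made-up sequences differing by one substitution.
    a = "ACGTACGTACGTACGT"
    b = "ACGTACGAACGTACGT"
    pa = word_frequencies(a, 2, pseudocount=0.5)
    pb = word_frequencies(b, 2, pseudocount=0.5)
    print(hellinger_distance(pa, pb), symmetric_kl(pa, pb))
```

Under these definitions HD needs only square roots and is defined even when a word has zero frequency, whereas SK-LD needs logarithms and some handling of zero frequencies, which is one plausible reason for a difference in computational cost.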

    Chapter 1 Introduction
      1.1 DNA Sequence
      1.2 Mutations in the DNA
      1.3 Literature Review
    Chapter 2 Dissimilarity Measures
    Chapter 3 Methodology
      3.1 Simulation Design
      3.2 Sensitivity of Dissimilarity Measures
      3.3 Performances of Dissimilarity Measures
      3.4 Estimation of the Degree of Dissimilarity between Two DNA Sequences
    Chapter 4 Real Data Analysis
      4.1 Experiment #1
      4.2 Experiment #2
      4.3 Experiment #3
      4.4 Experiment #4
    Chapter 5 Conclusions
    References
    Appendix

    List of Tables
      Table 1.1 The number of base pairs and DNA sequences from the year 1982 to 2005
      Table 3.1 Example of calculating rank penalty scores using HD or SK-LD as a dissimilarity search tool
      Table 3.2 The optimal word size of HD for DNA sequence comparison using window size l
      Table 4.1 Score matrix among thrA-thrC using dissimilarity measure HD (upper triangle) and similarity measure BLAST (lower triangle) at the default search parameter setting
      Table 4.2 The estimated degrees of dissimilarity and SK-LD between HSLIPAS and 39 library sequences using HD and SK-LD, respectively, are sorted from the highest to lowest similarity
      Table 4.3 The sensitivity of HD and SK-LD to segment shuffling using the permuted versions of the SARS sequence g1 and the 14 SARS sequences
      Table 4.4 Comparison of average rank of BLAST scores of probes to non-target genes in designing a 70-mer oligo probe for each gene from T7 phage genome
      Appendix
      Table A1 Estimation of the degree of dissimilarity using Hellinger distance , = window size and = optimal word size
      Table A2 Query (HSLIPAS) and test dataset of 63 sequences

    List of Figures
      Figure 1.1 Growth of GenBank Data from 1982 to 2005
      Figure 3.1 Data structure for section 3.2-3.3
      Figure 3.2 Relation between the sample mean of 5000 HD scores and mutation rate
      Figure 3.3 Gives the relations between and the logarithm of sample standard deviation of 5000 HD scores at window size 900 and
      Figure 3.4 Over 5000 comparisons: (a)~(f) among different similarity/dissimilarity measures at window size 100, 250, 400, 600, 1200 and 3000
      Figure 3.5 HD measures over 5000 comparisons: among different window sizes for the word-based measure
      Figure 4.1 Hierarchical dendrogram (using complete linkage) of thrA, thrB, thrC and rand sequences on the basis of the matrices of dissimilarity in Table 2, using HD at the default search parameter setting

    Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K., and Watson, J.D. Molecular Biology of the Cell, 3rd edition. New York: Garland. (1994)

    Almeida, J.S., Carrico, J.A., Maretzek, A., Noble, P.A., and Fletcher, M. Analysis of genomic sequences by chaos game representation. Bioinformatics, 17, 429-437. (2001)

    Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. Basic local alignment search tool. Journal of Molecular Biology, 215, 403-410. (1990)

    Arratia, R., Gordon, L., and Waterman, M.S. The Erdős-Rényi law in distribution, for coin tossing and sequence matching. Annals of Statistics, 18, 539-570. (1990)

    Blaisdell, B.E. A measure of the similarity of sets of sequences not requiring sequence alignment. Proceedings of the National Academy of Sciences USA, 83, 5155-5159. (1986)

    Blaisdell, B.E. Average values of a dissimilarity measure not requiring sequence alignment are twice the averages of conventional mismatch counts requiring sequence alignment for a computer-generated model system. Journal of Molecular Evolution, 29, 538-549. (1989)

    Blaisdell, B.E. Effectiveness of measures requiring and not requiring prior sequence alignment for estimating the dissimilarity of natural sequences. Journal of Molecular Evolution, 29, 526-537. (1989)

    Cressie, N. and Read, T.R.C. Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society. Series B, 46, 440-464. (1984)

    Fichant, G. and Gautier, C. Statistical method for predicting protein coding regions in nucleic acid sequences. CABIOS, 3, 287-295. (1987)

    Frith, M.C., Hansen, U., Spouge, J.L., and Weng, Z. Finding functional sequence elements by multiple local alignment. Nucleic Acids Research, 32, 189-200. (2004)

    Funchs, T. From sequence to biology: the impact on bioinformatics. Bioinformatics, 18, 505-506. (2002)

    Gentleman, J.F. and Mullin, T.C. The distribution of the frequency of occurrence of nucleotide subsequences, based on their overlap capability. Biometrics, 45, 35-52. (1989)

    Hancock, J.M. and Armstrong, J.S. SIMPLE34: an improved and enhanced implementation for VAX and Sun computers of the SIMPLE algorithm for analysis of clustered repetitive motifs in nucleotide sequences. Comput. Appl. Biosci., 10, 67-70. (1994)

    Hide, W., Burke, J., and Davison, D. Biological evaluation of d2, an algorithm for high performance sequence comparison. Journal of Computational Biology, 1, 199-215. (1994)

    Hughes, T.R., Mao, M., Jones, A.R., Burchard, J., Marton, M.J., Shannon, K.W., Lefkowitz, S.M., Ziman, M., Schelter, J.M., Meyer, M.R., et al. Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nat. Biotechnol., 19, 342-347. (2001)

    Kane, M.D., Jatkoe, T.A., Strumpf, C.T., Lu, J., Thomas, J.D., Madore, S.J. Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res., 28, 4552-4557. (2000)

    Nordberg, E.K. YODA: selecting signature oligonucleotides. Bioinformatics, 21, 1365-1370. (2005)

    Pearson, W.R. Rapid and sensitive sequence comparison with FASTA and FASTP. Methods in enzymology, 183, 63-98. (1990)

    Pearson, W.R. and Lipman, D.J. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences USA, 85, 2444-2448. (1988)

    Pevzner, P.A. Nucleotide sequences versus Markov models. Computers in Chemistry, 16, 103-106. (1992)

    Pevzner, P.A. Statistical distance between texts and filtration methods in sequence comparison. CABIOS, 8, 121-127. (1992)

    Pham, T.D. and Zuegg, J. A probabilistic measure for alignment-free sequence comparison. Bioinformatics, 20, 3455-3461. (2004)

    Vinga, S. and Almeida, J. Alignment-free sequence comparison -- a review. Bioinformatics, 19, 513-524. (2003)

    Sege, T.D. and Saxberg, B.E.H. A statistical test for comparing several nucleotide sequences. Nucleic Acids Research, 10, 375-389. (1982)

    Torney, D.C., Burks, C., Davison, D., and Sirkin, K.M. Computation of d2: A measure of sequence dissimilarity. In Computers and DNA, Santa Fe Institute Studies in the Sciences of Complexity, G. Bell and T. Marr (eds), 109-125. New York: Addison-Wesley. (1990)

    Wang, X. and Seed, B. Selection of oligonucleotide probes for protein coding sequences. Bioinformatics, 19, 796-802. (2003)

    Wu, T.-J., Burke, J.P., and Davison, D.B. A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words. Biometrics, 53, 1431-1439. (1997)

    Wu, T.-J., Hsieh, Y.-C., and Li, L.-A. Statistical Measures of DNA Sequences Dissimilarity under Markov Chain Models of Base Composition. Biometrics, 57, 441-448. (2001)

    Wu, T.-J., Huang, Y.-H., and Li, L.-A. Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences. Bioinformatics, 21, 4125-4132. (2005)

    Full-text availability: on campus, open from 2007-07-28
    off campus, open from 2011-07-28