| Graduate student: | 陳秀瑜 Chan, Soa-Yu |
|---|---|
| Thesis title: | 用Hellinger距離估計基因序列間的非相似性 Estimation of degree of dissimilarity between DNA sequence Using Hellinger distance |
| Advisor: | 吳鐵肩 Wu, Tiee-Jian |
| Degree: | Master |
| Department: | College of Management - Department of Statistics |
| Year of publication: | 2006 |
| Academic year of graduation: | 94 |
| Language: | English |
| Number of pages: | 58 |
| Chinese keywords: | 基因序列 (DNA sequences), 不相似性測量 (dissimilarity measures), 突變 (mutation) |
| English keywords: | Dissimilarity measures, DNA sequences, symmetric Kullback-Leibler discrepancy, Hellinger distance, Mutation |
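In molecular biology, quantifying the similarity between biological sequences is an important problem, and several measures of DNA sequence similarity have been proposed in the past. This study has two aims:
(1) to compare the Hellinger distance (HD) and the symmetric Kullback-Leibler discrepancy (SK-LD) through large-scale computer simulation;
(2) to compare, on real examples, the CPU time and memory consumed by the HD and SK-LD methods.
From the simulation study and the real-data analysis, this thesis finds that:
(1) HD is almost as sensitive to DNA sequence similarity as SK-LD;
(2) in terms of computational efficiency, HD uses markedly less CPU time and memory than SK-LD.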
In molecular biology, quantifying the similarity between two biological sequences is an important problem, and several measures of DNA sequence dissimilarity have been developed. The purpose of this thesis is twofold. First, we use large-scale simulation to compare the performance of the Hellinger distance (HD) and the symmetric Kullback-Leibler discrepancy (SK-LD). Second, we compare the actual CPU time and memory required on a PC to compute HD and SK-LD. Our simulation study and real-data analysis show that (1) HD performs almost as well as SK-LD and (2) the computational efficiency of HD, in terms of CPU time and memory required, is significantly better than that of SK-LD.
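To make the two measures concrete, here is a minimal Python sketch that computes both on word-frequency distributions of two DNA sequences. It is an illustration under stated assumptions and not the thesis's implementation: the record does not give the exact definitions used, so the word size `k`, the additive `pseudocount` smoothing, the textbook normalization of the Hellinger distance, and the form of the symmetric Kullback-Leibler discrepancy (the sum of the two directed divergences) are all assumed, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log, sqrt

ALPHABET = "ACGT"


def word_frequencies(seq, k=3, pseudocount=0.5):
    """Smoothed relative frequencies of all 4**k overlapping k-words.

    k and pseudocount are illustrative choices, not values from the thesis.
    Words containing characters outside ACGT are ignored.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(w) for w in product(ALPHABET, repeat=k)]
    total = sum(counts[w] for w in words) + pseudocount * len(words)
    return [(counts[w] + pseudocount) / total for w in words]


def hellinger(p, q):
    """Hellinger distance, assumed normalization: sqrt(0.5 * sum (sqrt p - sqrt q)^2)."""
    return sqrt(0.5 * sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q)))


def symmetric_kl(p, q):
    """Symmetric KL discrepancy, assumed form: KL(p||q) + KL(q||p)."""
    return sum((a - b) * log(a / b) for a, b in zip(p, q))


# Two toy sequences that differ by a single substitution.
seq_a = "ACGTACGTACGTTTGACA"
seq_b = "ACGTACGAACGTTTGACA"
freq_a = word_frequencies(seq_a)
freq_b = word_frequencies(seq_b)
print("HD   :", hellinger(freq_a, freq_b))
print("SK-LD:", symmetric_kl(freq_a, freq_b))
```

In this sketch the smoothing is needed only for SK-LD, whose logarithm is undefined when a word has zero frequency in one sequence; HD is well defined on raw frequencies. The sketch does not measure CPU time or memory, so it says nothing about the efficiency comparison reported in the abstract.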