
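Graduate student: 鄭若君 (Cheng, Jo-Chun)
Thesis title: A Study of Similarity Measures of DNA Sequences under Evolution (基因序列演化後之相似性研究)
Advisors: 吳鐵肩 (Wu, Tiee-Jian); 李隆安 (Li, Lung-An)
Degree: Master
Department: College of Management - Department of Statistics
Year of publication: 2003
Academic year of graduation: 91 (ROC calendar)
Language: English
Number of pages: 92
Keywords (translated from Chinese): standardized Euclidean distance, exon, evolution model, mutation rate, recombination rate
Keywords (English): mutation rate, exon, recombination rate, standardized Euclidean distance, Kullback-Leibler discrepancy, evolution model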
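  • Evolution is a process that every species surviving today has necessarily undergone. Biological evolution involves gene mutation, genetic recombination, adaptation to changes in the external environment, and other factors. This study proceeds in the following four steps:
    (1) Human protein sequences and their corresponding gene sequences were collected from the PIR (Protein Information Resource) website, the exon information on the gene sequences was recorded, and the probability distributions of the collected information were examined.
    (2) A statistical model of human evolution was built to approximate the actual course of human evolution. The evolutionary factors considered in the model include the mutation rate, the genetic recombination rate, the death rate, and the population growth rate.
    (3) Starting from one man and one woman as the human ancestors, the statistical evolution model was used to simulate, generation by generation, two sets of 5000 offspring of the ten-thousandth generation. The standardized Euclidean distance and the Kullback-Leibler discrepancy between the gene sequences of the offspring were then computed.
    (4) The dissimilarity between the gene sequences of offspring produced by the statistical evolution model under different parameter settings was examined, and analysis of variance and logistic regression were used to assess the significance of each evolutionary factor.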
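
Steps (2) and (3) can be illustrated with a toy simulation. The sketch below is not the thesis's model: the function names, the single-point crossover, the uniform base substitution, and the capped geometric population growth are all assumptions made for illustration, and the death rate mentioned in the abstract is left out.

```python
import random

BASES = "ACGT"

def mutate(seq, rate, rng):
    """Replace each base, independently with probability `rate`, by a different base."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def recombine(seq_a, seq_b, rate, rng):
    """With probability `rate`, apply a single-point crossover (assumed mechanism).

    Both sequences are assumed to have the same length, at least 2.
    """
    if rng.random() < rate:
        cut = rng.randrange(1, len(seq_a))
        return seq_a[:cut] + seq_b[cut:]
    return seq_a

def next_generation(population, size, mutation_rate, recombination_rate, rng):
    """Create `size` children, each from two distinct randomly chosen parents."""
    children = []
    for _ in range(size):
        mother, father = rng.sample(population, 2)
        child = recombine(mother, father, recombination_rate, rng)
        children.append(mutate(child, mutation_rate, rng))
    return children

def simulate(ancestors, generations, max_size, growth,
             mutation_rate, recombination_rate, seed=0):
    """Evolve a population from two ancestor sequences (hypothetical parameters)."""
    rng = random.Random(seed)
    population = list(ancestors)
    for _ in range(generations):
        # Population grows geometrically until it reaches the cap.
        size = min(max_size, max(2, int(len(population) * growth)))
        population = next_generation(
            population, size, mutation_rate, recombination_rate, rng)
    return population

if __name__ == "__main__":
    final = simulate(["ACGT" * 10, "TTGA" * 10], generations=5, max_size=20,
                     growth=1.5, mutation_rate=0.01, recombination_rate=0.5)
    print(len(final), final[0])
```

The generation count and population cap here are deliberately tiny. The abstract's setting (5000 offspring at the ten-thousandth generation) would use the same loop with larger values.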

    Evolution is the most important process that all organisms and species have experienced. It involves mutation, recombination of genes, survival of environmental changes, and other factors. This thesis consists of four steps. First, we searched for proteins belonging to a protein family or superfamily on the PIR (Protein Information Resource) web site (http://pir.georgetown.edu). We recorded the nucleotide sequences encoding those proteins, together with the exon information for these DNA sequences, and then found the empirical distributions of this exon information. Next, we constructed an evolution model to imitate the real evolutionary history of human beings. The model includes evolution factors such as the mutation rate, the recombination rate of genes, the death rate, and the growth of the population size. In the third step, we conducted a simulation study using the new evolution model. For each of nine combinations of the levels of the evolution factors, we twice generated 5000 offspring of the ten-thousandth generation from one ancestral parent generation, and we evaluated the dissimilarity between the DNA sequences of each combination using the standardized Euclidean distance and the Kullback-Leibler discrepancy function. Last, analysis of variance (ANOVA) and logistic regression were applied to statistics of the dissimilarity measures to find the significant evolution factors, and estimates for the levels of those factors were obtained to understand their effects.
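
The two dissimilarity measures named in the abstract can be sketched on k-word frequency vectors. This is a minimal illustration in common textbook form, not the thesis's exact definitions: the function names, the choice of word length `k`, the pseudocount, and the caller-supplied per-word variances are all assumptions.

```python
import math
from itertools import product

def word_frequencies(seq, k, pseudocount=0.0):
    """Relative frequencies of all 4**k overlapping k-words, in lexicographic order.

    Assumes `seq` contains only A, C, G, T and has length at least k.
    """
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {w: pseudocount for w in words}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return [counts[w] / total for w in words]

def standardized_euclidean(f, g, variances):
    """Euclidean distance with each squared difference divided by a variance.

    `variances` holds one positive value per word; how these are estimated
    is left to the caller (an assumption of this sketch).
    """
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(f, g, variances)))

def kl_discrepancy(f, g):
    """Kullback-Leibler discrepancy sum_w f_w * log(f_w / g_w).

    Requires g_w > 0 wherever f_w > 0; a positive pseudocount guarantees this.
    """
    return sum(a * math.log(a / b) for a, b in zip(f, g) if a > 0)

if __name__ == "__main__":
    f = word_frequencies("ACGTACGTAACC", k=2, pseudocount=0.5)
    g = word_frequencies("ACGTTCGTAAGC", k=2, pseudocount=0.5)
    unit_variances = [1.0] * len(f)  # placeholder; reduces to plain Euclidean distance
    print(standardized_euclidean(f, g, unit_variances))
    print(kl_discrepancy(f, g))
```

Neither measure requires aligning the sequences, since both compare word-frequency vectors. The Kullback-Leibler discrepancy is not symmetric in its two arguments, so the order of the sequences matters.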

    Chapter 1 Introduction
        1.1 Messenger RNA
        1.2 Recombination of Genes
        1.3 Mutation in the DNA
        1.4 Dissimilarity Measures of DNA Sequence
        1.5 Outline
    Chapter 2 Distribution of the Number, Length and Position of Exons
        2.1 Data Collection
        2.2 Distribution of the Number of Exons
        2.3 Distribution of the Length of Exons
        2.4 Distribution of the Position of Exons
    Chapter 3 An Evolution Model
        3.1 Population Size per Generation
        3.2 Creation of the Filial Generation
        3.3 Factors of the Evolution Model
            3.3.1 Mutation of the Evolution Model
            3.3.2 Recombination of Genes of the Evolution Model
    Chapter 4 A Simulation Study
        4.1 The First Ancestor Sequences
        4.2 Simulation Process
            4.2.1 Population Size per Generation
            4.2.2 Parameters of the Evolution Model
            4.2.3 Dissimilarity Measures of DNA Sequence
        4.3 Simulation Result
    Chapter 5 Finding Significant Evolution Factors
        5.1 The Analysis of Variance
        5.2 The Logistic Regression Model
    Chapter 6 Conclusion
    Reference
    Appendix


    Full-text availability: on campus, available immediately; off campus, available from 2003-07-21