簡易檢索 / 詳目顯示

研究生: 楊惠娥
Yang, Hui-E
論文名稱: 人類群體基因組單體型推論問題之研究
On solving population haplotype inference problems
指導教授: 王逸琳
Wang, I-Lin
學位類別: 碩士
Master
系所名稱: 管理學院 - 資訊管理研究所
Institute of Information Management
論文出版年: 2006
畢業學年度: 94
語文別: 英文
論文頁數: 85
中文關鍵詞: 基因組單體型推論整數規劃啟發式演算法最大簡約基因型
外文關鍵詞: Genotype, Haplotype inference, Pure parsimony, Heuristic algorithm, Integer programming
相關次數: 點閱:127下載:7
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 隨著人類基因體計劃的完成,我們對人類DNA的結構及序列已有初步認識,但仍無法確知其功能。DNA的功能可能是疾病研究的關鍵,為了確定其功能,科學家們會針對人類每條染色體上稱為基因組單體型(haplotype)的基因鹼基組合來做進一步的分析。
    由於直接取得基因組單體型資料將耗費許多成本及時間,科學家們通常會使用在一對染色體上稱為基因型(genotype)的基因鹼基混合敘述資料,來代替基因組單體型資料以進行分析。然而,基因型資料所含的資訊並不足以正確地推論出每條染色體上基因的鹼基組合,因此我們必須求解人類群體基因組單體型推論(Population Haplotype Inference,PHI)問題,以從一群人的基因型資料來推論出其相對應的基因組單體型資料。在眾多的PHI 問題相關文獻中,近年來以應用最
    大簡約原則(Pure Parsimony)的PHI問題(PHI problem based on pure parsimony criterion,HIPP)之相關研究最受矚目。HIPP問題旨在使用最少的不同基因組單體型,來解讀一個給定的基因型矩陣。
    本論文特別針對HIPP 問題加以探討,在探討並整理文獻中各類求解PHI 問題之方法後,我們依照基因型資料間的相互包容或排斥的數學關係,提出兩個結合數學規劃與生物意義的啟發式演算法。其中,第一個演算法利用基因型資料間的容斥關係,將可行解空間大幅縮小,以整合的基因型資料來求解一個整數規劃問題;而第二個演算法付予那些可解讀較多基因型的基因組單體型更高的權重,並使用貪婪(greedy)的方式來選取權重較大的基因組單體型以求解HIPP問題。
    本論文並執行大規模的程式測試,來分析各類解法在求解HIPP問題時之效率及效果雙方面的表現。最後,我們提出一個可用來處理大規模(large scale)HIPP問題的技巧,將該問題切成較小規模的子問題,求解各子問題之後再將其解組合之;另外,我們亦針對文獻中一個極有效率的啟發式演算法加以改良,使之可以有效地求解出任一HIPP問題的所有最佳解。

    Due to the completion of Human Genome Project, we have known more about the structure and sequence, but not yet the function of human DNA. The function of DNA may be a key for human disease. To analyze the function of DNA, researchers have to obtain each haplotype, the genetic constitution of an individual chromosome,
    of an individual for analysis. Nevertheless, considering the significant efforts required
    in collecting haplotypes, usually the descriptions of one con°ated pair of haplotypes
    called genotypes are collected.
    Since the genotype data contains insu±cient information to identify the combination of DNA sequence in each copy of a chromosome, one has to solve the population haplotype inference (PHI) problem which infers haplotype data from genotype data for a population. Previous researchers use mathematical programming methods and heuristic algorithms to solve the population haplotype inference problem. This thesis surveys these methods and conducts computational experiments on the efficiency and effectiveness for these methods in solving a population haplotype inference problem based on pure parsimony criterion (HIPP) which seeks the minimum number of distinct haplotypes to infer a given genotype matrix.
    We propose two heuristic algorithms to solve the HIPP problem with promising performance. The first heuristic algorithm exploits the compatible relations among genotypes to solve a reduced integer linear programming problem in a smaller solution space. The second heuristic algorithm selects popular haplotypes that can resolve more genotypes in a greedy fashion. Extensive computational experiments have been
    conducted for several PHI solution methods on both the biological and simulated data.
    The results show that our proposed algorithms are effcient and effective, especially for solving cases with larger recombination rates. Finally, we give a divde-and-concur technique to solve large-scale HIPP problems. We also improve a parsimonious tree growing heuristic to obtain all the multiple optimal solutions for an HIPP problem.

    ACKNOWLEDGEMENTS : : : : : : : : : : : : : : : : : : : : : : : : : : : i LIST OF TABLES : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : iv LIST OF FIGURES : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : vi CHAPTER I. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Background and motivation . . . . . . . . . . . . . . . . . . . . 1 1.1.1 DNA, variations and mutations . . . . . . . . . . . . 2 1.1.2 SNP, haplotype and genotype . . . . . . . . . . . . . 3 1.1.3 Haplotype assembly problem . . . . . . . . . . . . . . 6 1.1.4 Population haplotype inference problem . . . . . . . . 8 1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.3 Overview of thesis . . . . . . . . . . . . . . . . . . . . . . . . . 10 II. SOLUTION METHODS FOR THE PHI PROBLEM . . . . . . 12 2.1 Statistical methods . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.1.1 EM algorithm . . . . . . . . . . . . . . . . . . . . . . 12 2.1.2 Bayesian methods . . . . . . . . . . . . . . . . . . . . 14 2.2 Combinatorial methods . . . . . . . . . . . . . . . . . . . . . . 15 2.2.1 Clark's inference rule . . . . . . . . . . . . . . . . . . 15 2.2.2 Perfect phylogeny . . . . . . . . . . . . . . . . . . . . 18 2.2.3 Pure parsimony . . . . . . . . . . . . . . . . . . . . . 20 2.2.3.1 Integer linear programming . . . . . . . . 21 2.2.3.2 Quadratic integer programming approach 27 2.2.3.3 Branch-and-bound algorithm . . . . . . . 29 2.2.3.4 Heuristic algorithm . . . . . . . . . . . . . 30 2.2.3.5 Graph theory . . . . . . . . . . . . . . . . 34 2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 III. TWO HEURISTIC ALGORITHMS FOR SOLVING THE PHI PROBLEM BASED ON PURE PARSIMONY . . . . . . . . . . 40 ii 3.1 Compatible genotype pairs . . . . . . . . . . . . . . . . . . . . 40 3.2 The method of merged genotype pairs . . . . . . . . . . . . . . 42 3.3 The greedy heuristic inference method . . . . . . . . . . . . . . 48 3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 IV. COMPUTATIONAL EXPERIMENTS AND ANALYSES . . . 54 4.1 Settings for our computational experiments . . . . . . . . . . . 54 4.2 Experiments on ¯2AR gene . . . . . . . . . . . . . . . . . . . . 55 4.3 Experiments on simulated data . . . . . . . . . . . . . . . . . . 56 4.4 Experiments on GHI using other weight mechanisms . . . . . . 62 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 V. RELATED ISSUES IN SOLVING THE HIPP PROBLEM . . 68 5.1 Techniques for solving large-scale HIPP problems . . . . . . . . 68 5.2 Improving the PTG heuristic algorithm . . . . . . . . . . . . . 70 5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 VI. CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 REFERENCES : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 79 BIOGRAPHY : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : 85

    Blain, P., Holder, A., Silva, J. and Vinzant, C. 2005. Mathematical approaches to the pure parsimony problem (Tech. Rep.). Trinity University, Mathematics Technical Report S40, San Antonio.
    Brown, D. G. and Harrower, I. M. 2004. A new integer programming formulation for the pure parsimony problem in haplotype analysis. In I. Jonassen and J. Kim (Eds.), Algorithms in Bioinformatics: 4th international workshop, WABI 2004 (pp. 254-265). Springer Berlin / Heidelberg.
    Brown, D. G. and Harrower, I. M. 2005. A new formulation for haplotype inference by pure parsimony (Tech. Rep.). School of Computer Science, University of Waterloo.
    Clark, A. G. Inference of haplotypes from PCR-amplified samples of diploid populations. Molecular Biology and Evolution, 7(2), 111-122, 1990.
    Clark, A. G., Weiss, K. M., Nickerson, D. A., Taylor, S. L., Buchanan, A., Stengard, J., Salomaa, V., Vartiainen, E., Perola, M., Boerwinkle, E. and Sing, C. F. Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase. American Journal of Human Genetics, 63(2), 595-612, 1998.
    Daly, M. J., Rioux, J. D., Schaffner, S. F., Hudson, T. J. and Lander, E. S. High-resolution haplotype structure in the human genome. Nature Genetics, 29(2), 229-232, 2001.
    Davis, C. and Holder, A. 2003. Haplotyping and minimum diversity graphs (Tech. Rep.). Trinity University, Mathematics Technical Report S73, San Antonio.
    Donnelly, P. and Tavare, S. Coalescents and genealogical structure under neutrality. Annual Review of Genetics, 29(1), 401-421, 1995.
    Drysdale, C., McGraw, D., Stack, C., Stephens, J., Judson, R., Nandabalan, K., Arnold, K., Ruano, G. and Liggett, S. Complex promoter and coding region-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness. Proceedings of the National Academy of Sciences of the United States of America, 97(19), 10483-10488, 2000.
    Excoffier, L. and Slatkin, M. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Molecular Biology and Evolution, 12(5), 921-927, 1995.
    Futatsugawa, Y., Kubota, T., Ishiguro, A., Suzuki, H., Ishikawa, H. and Iga, T. PCR-based haplotype determination to distinguish CYP2B6*1/*7 and *5/*6. Clinical Chemistry, 50(8), 1472-1473, 2004.
    Gusfield, D. Inference of haplotypes from samples of diploid populations: complexity and algorithms. Journal of Computational Biology, 8(3), 305-323, 2001.
    Gusfield, D. 2002. Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. In G. Myers, S. Hannenhalli, D. Sanko? S. Istrail, P. Pevzner, and M. Waterman (Eds.), Annual Conference on Research in Computational Molecular Biology: Proceedings of the sixth annual international conference on computational biology (pp. 166-175). ACM Press New York, NY, USA.
    Gusfield, D. 2003. Haplotype inference by pure parsimony. In Combinatorial Pattern Matching: 14th Annual Symposium (pp. 144-155). Springer Berlin / Heidelberg.
    Gusfield, D. 2004. An overview of combinatorial methods for haplotype inference. In A. C. Sorin Istrail, Michael Waterman (Ed.), Computational Methods for SNPs and Haplotype Inference: DIMACS/RECOMB Satellite workshop (pp. 9-25). Springer Berlin / Heidelberg.
    Halldorsson, B. V., Bafna, V., Edwards, N., Lippert, R., Yooseph, S. and Istrail, S. 2002. A survey of computational methods for determining haplotypes. In Computational methods for SNPs and haplotype inference (pp. 26-47). Springer Berlin / Heidelberg.
    Hawley, M. E. and Kidd, K. K. HAPLO: A program using the EM algorithm to estimate the frequencies of multi-site haplotypes. Journal of heredity, 86(5), 409-411, 1995.
    Helmuth, L. Map of the human genome 3.0. Science, 293(5530), 583-585, 2001.
    Huang, Y. T., Chao, K. M. and Chen, T. An approximation algorithm for haplotype inference by maximum parsimony. Journal of Computational Biology, 12(10), 1261-1274, 2005.
    Hudson, R. R. Gene genealogies and the coalescent process. Oxford survey of evolutionary biology, 7, 1-44, 1990.
    Hudson, R. R. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18(2), 337-338, 2002.
    Kalpakis, K. and Namjoshi, P. 2005. Haplotype phasing using semidefinite programming. In Bibe '05: Proceedings of the fifth IEEE symposium on bioinformatics and bioengineering (pp. 145-152). Washington, DC, USA: IEEE Computer Society.
    Lancia, G., Bafna, V., Istrail, S., Lippert, R. and Schwartz, R. 2001. SNPs problems, algorithms and complexity. In European symposium on algorithms (ESA 01) (pp.182-193). Springer-Verlag.
    Lancia, G., Pinotti, C. M. and Rizzi, R. Haplotyping populations by pure parsimony : Complexity, exact and approximation algorithms. INFORMS Journal on Computing, 16(4), 348-359, 2004.
    Li, Z., Zhou, W., Zhang, X. and Chen, L. A parsimonious tree-grow method for haplotype inference. Bioinformatics, 21(17), 3475-3481, 2005.
    Lippert, R., Schwartz, R., Lancia, G. and Istrail, S. Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem. Briefings in Bioinformatics, 3(1), 23-31, 2002.
    Long, J. C., Williams, R. C. and Urbanek, M. An E-M algorithm and testing strategy for multiple-locus haplotypes. American journal of human genetics, 56(3), 799-810, 1995.
    Michalatos-Beloin, S., Tishko? S. A., Bentley, K. L., Kidd, K. K. and Ruano, G. Molecular haplotyping of genetic markers 10 kb apart by allele-specific long-range PCR. Nucleic Acids Research, 24(23), 4841-4843, 1996.
    Nickerson, D. A., Taylor, S. L., Weiss, K. M., Clark, A. G., Hutchinson, R. G., Stengard, J., Salomaa, V., Vartiainen, E., Boerwinkle, E. and Sing, C. F. DNA sequence diversity in a 9.7-kb region of the human lipoprotein lipase gene. Nature Genetics, 19, 233-240, 1998.
    Niu, T., Qin, Z. S., Xu, X. and Liu, J. S. Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. American Journal of Human Genetics, 70(1), 157-169, 2002.
    Rizzi, R., Bafna, V., Istrail, S. and Lancia, G. 2002. Practical algorithms and fixed-parameter tractability for the single individual SNP haplotyping problem. In R. Guig'o and D. Gusfield (Eds.), Algorithms in Bioinformatics: Second international workshop (pp. 29-43). Springer Berlin / Heidelberg.
    Ruano, G. and Kidd, K. K. Direct haplotyping of chromosomal segments from multiple heterozygotes via allele-specific PCR amplification. Nucleic Acids Research, 17(20), 8392-8392, 1989.
    Ruano, G., Kidd, K. K. and Stephens, J. C. Haplotype of multiple polymorphisms resolved by enzymatic amplification of single DNA molecules. Proceedings of the National Academy of Sciences of the United States of America, 87(16), 6296-6300, 1990.
    Stephens, M. and Donnelly, P. A comparison of bayesian methods for haplotype reconstruction from population genotype data. American Journal of Human Genetics, 73(5), 1162-1169, 2003.
    Stephens, M., Smith, N. J. and Donnelly, P. A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics, 68(3), 978-989, 2001.
    Wang, L. and Xu, Y. Haplotype inference by maximum parsimony. Bioinformatics, 19(14), 1773-1780, 2003.
    Wang, R. S., Wu, L. Y., Li, Z. P. and Zhang, X. S. Haplotype reconstruction from SNP fragments by minimum error correction. Bioinformatics, 21(10), 2456-2462, 2005.
    Zhang, X. S., Wang, R. S., Wu, L. Y. and Chen, L. Models and algorithms for haplotyping problem. Current Bioinformatics, 1(1), 105-114, 2006.

    下載圖示 校內:2008-08-25公開
    校外:2008-08-25公開
    QR CODE