| 研究生: |
馬家宜 Ma, Chia-Yi |
|---|---|
| 論文名稱: |
分別以連鎖不平衡及拉氏鬆弛法選取代表性單核苷酸多型性之研究 TagSNP Selection Problems based on Linkage Disequilibrium and Lagrangian Relaxation |
| 指導教授: |
王逸琳
Wang, I-Lin |
| 學位類別: |
碩士 Master |
| 系所名稱: |
管理學院 - 工業與資訊管理學系 Department of Industrial and Information Management |
| 論文出版年: | 2008 |
| 畢業學年度: | 96 |
| 語文別: | 中文 |
| 論文頁數: | 77 |
| 中文關鍵詞: | 連鎖不平衡 、標記型單核苷酸多型性 、基因組單體型 、演算法 、拉氏鬆弛法 |
| 外文關鍵詞: | tagSNP, Lagrangian Algorithm, Graph, Haplotype, Algorithm, Linkage Disequilibrium |
| 相關次數: | 點閱:53 下載:2 |
| 分享至: |
| 查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
在DNA序列可能發生的眾多差異性當中,單核苷酸多型性(Single Nucleotide Polymorphism,SNP)是最常發生的一種遺傳變異,由多個SNP鹼基所組成之序列稱為基因組單體形(Haplotype),此序列的改變對疾病的發生及人類特徵的顯現有重大關聯,可被應用於辨識不同疾病及其他相關之醫學研究上。由於目前已發現的SNP資料量龐大,為了節省SNP資料庫所需的高成本花費,許多研究建議以被稱為tagSNP的SNP序列資料部份集合來代表原本全部的SNP序列。
tagSNP可依其應用目的而有不同的定義,本研究首先針對文獻中最常被使用之tagSNP定義,將其選取問題(Selection Problem) 轉換成一個具有多重最佳解的0,1二元整數規劃問題,以選出可辨識出所有的Haplotype序列樣式之最小SNP部分集合。由於過去研究多著重於改善求解方法之效率,並未評估所求得之最佳解與其它最佳解間之資訊差異,因此本研究提出一個以圖型理論為基礎的啟發式演算法,先求解出所有的最佳解,再採用連鎖不平衡(Linkage Disequilibrium,LD)觀念以計算已被選取之tagSNP與尚未被選出之其它SNP間的相互關連性,作為該最佳解所包含Haplotype資訊量多寡的評估指標,並選取其中最多資訊者為最終最佳解。此外,我們亦提出一個可同時考慮極小化tagSNP個數與極大化LD值之雙目標數學規劃模式以求解類似問題。
在求解大規模的tagSNP選取問題上,本研究提出一個以拉氏鬆弛法為基礎的啟發式演算法LRH,採用次梯度(subgradient)法更新拉氏乘數(Lagrangian multiplier),使之逐漸逼近最佳解。我們亦在求解過程中加入貪婪演算法的觀念,藉由固定部份SNP欄位以逐漸縮減問題規模,改善求解速度及求解品質。此外,我們亦提出一個結合LRH與最佳化軟體CPLEX兩者優點的二階段求解方法;數值測試結果顯示該二階段解法的確可以在更短的時間內選取出品質更佳之tagSNP解。最後,本研究提出一個整數規劃模式以在具有容量限制的生物晶片上選取較可靠的tagSNP解,並呈現容量限制下限與辨識之可靠性間的關係圖以供後續研究參考。
Among all possible DNA sequence variations, Single Nucleotide Polymorphism (SNP)is the most common genetic variation. Sequence of closely linked SNPs composes the haplotype. Changes in the sequence can have significant influence on disease occurrences and the phenotype of human traits. The changes can be applied for identifying diseases and other medical research. At present, there are voluminous data of discovered SNPs. To make SNP database more cost-effective, many researchers propose tagging SNPs, a minimal subset of SNPs called tagSNP, to capture the full information of the original SNP sequence(haplotype).
Various SNP tagging methods have been proposed in the literature, based on different purpose of applications. This paper focuses on the most commonly cited definition of tagSNP and proposes methods to select them. In particular, we first seek the smallest SNP subset to identify all Haplotype patterns. The tagSNP Selection Problem can be modeled as a 0-1 binary integer programming problem with multiple optimal solutions. Previously, scholars focused on improving the efficiency of their solution methods, and ignored the differences among multiple optimal solutions. To this end, we proposes a Heuristic algorithm based on graph theory that first solves for all the multiple optimal solutions and then recommend a more informative set of optimal solution based on the concept of Linkage Disequilibrium(LD). Our method calculates the relevancy between a selected tagSNP and the other SNPs, which later serves as an indicator for assessing the amount of information it carries. Among all the multiple optimal solutions, we recommend the one with the highest calculated LD value. In addition, we also propose a bi-objective integer programming model which tries to minimize the number of tagSNPs while maximize its associated LD value at the same time.
To deal with large-scale tagSNP selection problems, we propose a heuristic called LRH, based on the theory of Lagrangian Relaxation. In particular, a modified subgradient method is proposed to update the Lagrangian multiplier which in turn approaches to the optimal solution step by step. In LRH, we incorporate the concept of Greedy Algorithm which selects some good SNP column, gradually reduces the problem size, and thus greatly improves the speed and quality of the calculated solution. The computation results show that LRH has good efficiency and effectiveness, since it can quickly converges to a better solution. Moreover, we also give a two-stage solution method(MIX) which first uses LRH to select good candidate SNP columns and then solve a reduced tagSNP selection problem of much smaller size based on the selected candidate SNPs. The computational results show that the proposed two-stage approach converges to a better solution in a much shorter time.
Finally, in a suggested future research topic, we consider the case where the selected tagSNPs is to be included in a biochip, while the capacity for the biochip is sufficiently large. In this case, we no longer have to select the SNPs of minimum size. Instead, we focus on selecting those tagSNPs that can be used to differentiate haplotypes as many as possible so that the selected SNPs are more robust or reliable.
【1】Al-Sultan, K. S., M. F. Hussain, et al. (1996). "A Genetic Algorithm for the Set Covering Problem." Journal of the Operational Research Society 47(5): 702-709.
【2】Arnheim, N., P. Calabrese, et al. (2003). "Hot and Cold Spots of Recombination in the Human Genome: the Reason We Should Find Them and How This Can Be Achieved " Am. J. Hum. Genet. 73(1): 5-16.
【3】Avi-Itzhak, H., X. Su, et al. (2003). Selection of minimum subsets of single nucleotide polymorphisms to capture haplotype block diversity. In Processding of Pacific Symposium on Biocomputing, Lihue, HI.
【4】Bafna, V., B. V. Halld´orsson, et al. (2003). Haplotypes and informative SNP selection al-gorithms: don't block out information. Proceedings of the seventh annual interna-tional conference on Research in computational molecular biology(RECOMB), Berlin, Germany, ACM.
【5】Balas, E. (1971). "Intersection cuts - A new type of cutting planes for integer program-ming." Operations Research, 19(1): 19-39.
【6】Beasley, J. E. and P. C. Chu (1996). "A genetic algorithm for the set covering problem." European Journal of Operational Research 94(2): 392-404.
【7】Caprara, A., M. Fischetti, et al. (1999). "A Heuristic Method for the Set Covering Prob-lem." Operations Research 47(5): 730-743.
【8】Carlson, C. S., M. A. Eberle, et al. (2004). "Selecting a maximally informative set of sin-gle-nucleotide polymorphisms for association analyses using linkage disequilib-rium." Am. J. Hum. Genet. 74(1): 106-120.
【9】Ceria, S., P. Nobili, et al. (1998). "A Lagrangian-based heuristic for large-scale set cover-ing problems." Mathematical Programming 81(2): 215-228.
【10】Chvatal, V. (1979). "A Greedy Heuristic for the set covering problem." Mathematics of Operations Research 4(3): 233-235.
【11】Cohon, J. L. (1978). Multiobjective Programming and Planning. New York: Academic.
【12】Devlin, B. and N. Risch (1995). "A Comparison of Linkage Disequilibrium Measures for Fine-Scale Mapping." Genomics 29(2): 311-322.
【13】Feo, T. A. and M. G. C. Resende (1989). "A Probabilidtic heuristic for a computationally difficult set covering problem." Operational Research Letters 8: 67-71.
【14】Frangioni, A. and Gallo, G. (1999). "A bundle type dual-ascent approach to linear multi-commodity min cost flow problems." INFORMS Journal on Computing 11(4): 370–393
Gabriel, S. B., S. F. Schaffner, et al. (2002). "The Structure of Haplotype Blocks in the Human Genome." Science 296(5576): 2225-2229.
【15】Garey, M. R. and D. S. Johnson (1979). Computers and Intractability: A Guide to the The-ory of NP-Completeness. New York, W. H. Freeman.
【16】Gomory, P. C. G. a. R. E. (1961). "A Linear Programming Approach to the Cutting Stock Problem-Part I." Operations Research 11: 849-859.
【17】Gomory, P. C. G. a. R. E. (1963). "A Linear Programming Approach to the Cutting Stock Problem-Part II." Operations Research 11(6): 863-888.
【18】Haddadi, S. (1997). "Simple Lagrangian heuristic for the set covering problem." European Journal of Operational Research 97: 200-204.
【19】Halperin, E., G. Kimmel, et al. (2005). "Tag SNP selection in genotype data for maximiz-ing SNP prediction accuracy." Bioinformatics 21(suppl_1): i195-203.
【20】Hifi, M., V. T. Paschos, et al. (2000). "A neural network for the minimum set covering problem." Chaos, Solitons & Fractals 11(13): 2079-2089.
【21】Huang, Y.-T., K. Zhang, et al. (2005). "Selecting additional tag SNPs for tolerating miss-ing data in genotyping." BMC Bioinformatics 6(14712105): 263.
【22】Johnson, G. C. L., L. Esposito, et al. (2001). "Haplotype tagging for the identification of common disease genes." Nat Genet 29(2): 233-237.
【23】Ke, X. and L. R. Cardon (2003). "Efficient selective screening of haplotype tag SNPs." Bioinformatics 19(2): 287-288.
【24】Li, W.-H. and D. Graur (1990). Fundamentals of molecular evolution, Sinauer Associates, Inc. Sunderland. Massachusetts.
【25】Lin, C. H. (2005). Linkage Disequilibrium Measure. Department of Statistics. Los Angeles, University of California.
【26】Nowotny, P., J. M. Kwon, et al. (2001). "SNP analysis to dissect human traits." Current Opinion in Neurobiology 11(5): 637-641.
【27】Ohlsson, M., C. Peterson, et al. (2001). "An efficient mean field approach to the set cover-ing problem." European Journal of Operational Research 133(3): 583-595.
【28】Patil, N., A. J. Berno, et al. (2001). "Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21." Science 294(5547): 1719-1723.
【29】Phillips, M. S., R. Lawrence, et al. (2003). "Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots." Nat Genet 33(3): 382-387.
【30】Shastry, B. S. (2002). "SNP alleles in human disease and evolution." Journal of Human Genetics 47(11): 561-566.
Slavík, P. (1996). A tight analysis of the greedy algorithm for set cover. Proceedings of the twenty-eighth annual ACM symposium on Theory of computing. Philadelphia, Pennsylvania, United States, ACM.
【31】Wang, N., J. M. Akey, et al. (2002). "Distribution of Recombination Crossovers and the Origin of Haplotype Blocks: The Interplay of Population History, Recombination, and Mutation." Am. J. Hum. Genet. 71(5): 1227-1234.
【32】Wang, Y., E. Feng, et al. (2007). The Construction of Minimal Set-Covering Model for TagSNP Selection Problem and Heuristic Function Algorithm. Bioinformatics and Biomedical Engineering, 2007. ICBBE 2007. The 1st International Conference on.
【33】Weale, M. E., C. Depondt, et al. (2003). "Selection and Evaluation of Tagging SNPs in the Neuronal-Sodium-Channel Gene SCN1A: Implications for Linkage-Disequilibrium Gene Mapping " Am. J. Hum. Genet. 73: 551-565.
【34】Zadeh, L. A. (1963). "Optimality and Nonscalar-Valued Performance Criteria." IEEE Transactions on Automatic Control AC-8(1): 59-60.
【35】Zhang, K., P. Calabrese, et al. (2002a). "Haplotype Block Structure and Its Applications to Association Studies: Power and Study Designs." Am. J. Hum. Genet. 71: 1386-1394.
【36】Zhang, K., M. Deng, et al. (2002). A dynamic programming algorithm for haplotype block partitioning. Proceeding Natl. Academic Science. USA. 99: 7335-7339.
【37】Zhang, K. and L. Jin (2003). "HaploBlockFinder: haplotype block analyses." Bioinfor-marics 19(10): 1300-1301.