簡易檢索 / 詳目顯示

研究生: 馬家宜
Ma, Chia-Yi
論文名稱: 分別以連鎖不平衡及拉氏鬆弛法選取代表性單核苷酸多型性之研究
TagSNP Selection Problems based on Linkage Disequilibrium and Lagrangian Relaxation
指導教授: 王逸琳
Wang, I-Lin
學位類別: 碩士
Master
系所名稱: 管理學院 - 工業與資訊管理學系
Department of Industrial and Information Management
論文出版年: 2008
畢業學年度: 96
語文別: 中文
論文頁數: 77
中文關鍵詞: 連鎖不平衡標記型單核苷酸多型性基因組單體型演算法拉氏鬆弛法
外文關鍵詞: tagSNP, Lagrangian Algorithm, Graph, Haplotype, Algorithm, Linkage Disequilibrium
相關次數: 點閱:53下載:2
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 在DNA序列可能發生的眾多差異性當中,單核苷酸多型性(Single Nucleotide Polymorphism,SNP)是最常發生的一種遺傳變異,由多個SNP鹼基所組成之序列稱為基因組單體形(Haplotype),此序列的改變對疾病的發生及人類特徵的顯現有重大關聯,可被應用於辨識不同疾病及其他相關之醫學研究上。由於目前已發現的SNP資料量龐大,為了節省SNP資料庫所需的高成本花費,許多研究建議以被稱為tagSNP的SNP序列資料部份集合來代表原本全部的SNP序列。
    tagSNP可依其應用目的而有不同的定義,本研究首先針對文獻中最常被使用之tagSNP定義,將其選取問題(Selection Problem) 轉換成一個具有多重最佳解的0,1二元整數規劃問題,以選出可辨識出所有的Haplotype序列樣式之最小SNP部分集合。由於過去研究多著重於改善求解方法之效率,並未評估所求得之最佳解與其它最佳解間之資訊差異,因此本研究提出一個以圖型理論為基礎的啟發式演算法,先求解出所有的最佳解,再採用連鎖不平衡(Linkage Disequilibrium,LD)觀念以計算已被選取之tagSNP與尚未被選出之其它SNP間的相互關連性,作為該最佳解所包含Haplotype資訊量多寡的評估指標,並選取其中最多資訊者為最終最佳解。此外,我們亦提出一個可同時考慮極小化tagSNP個數與極大化LD值之雙目標數學規劃模式以求解類似問題。
    在求解大規模的tagSNP選取問題上,本研究提出一個以拉氏鬆弛法為基礎的啟發式演算法LRH,採用次梯度(subgradient)法更新拉氏乘數(Lagrangian multiplier),使之逐漸逼近最佳解。我們亦在求解過程中加入貪婪演算法的觀念,藉由固定部份SNP欄位以逐漸縮減問題規模,改善求解速度及求解品質。此外,我們亦提出一個結合LRH與最佳化軟體CPLEX兩者優點的二階段求解方法;數值測試結果顯示該二階段解法的確可以在更短的時間內選取出品質更佳之tagSNP解。最後,本研究提出一個整數規劃模式以在具有容量限制的生物晶片上選取較可靠的tagSNP解,並呈現容量限制下限與辨識之可靠性間的關係圖以供後續研究參考。

    Among all possible DNA sequence variations, Single Nucleotide Polymorphism (SNP)is the most common genetic variation. Sequence of closely linked SNPs composes the haplotype. Changes in the sequence can have significant influence on disease occurrences and the phenotype of human traits. The changes can be applied for identifying diseases and other medical research. At present, there are voluminous data of discovered SNPs. To make SNP database more cost-effective, many researchers propose tagging SNPs, a minimal subset of SNPs called tagSNP, to capture the full information of the original SNP sequence(haplotype).
    Various SNP tagging methods have been proposed in the literature, based on different purpose of applications. This paper focuses on the most commonly cited definition of tagSNP and proposes methods to select them. In particular, we first seek the smallest SNP subset to identify all Haplotype patterns. The tagSNP Selection Problem can be modeled as a 0-1 binary integer programming problem with multiple optimal solutions. Previously, scholars focused on improving the efficiency of their solution methods, and ignored the differences among multiple optimal solutions. To this end, we proposes a Heuristic algorithm based on graph theory that first solves for all the multiple optimal solutions and then recommend a more informative set of optimal solution based on the concept of Linkage Disequilibrium(LD). Our method calculates the relevancy between a selected tagSNP and the other SNPs, which later serves as an indicator for assessing the amount of information it carries. Among all the multiple optimal solutions, we recommend the one with the highest calculated LD value. In addition, we also propose a bi-objective integer programming model which tries to minimize the number of tagSNPs while maximize its associated LD value at the same time.
    To deal with large-scale tagSNP selection problems, we propose a heuristic called LRH, based on the theory of Lagrangian Relaxation. In particular, a modified subgradient method is proposed to update the Lagrangian multiplier which in turn approaches to the optimal solution step by step. In LRH, we incorporate the concept of Greedy Algorithm which selects some good SNP column, gradually reduces the problem size, and thus greatly improves the speed and quality of the calculated solution. The computation results show that LRH has good efficiency and effectiveness, since it can quickly converges to a better solution. Moreover, we also give a two-stage solution method(MIX) which first uses LRH to select good candidate SNP columns and then solve a reduced tagSNP selection problem of much smaller size based on the selected candidate SNPs. The computational results show that the proposed two-stage approach converges to a better solution in a much shorter time.
    Finally, in a suggested future research topic, we consider the case where the selected tagSNPs is to be included in a biochip, while the capacity for the biochip is sufficiently large. In this case, we no longer have to select the SNPs of minimum size. Instead, we focus on selecting those tagSNPs that can be used to differentiate haplotypes as many as possible so that the selected SNPs are more robust or reliable.

    中文摘要 i 英文摘要 ii 誌謝 iv 目錄 v 表目錄 viii 圖目錄 ix 第一章 緒論 1 1.1 研究背景 1 1.1.1 DNA、變異(Variation)、突變(Mutation) 1 1.1.2 SNP、基因組單體型(Haplotype) 2 1.2 研究動機與目的 4 1.3 研究問題 5 1.4 論文架構 7 第二章 文獻回顧 8 2.1 tagSNP選取問題之生物資訊相關背景 8 2.1.1限制Diversity與Block結構 8 2.1.2連鎖不平衡 10 2.1.3 Haplotype Block之定義 12 2.1.4 tagSNP之定義 13 2.2 tagSNP選取問題 14 2.2.1 tagSNP選取問題模式 15 2.2.2 tagSNP選取問題之求解方法與結果 15 2.3 tagSNP選取問題之理論複雜度 17 2.4 集合涵蓋問題 18 2.4.1問題模式 18 2.4.2求解方法 19 2.5 小結 20 第三章 求解tagSNP選取問題之多重解最佳解 22 3.1 問題之數學模式 22 3.2 Multi-TagSNP演算法 26 3.3 Multi-TagSNP之時間複雜度 27 3.4 Multi-TagSNP範例說明 29 3.5 以多目標規劃模式求解tagSNP選取問題 33 3.5.1多目標規劃模式簡介 33 3.5.2多目標規劃tagSNP選取之數學模式 34 3.5.3多目標規劃模式之求解方法 35 3.5.4多目標規劃tagSNP選取之數學模式之範例演練 38 3.6 小結 40 第四章以拉氏鬆弛法求解tagSNP選取問題 42 4.1 拉氏鬆弛法 42 4.1.1拉氏問題模式 42 4.1.2次梯度法 43 4.1.3演算流程 44 4.1.4範例說明 46 4.2 啟發式拉氏鬆弛演算法LRH 47 4.3 LRH演算法數值測試分析 51 4.3.1測試資料 51 4.3.2測試結果 52 4.3.2.1模擬資料之測試結果 52 4.3.2.2真實生物資料之測試結果 65 4.4 小結 67 第五章 結論與未來研究方向 69 5.1論文總結與貢獻 69 5.2 未來研究方向 70 5.2.1具容量限制之生物晶片上選取較可靠的tagSNP問題 70 5.2.2其它之研究方向建議 72 參考文獻 74 表目錄 表3.1:各組tagSNP之平均 值與 值 33 表3.2:用權重法求解MOP問題範例 39 表4.1:CPLEX、LRH與MIX之測試結果與OPT gap 58 表4.2:CPLEX、LRH與MIX之測試實際時間與標準化時間(NT) 61 表4.3:LRH、MIX與精確解之最大誤差、最小誤差及平均誤差 65 圖目錄 圖1.1:SNP與Haplotype之說明 3 圖1.2:以tagSNP辨識Haplotype樣式 5 圖2.1:DNA序列之Block structure 9 圖2.2:Diversity之計算 9 圖2.3:設定Diversity門檻值切割Block 10 圖2.4:tagSNP選取問題與TCP之關聯 18 圖3.1:SNP及其所能辨識之Haplotype pair範例 23 圖3.2:比較Haplotype pair差異之轉換矩陣 23 圖3.3:說明SNP與Haplotype pair辨識關係之bipartite網路圖 24 圖3.4:挑選tagSNP範例圖 25 圖3.5:Multi-TagSNP演算法演算法Step 1範例 30 圖3.6:Haplotype pair比較差異對應關係圖 30 圖3.7:Multi-TagSNP演算法Step 2之處理示範圖 31 圖3.8:Multi-TagSNP演算法Step 4之處理示範圖 32 圖3.9:生產效率界線說明圖 35 圖3.10:目標空間與決策空間之關係 36 圖3.11:依偏好資訊流向分類求解多目標規劃之方法 37 圖3.12:MOP演練範例之目標空間關係圖 40 圖4.1:拉氏鬆弛法之演算流程圖 45 圖4.2:修正演算法之挑選對應解範例說明 48 圖4.3:Case 1固定欄位範例說明 49 圖4.4:Case 2固定欄位範例說明 49 圖4.5:Case 3固定欄位範例說明 50 圖4.6:採用LRH求解Sim0.05資料之收斂情況 53 圖4.7:採用LRH求解Sim0.2資料之收斂情況 54 圖4.8:採用LRH求解Hudson資料之收斂情況 55 圖4.9:CPLEX、LRH與MIX之測試結果比較 57 圖4.10:CPLEX、LRH與MIX之測試時間比較 60 圖4.11:Sim0.05資料之求解結果 62 圖4.12:Sim0.2資料之求解結果 63 圖4.13:Hudson資料之求解結果 64 圖4.14:真實資料中遺漏資訊的處理方式示意圖 66 圖5.1:遺漏資訊對辨識Haplotype pair之影響 70 圖5.2:Sim0.05資料之F與C’關係圖 72 圖5.3:Sim0.2資料之F與C’關係圖 72

    【1】Al-Sultan, K. S., M. F. Hussain, et al. (1996). "A Genetic Algorithm for the Set Covering Problem." Journal of the Operational Research Society 47(5): 702-709.
    【2】Arnheim, N., P. Calabrese, et al. (2003). "Hot and Cold Spots of Recombination in the Human Genome: the Reason We Should Find Them and How This Can Be Achieved " Am. J. Hum. Genet. 73(1): 5-16.
    【3】Avi-Itzhak, H., X. Su, et al. (2003). Selection of minimum subsets of single nucleotide polymorphisms to capture haplotype block diversity. In Processding of Pacific Symposium on Biocomputing, Lihue, HI.
    【4】Bafna, V., B. V. Halld´orsson, et al. (2003). Haplotypes and informative SNP selection al-gorithms: don't block out information. Proceedings of the seventh annual interna-tional conference on Research in computational molecular biology(RECOMB), Berlin, Germany, ACM.
    【5】Balas, E. (1971). "Intersection cuts - A new type of cutting planes for integer program-ming." Operations Research, 19(1): 19-39.
    【6】Beasley, J. E. and P. C. Chu (1996). "A genetic algorithm for the set covering problem." European Journal of Operational Research 94(2): 392-404.
    【7】Caprara, A., M. Fischetti, et al. (1999). "A Heuristic Method for the Set Covering Prob-lem." Operations Research 47(5): 730-743.
    【8】Carlson, C. S., M. A. Eberle, et al. (2004). "Selecting a maximally informative set of sin-gle-nucleotide polymorphisms for association analyses using linkage disequilib-rium." Am. J. Hum. Genet. 74(1): 106-120.
    【9】Ceria, S., P. Nobili, et al. (1998). "A Lagrangian-based heuristic for large-scale set cover-ing problems." Mathematical Programming 81(2): 215-228.
    【10】Chvatal, V. (1979). "A Greedy Heuristic for the set covering problem." Mathematics of Operations Research 4(3): 233-235.
    【11】Cohon, J. L. (1978). Multiobjective Programming and Planning. New York: Academic.
    【12】Devlin, B. and N. Risch (1995). "A Comparison of Linkage Disequilibrium Measures for Fine-Scale Mapping." Genomics 29(2): 311-322.
    【13】Feo, T. A. and M. G. C. Resende (1989). "A Probabilidtic heuristic for a computationally difficult set covering problem." Operational Research Letters 8: 67-71.
    【14】Frangioni, A. and Gallo, G. (1999). "A bundle type dual-ascent approach to linear multi-commodity min cost flow problems." INFORMS Journal on Computing 11(4): 370–393
    Gabriel, S. B., S. F. Schaffner, et al. (2002). "The Structure of Haplotype Blocks in the Human Genome." Science 296(5576): 2225-2229.
    【15】Garey, M. R. and D. S. Johnson (1979). Computers and Intractability: A Guide to the The-ory of NP-Completeness. New York, W. H. Freeman.
    【16】Gomory, P. C. G. a. R. E. (1961). "A Linear Programming Approach to the Cutting Stock Problem-Part I." Operations Research 11: 849-859.
    【17】Gomory, P. C. G. a. R. E. (1963). "A Linear Programming Approach to the Cutting Stock Problem-Part II." Operations Research 11(6): 863-888.
    【18】Haddadi, S. (1997). "Simple Lagrangian heuristic for the set covering problem." European Journal of Operational Research 97: 200-204.
    【19】Halperin, E., G. Kimmel, et al. (2005). "Tag SNP selection in genotype data for maximiz-ing SNP prediction accuracy." Bioinformatics 21(suppl_1): i195-203.
    【20】Hifi, M., V. T. Paschos, et al. (2000). "A neural network for the minimum set covering problem." Chaos, Solitons & Fractals 11(13): 2079-2089.
    【21】Huang, Y.-T., K. Zhang, et al. (2005). "Selecting additional tag SNPs for tolerating miss-ing data in genotyping." BMC Bioinformatics 6(14712105): 263.
    【22】Johnson, G. C. L., L. Esposito, et al. (2001). "Haplotype tagging for the identification of common disease genes." Nat Genet 29(2): 233-237.
    【23】Ke, X. and L. R. Cardon (2003). "Efficient selective screening of haplotype tag SNPs." Bioinformatics 19(2): 287-288.
    【24】Li, W.-H. and D. Graur (1990). Fundamentals of molecular evolution, Sinauer Associates, Inc. Sunderland. Massachusetts.
    【25】Lin, C. H. (2005). Linkage Disequilibrium Measure. Department of Statistics. Los Angeles, University of California.
    【26】Nowotny, P., J. M. Kwon, et al. (2001). "SNP analysis to dissect human traits." Current Opinion in Neurobiology 11(5): 637-641.
    【27】Ohlsson, M., C. Peterson, et al. (2001). "An efficient mean field approach to the set cover-ing problem." European Journal of Operational Research 133(3): 583-595.
    【28】Patil, N., A. J. Berno, et al. (2001). "Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21." Science 294(5547): 1719-1723.
    【29】Phillips, M. S., R. Lawrence, et al. (2003). "Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots." Nat Genet 33(3): 382-387.
    【30】Shastry, B. S. (2002). "SNP alleles in human disease and evolution." Journal of Human Genetics 47(11): 561-566.
    Slavík, P. (1996). A tight analysis of the greedy algorithm for set cover. Proceedings of the twenty-eighth annual ACM symposium on Theory of computing. Philadelphia, Pennsylvania, United States, ACM.
    【31】Wang, N., J. M. Akey, et al. (2002). "Distribution of Recombination Crossovers and the Origin of Haplotype Blocks: The Interplay of Population History, Recombination, and Mutation." Am. J. Hum. Genet. 71(5): 1227-1234.
    【32】Wang, Y., E. Feng, et al. (2007). The Construction of Minimal Set-Covering Model for TagSNP Selection Problem and Heuristic Function Algorithm. Bioinformatics and Biomedical Engineering, 2007. ICBBE 2007. The 1st International Conference on.
    【33】Weale, M. E., C. Depondt, et al. (2003). "Selection and Evaluation of Tagging SNPs in the Neuronal-Sodium-Channel Gene SCN1A: Implications for Linkage-Disequilibrium Gene Mapping " Am. J. Hum. Genet. 73: 551-565.
    【34】Zadeh, L. A. (1963). "Optimality and Nonscalar-Valued Performance Criteria." IEEE Transactions on Automatic Control AC-8(1): 59-60.
    【35】Zhang, K., P. Calabrese, et al. (2002a). "Haplotype Block Structure and Its Applications to Association Studies: Power and Study Designs." Am. J. Hum. Genet. 71: 1386-1394.
    【36】Zhang, K., M. Deng, et al. (2002). A dynamic programming algorithm for haplotype block partitioning. Proceeding Natl. Academic Science. USA. 99: 7335-7339.
    【37】Zhang, K. and L. Jin (2003). "HaploBlockFinder: haplotype block analyses." Bioinfor-marics 19(10): 1300-1301.

    下載圖示 校內:2009-09-04公開
    校外:2010-09-04公開
    QR CODE