簡易檢索 / 詳目顯示

研究生: 楊家俊
Yang, Chia-Chun
論文名稱: 設計一個使用於基因字串比對的過濾器
Design of a Filter for DNA Sequence Alignment
指導教授: 李強
Lee, Chiang
學位類別: 碩士
Master
系所名稱: 電機資訊學院 - 資訊工程學系
Department of Computer Science and Information Engineering
論文出版年: 2003
畢業學年度: 91
語文別: 中文
論文頁數: 67
中文關鍵詞: 基因字串比對過濾器
外文關鍵詞: DNA Sequence alignment, filter
相關次數: 點閱:73下載:1
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • Local alignment algorithm是用來找出兩條sequences間相似的substrings。然而local alignment algorithm目前有兩個問題。首先,DNA sequence database的資料量,現在是每10個月成倍數成長,因此DNA sequence database的資料量在將來一定會變的十分的龐大,如此日後要對DNA sequence database執行local alignment process將會是一件非常耗時的工作。再來,目前存在的local alignment algorithm的做法,都是把database裡所有的sequences,拿來和query sequence執行local alignment process。有一些可以由很簡單的方法,就可知不會滿足query 的sequence,卻仍要花很長的時間去作local alignment process,造成系統執行了很多不必要的工作。

    在此提供了一個filter的方法,來加快對DNA sequence database執行local alignment process的速度。我的做法是在還沒執行local alignment process之前,用一個很簡單的方法,先把不可能滿足使用者需求的sequences先行濾掉。而local alignment algorithm只要對那些沒被濾掉的sequences,執行local alignment process。

    由我的實驗得知,當query sequence長為460,要求的score值為337(match score=1,mismatch score=-1/3)時,我的local alignment filter可達到五成的過濾率。而且越長的sequence將會有更高的過濾率,也因此我的filter更顯的有存在的必要。

    The problem of local alignment is to locate the similar substrings of two sequences. Currently, it faces two main difficulties. Firstly, the size of DNA sequence databases doubles in every ten months. Executing local alignment operations in these databases will be more and more time-consuming. Secondly, current methods for local alignment compare all sequences in the databases, incurring a significant amount of overhead in data processing.

    In this thesis, we propose a novel method for local alignment, which is to filter the input data before the local alignment process is performed. Those unqualifying sequences are removed so that the amount of input data to the time-consuming local alignment process can be drastically reduced. In our experiments, we find that when the length of a query sequence is 460 and a required score is 337 (match score =1 and mismatch score = -1/3), then about 50% of database sequences are filtered out. The longer the sequence, the higher the filtering rate. This demonstrates the validity and the value of the proposed solution.

    Table of Contents i Table of Tables ii Table of Figures iii Table of Algorithms iv 1. 導論 1 1.1 環境需求......1 1.2 Problem formulation......2 1.3 Contribution......3 1.4 Document organization......4 2 Related work 5 2. Environment and background......5 2.1.1 基本名詞解釋......5 2.1.2 Global alignment......6 2.1.3 Local alignment......7 2.2 平行計算環境......9 2.3 向量比對環境......11 2.3.1 向量比對法......11 2.3.2 介紹SST的做法......13 3. The blueprint of local alignment filter......17 3.1 Architecture of local alignment......17 3.2 Architecture of local alignment with filter......18 4. Preprocess stage of Local alignment filter......20 4.1 找出database裡所有的windows......20 4.2 把window's code map成向量......20 4.3 建立向量的index......21 4.3.1 四維bucket陣列做法......21 4.3.2 index tree做法......22 4.3.3 Hash bucket的做法......23 5. Search stage of Local alignment filter 24 5.1 local alignment filter的filter準則......24 5.1.1 similar substring的長度......25 5.1.2 similar substring的mismatch ratio......26 5.2 search stage的work......28 5.2.1 定義向量的distance function......28 5.2.2 複製/切割Query sequence......29 5.2.3 caculate similar向量的distance......31 5.2.4 Search index架構......31 5.2.5 Local alignment filter output......35 6. 滿足query條件, 且最大長度的substring 37 6.1 建立Qi'和Offset(k)......37 6.1.1 建立Qi'......37 6.1.2 建立Offset(k)......38 6.2 建立Mik......38 6.2.1 數列Mik......39 6.2.2 建構Mik......39 6.3 Substring search algorithm......42 6.3.1 定義專有名詞......42 6.3.2 定義algorithm的副程式......45 6.3.3 Substring search algorithm......52 7. Performance evaluations......55 7.1 實驗環境......55 7.1.1 Data......55 7.1.2 軟硬體......56 7.2 Correctness analysis......56 7.3 Filter ration......57 8. Conclusion 61

    1. ftp://ftp.ncbi.nih.gov/repository/dbEST/

    2. NCBI Handbook. http://www.ncbi.nlm.nih.gov, October 9, 2002.

    3. http://www.nyu.edu.

    4. A.Hume and D.M.Sunday. ``Fast string searching'. Software-Practice and Experience, Vol.21(11),pp.1221-1248,November 1991.

    5. A.S. Deshpande, D.S. Richards, and W.R. Pearson. ``A Platform for Biological Sequence Comparison on Parallel Computers'. Computer Applications in the Biosciences, vol.7, pp.237-247,1991.

    6. Burkhardt S., A. Crauser, et al. ``Q-gram based database searching using a suffix array(QUASAR)'. In Third International Conference on Computational Molecular Biology(Re-comb 99), Lyon, France. ACM Press,1999.

    7. Califano A. and I. Rigoutsos. ``Flash: Fast look-up algorithm for string homology'. Technical report, IBM T.J. Watson Research Center, Yorktown Heights, NY, 1995.

    8. Dennis A.Benson, Mark Boguski, David J. Lipman and James Ostell. ``GenBank'.Nucleic Acids Research, Vol.24(1),pp.1-5,1996.

    9. Dennis Benson,David J.LIPMAN, and James Ostell. ``GenBank'.Nucleic Acids Research, Vol.21(13),pp.2963-2965, 1993.

    10. D.E.Knuth,J.H.Morris,Jr.and V.R.Pratt. ``Fast pattern matching in strings'. Society for Industrial and Applied Mathematics Journal on Computing, Vol.6,pp.323-350,1977.

    11. D.F. Sittig, D. Foulser, N.Carriero, G.McCorkle, and P.L. Miller. ``A Parallel Computing Approach to Genetic Sequence Comparison: The Master-Worker Paradigm with Interworker Communication'. Computers and Biomedical Research, vol. 24, pp.152-169, 1991.

    12. D.Gusfield. ``Algorithms on Strings, Trees, and Sequences'. Computer Science and Computational Biology. Cambridge University Press, 1 edition,January 1997.

    13. D.J.Lipman and W.R.Pearson. ``Rapid and sensitive protein similarity searches'. Science,227:1435-41, 1985.

    14. D.M.Sunday. ``A very fast substring search algorithm'. Communication of the ACM, Vol.33(8),pp.132-142,August 1990.

    15. Eldar Giladi, Michael G. Walker, James Ze Wang and Wayne Volkmuth. ``SST: An algorithm for searching sequence databases in time proportional to the logarithm of the database size'. In RECOMB, Japan, 2000.

    16. E. Giladi, M.G. Walker, James Ze Wang and Wayne Volkmuth. ``SST: an algorithm for finding near-exact sequence matches in time proportional to the logarithm of the database size'. 18(6):873-879, 2002.

    17. E.Myers. ``An O(ND) difference algorithm and its variations'. Algorithmica, pages 251-266,1986.

    18. E.W. Edmiston, N.G. Core, J.H. Saltz, and R.M. Smith. ``Parallel Processing of Biological Sequence Comparison Algorithms'. Int'l J.Parallel Programming,vol.17,pp.259-275,1988.

    19. Hugh Williams, and Justin Zobel. ``Indexing nucleotide databases for fast query evaluation'. EDBT, 1996.

    20. Joao Setubal,Joao Meidanis. ``Introduction to computational molecular biology'. PWS Publishing Company,1997.

    21. Miller C., J. Gurd, et al. ``RAPID: an extremely fast dna database search tool. In Genes, proteins, and computers V'. York University, England, 1998.

    22. P.D.Smith. ``Experiments with a very fast substring search algorithm'. Software-Practice and Experence, Vol.21(10),pp.1065-1074,October 1991.

    23. P.D.Smith. ``On tuning the Boyer-Moore-Horspool string searching algorithm'. Software-Practice and Experience, Vol.24(4),pp.435-436,April 1994.

    24. P.L. Miller, P.M. Nadkarni, and W.R. Pearson. ``Comparing Machine-Independent versus Machine-Specific Parallelization of a Software Platform for Biological Sequence Comparison'. Computer Applications in the Biosciences, vol.8, pp.167-175,1992.

    25. R.A.Baeza-Yates and C.H.Perleberg. ``Fast and practial approximate string matching'. In Combinatorial Pattern Matching, Third Annual Symposium,pages 185-192,1992.

    26. R.A.Baeza-Yates and G.Navarro. ``Faster approximate string matching'. Algorithmica,23(2):127-158,1999.

    27. R.Nigel Horspool. ``Practical fast searching in strings'. Software-Practice and Experience,Vol.10,pp.501-506,1980.

    28. R.S.Boyer and J.S.Moore. ``A fast string searching algorithm', Communications of the ACM, Vol.20(10),pp.762-772,October 1977.

    29. S.B.Needleman and C.D.Wunsch. ``A general method applicable to the search for similarities in the amino acid sequence of two proteins'. Journal of Molecular Biology, Vol.48,pp.443-453,1970.

    30. Stephen F.Altschul,Warren Gish,Webb Miller,Eugene W.Myers and David J.Lipman. ``Basic Local Alignment Search Tool'. Journal of Molecular Biology, Vol.215(3), pp.403-410, October 1990.

    31. SHI-YI SHEN, JUN YANG, ADAM YAO, and PEI-ING HWANG. ``Super Pairwise Alignment(SPA): An Efficient Approach to Global Alignment for Homologous Sequences'. Journal of Computational Biology, Vol.9,pp477-486, Mary 2002.

    32. S.Uliel, A.Fliess, A.Amir, and R.Unger. ``A simple algorithm for detecting circular permutations in proteins'. Bioinformatics, 15(11):930-936, 1999.

    33. S.Wu and U.Manber. ``Fast text searching allowing errors'. Communications of the ACM, 35(10):83-91, 1992.

    34. T.F.Smith and M.S.Waterman. ``Identification of common molecular subsequences'. Journal of Molecular Biology, March 1981.

    35. T.kahveci and A.Singh. ``An Efficient Index Structure for String Databases'. In VLDB, page 351-360, Roma,Italy, September 2001.

    36. T.K. Yap, O. Frieder, and R.L. Martino. ``Parallel Computation in Biological Sequence Analysis'. IEEE Trans. Parallel and Distributed Systems, vol. 9, No.3, March, 1998.

    37. W.R.Pearson. ``Protein sequence comparision and protein evolution'. Tutorial T6 of Intelligent Systems in Molecular Biology, Cambridge, England, July 1995.

    38. W.R.Pearson and David J.Lipman. ``Improved tools for biological sequence comparison'. Proc. Natl. Academy Science,85:2444-2448, 1988.

    39. X. Guan, R. Mural, R. Mann, and E. Uberbacher. ``On Parallel Search of DNA Sequence Databases'. Proc.Fifth SIAM Conf.Parallel Processing for Scientific Computing, pp.332-337,1991.

    下載圖示 校內:立即公開
    校外:2003-08-26公開
    QR CODE