簡易檢索 / 詳目顯示

研究生: 柯育豪
Ke, Yu-Hao
論文名稱: 利用確切數量間格限制之循序樣式探勘尋找DNA序列中潛在啟動子結合位點
Mining Sequential Patterns with Specific Numbers of Gaps as Possible Promoter Binding Sites in DNA Sequences
指導教授: 黃仁暐
Huang, Jen-Wei
學位類別: 碩士
Master
系所名稱: 電機資訊學院 - 電腦與通信工程研究所
Institute of Computer & Communication Engineering
論文出版年: 2017
畢業學年度: 106
語文別: 英文
論文頁數: 58
中文關鍵詞: 循序性樣式帶限制循序性樣式確切間格數目的循序性樣式生物資訊啟動子結合位點
外文關鍵詞: sequential pattern, constrained sequential pattern mining, sequential pattern with specific numbers of gaps, bioinformatics, promoter binding site
相關次數: 點閱:91下載:0
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 循序性樣式探勘已經被研究了許多年且已應用在不同的領域,像是購物行為或是生物資訊學領域。在生物資訊學領域,循序性樣式探勘大多用於預測轉錄因子結合位點、識別蛋白質折疊區域,以及鑑定蛋白質相互作用中的區域。在某些情況下,直接使用循序性樣式探勘來尋找生物序列中的樣式可能是不適用的,例如樣式中會含有突變或不重要的nucleotides。過去有些研究在探勘時使用間格限制來解決的這個問題,但研究中並沒有考慮到樣式中確切間格的數量。在此研究中,我們旨在利用確切數量間格限制之循序樣式探勘來尋找DNA序列中潛在啟動子結合位點。我們提出DisSPAM_SNG演算法並使用新的間格表示方法來找出結合位點共通的樣式。我們同時提出數個與間格相關的限制來減少循序性樣式的數量與計算時間,並以分散式處理的方式來加快探勘的速度。我們藉由修改過往的演算法並與我們非分散式的SPAM_SNG演算法相比之後,實驗結果證明了SPAM_SNG演算法的計算速度優於其他演算法。

    Sequential pattern mining has been developed for many years and is applied on many different fields, such as purchasing behavior and bioinformatics domain. In bioinformatics domain, sequential pattern mining is used for prediction of transcription factor binding sites, recognition of protein folds, and identification of hot regions in protein-protein interactions. Directly using sequential pattern mining to find interesting patterns in biological sequences may not be suitable for some cases, such as patterns containing mutation or unimportant nucleotides. Some research use gap constraint to tackle the problem, but they did not consider specific numbers of gaps in the pattern. In this paper, we aim at finding sequential patterns with specific numbers of gaps as possible promoter binding sites in DNA sequences. A new representation of gaps is given to discover common patterns of binding sites. We propose an algorithm, DisSPAM_SNG, for distributed mining sequential patterns with specific numbers of gaps. We also propose a level-order traversal and several constraints to reduce the processing time and the number of interesting patterns. For performance evaluation, we compare SPAM_SNG, the non-distributed version of DisSPAM_SNG, with some modified version of previous state-of-the-art algorithms. The experimental results show that SPAM_SNG outperforms the state-of-the-art algorithms.

    中文摘要. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Acknowledgment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1 Transcription and Promoter Prediction Problem . . . . . . . . . . . . . . . . . . 6 2.1.1 Overview of transcription . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.2 Stages of transcription . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.3 Promoter Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Sequential Pattern Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 Constraints on Sequential Pattern Mining . . . . . . . . . . . . . . . . . . . . . 9 2.4 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1 Sequential Pattern Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.2 Sequential Pattern Mining with Gap Constraint . . . . . . . . . . . . . . . . . . 14 3.3 Distributed Sequential Pattern Mining . . . . . . . . . . . . . . . . . . . . . . . 18 4 Distributed Sequential Pattern Mining with Specific Numbers of Gaps . . . . . . . . . 19 4.1 SPAM_SNG: Algorithm Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.1.1 MiningProcess: Extend sequence and Validate constraints . . . . . . . . 22 4.1.2 CalculateFrequency: Calculate support and Prune . . . . . . . . . . . . . 25 4.2 Distributed Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 5 PromoterFinder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 6 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 6.2 Varying Minimum Support Threshold . . . . . . . . . . . . . . . . . . . . . . . . 31 6.3 Varying Maxgap Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 6.4 Varying Citem Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 6.5 Varying Tgap Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 6.6 Varying Patlen Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 6.7 Varying Sequence Length (L) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 6.8 Varying Number of Sequences (D) . . . . . . . . . . . . . . . . . . . . . . . . . . 45 6.9 Varying Pgap constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 6.10 Distributed SPAM SNG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 6.11 Result Patterns and Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 7 Conclusions and Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

    [1] Eukaryotic promoter database. http://epd.vital-it.ch/human/human_database.php.
    [2] National center for biotechnology information. https://www.ncbi.nlm.nih.gov/.
    [3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB’94), pages 487–499, Sept. 1994.
    [4] R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the 11th International Conference on Data Engineering (ICDE’95), pages 3–14, Mar. 1995.
    [5] R. Agrawal and R. Srikant. Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the 5th International Conference on Extending Database Technology (EDBT’96), pages 1–17, Mar. 1996.
    [6] J. F. Allen. Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11):832–843, 1983.
    [7] C. Antunes and A. L. Oliveira. Generalization of pattern-growth methods for sequential pattern mining with gap constraints. In Proceedings of the 3rd international conference on Machine learning and data mining in pattern recognition (MLDM’03), pages 239–251, 2003.
    [8] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu. Sequential pattern mining using a bitmap representation. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’02), pages 429–435, 2002.
    [9] C.-C. Chen, C.-Y. Tseng, and M.-S. Chen. Highly scalable sequential pattern mining based on mapreduce model on the cloud. In Proceedings of the 2013 IEEE International Congress on Big Data, pages 310–317, 2013.
    [10] K.-Y. Chen, B. P. Jaysawal, J.-W. Huang, and Y.-B. Wu. Mining frequent time intervalbased event with duration patterns from temporal database. In Proceedings of the 2014 International Conference on Data Science and Advanced Analytics (DSAA’14), pages 548–554, Oct. 2014.
    [11] Y.-C. Chen, J.-C. Jiang, W.-C. Peng, and S.-Y. Lee. An efficient algorithm for mining time interval-based patterns in large database. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM’10), pages 49–58, 2010.
    [12] Y.-C. Chen, W.-C. Peng, and S.-Y. Lee. Ceminer–an efficient algorithm for mining closed patterns from time interval-based data. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining (ICDM’11), pages 121–130, Dec. 2011.
    [13] W. Fang, M. Lu, X. Xiao, B. He, and Q. Luo. Frequent itemset mining on graphics processors. In Proceedings of the Fifth International Workshop on Data Management on New Hardware (DaMoN’09), pages 34–42, 2009.
    [14] P. Fournier-Viger, C.-W. Wu, and V. S. Tseng. Mining maximal sequential patterns without candidate maintenance. In Proceedings of the 9th International Conference on Advanced Data Mining and Applications (ADMA’13), pages 169–180, Dec. 2013.
    [15] M. N. Garofalakis, R. Rastogi, and K. Shim. Spirit: Sequential pattern mining with regular expression constraints. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB’99), pages 223–234, 1999.
    [16] A. Gomariz, M. Campos, R. Marin, and B. Goethals. Clasp: An efficient algorithm for mining frequent closed sequences. In Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’13), pages 50–61, Apr. 2013.
    [17] V. Guralnik and G. Karypis. Parallel tree-projection-based sequence mining algorithms. Parallel Computing, 30(4):443–472, Apr. 2004.
    [18] A. S. Halees, D. Leyfer, and Z. Weng. Promoser: a large-scale mammalian promoter and transcription start site identification service. Nucleic Acids Res, 31:3554–3559, 2003.
    [19] J. Han. How can data mining help bio-data analysis? In Proceedings of the Workshop on Data Mining in Bioinformatics (BIOKDD’02 with SIGKDD’02 Conference), pages 1–4, 2002.
    [20] J. Han. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2005.
    [21] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. Freespan: Frequent pattern-projected sequential pattern mining. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’00), pages 355–359, 2000.
    [22] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. ACM Special Interest Group on Management of Data, 29(2):1–12, 2000.
    [23] J. Ho, L. Lukov, and S. Chawla. Sequential pattern mining with constraints on large protein databases. In Proceedings of the 12th International Conference on Management of Data (COMAD’05), pages 89–100, 2005.
    [24] C.-M. Hsu, C.-Y. Chen, and B.-J. Liu. Magiic-pro: detecting functional signatures by efficient discovery of long patterns in protein sequences. Nucleic Acids Research, 34, 2006.
    [25] C.-M. Hsu, C.-Y. Chen, B.-J. Liu, C.-C. Huang, M.-H. Laio, C.-C. Lin, and T.-L. Wu. Identification of hot regions in protein-protein interactions by sequential pattern mining. BMC Bioinformatics, 8, 2007.
    [26] A. Koper and H. S. Nguyen. Sequential pattern mining from stream data. In Proceedings of the 7th International Conference on Advanced Data Mining and Applications (ADMA’11), pages 278–291, Dec. 2011.
    [27] T.-Y. Lee, W.-C. Chang, J. B.-K. Hsu, T.-H. Chang, and D.-M. Shien. Gpminer: an integrated system for mining combinatorial cis-regulatory elements in mammalian gene group. BMC Bioinformatics, 13, 2012.
    [28] V. C.-C. Liao and M.-S. Chen. Dfsp: a depth-first spelling algorithm for sequential pattern mining of biological sequences. Knowledge and Information Systems, 38, 2013.
    [29] V. C.-C. Liao and M.-S. Chen. Efficient mining gapped sequential patterns for motifs in biological sequences. BMC Systems Biology, 7, 2013.
    [30] C.-H. Lin, D.-Y. Chiu, Y.-H. Wu, and A. L. P. Chen. Mining frequent itemsets from data streams with a time-sensitive sliding window. In Proceedings of the 2005 SIAM International Conference on Data Mining (SDM’05), pages 68–79, 2005.
    [31] C. Luo and S. M. Chung. Efficient mining of maximal sequential patterns using multiple samples. In Proceedings of the 2005 SIAM International Conference on Data Mining (SDM’05), pages 415–426, 2005.
    [32] A. Marascu and F. Masseglia. Mining sequential patterns from temporal streaming data. In Proceedings of the 1st ECML/PKDD Workshop on Mining Spatio-Temporal Data (MSTD’05), pages 1–13, Nov. 2005.
    [33] A. Marascu and F. Masseglia. Mining sequential patterns from data streams: a centroid approach. Journal of Intelligent Information Systems, 27(3):291–307, Nov. 2006.
    [34] I. Miliaraki, K. Berberich, R. Gemulla, and S. Zoupanos. Mind the gap: Large-scale frequent sequence mining. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD’13), pages 797–808, 2013.
    [35] F. Moerchen and D. Fradkin. Robust mining of time intervals with semi-interval partial order patterns. In Proceedings of the 2010 SIAM International Conference on Data Mining (SDM’10), pages 315–326, 2010.
    [36] J.-Z. Ouh, P.-H. Wu, and M.-S. Chen. Experimental results on a constraint based sequential pattern mining for telecommunication alarm data. In Proceedings of the 2nd International Conference on Web Information Systems Engineering (WISE’01), pages 186–193 vol.2, Dec. 2001.
    [37] D. Patel, W. Hsu, and M.-L. Lee. Mining relationships among interval-based events for classification. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD’08), pages 393–404, 2008.
    [38] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. Prefixspan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proceedings of the 17th International Conference on Data Engineering (ICDE’01), pages 215–224, Apr. 2001.
    [39] J. Pei, J. Han, and W. Wang. Mining sequential patterns with constraints in large databases. In Proceedings of the 11th ACM International Conference on Information and Knowledge Management (CIKM’02), pages 18–25, 2002.
    [40] J. Pei, J. Han, and W. Wang. Constraint-based sequential pattern mining: the patterngrowth methods. Journal of Intelligent Information Systems, 28:133–160, Apr. 2007.
    [41] J. Wang and J. Han. Bide: Efficient mining of frequent closed sequences. In Proceedings of the 20th International Conference on Data Engineering (ICDE’04), pages 79–90, 2004.
    [42] K. Wang, Y. Xu, and J. X. Yu. Scalable sequential pattern mining for biological sequences. In Proceedings of the thirteenth International Conference on Information and Knowledge Management (CIKM’04), pages 178–187, 2004.
    [43] T. White. Hadoop: The Definitive Guide. O’Reilly Media, Inc., 2012.
    [44] S.-Y. Wu and Y.-L. Chen. Mining nonambiguous temporal patterns for interval-based events. IEEE Transactions on Knowledge and Data Engineering (TKDE’07), 19(6):742–758, June 2007.
    [45] X. Yan, J. Han, and R. Afshar. Clospan: Mining closed sequential patterns in large datasets. In Proceedings of the 2003 SIAM International Conference on Data Mining (SDM’03), pages 166–177, 2003.
    [46] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud’10), pages 10–10, 2010.
    [47] M. J. Zaki. Sequence mining in categorical domains: incorporating constraints. In Proceedings of the ninth international conference on Information and knowledge management (CIKM’00), pages 422–429, 2000.
    [48] M. J. Zaki. Parallel sequence mining on shared-memory machines. Journal of Parallel and Distributed Computing, 61:401-426, 2001.
    [49] M. J. Zaki. Spade: An efficient algorithm for mining frequent sequences. Machine Learning, 42:31–60, 2001.

    無法下載圖示 校內:2022-11-09公開
    校外:不公開
    電子論文尚未授權公開,紙本請查館藏目錄
    QR CODE