| Graduate student: | 張志豪 Chang, Chih-Hao |
|---|---|
| Thesis title: | 基於項目集樹樣式探勘之蛋白質功能註解系統 Gene Ontology Supported Protein Function Annotation via Item-Set Tree-Based Pattern Mining |
| Advisors: | 謝孫源 Hsieh, Sun-Yuan; 蔣榮先 Chiang, Jung-Hsien |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2004 |
| Academic year of graduation: | 92 |
| Language: | Chinese |
| Number of pages: | 48 |
| Keywords (translated from Chinese): | Function Annotation, Document Mining, Information Extraction, Pattern Mining, Bioinformatics |
| Keywords (English): | Text Mining, Pattern Mining, Information Extraction, Bioinformatics, Function Annotation |
With the rapid growth of the biomedical literature, information extraction is quickly becoming an indispensable technique in bioinformatics. Among the available approaches, pattern-based information extraction, which matches patterns against text, can extract meaningful information and knowledge from documents quickly and accurately.
In this thesis, we propose a pattern mining method based on the item-set tree to replace the time-consuming, labor-intensive manual definition of patterns by experts that pattern-based information extraction normally requires. On this basis we implement a system that automatically identifies gene and protein information in the biomedical literature, namely biological process, cellular component, and molecular function, and produces the corresponding Gene Ontology (GO)-based annotations.
The basic idea is that pattern mining can pick out the main features shared by similarly phrased sentences while filtering out secondary parts, such as words used only for emphasis. These features are then used to locate the sentences in a document that describe protein function. General mining approaches, however, tend to ignore low-frequency features. The item-set tree-based method used in this thesis overcomes this drawback: items that are infrequent but can still serve as features are correctly retained.
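The drawback mentioned above can be illustrated with a small sketch. It is not the thesis's algorithm and does not build an item-set tree. It only contrasts a single minimum-support filter, which discards infrequent items, with a targeted support query for a chosen item set, computed here by a plain scan. The sentences, the function names `frequent_items` and `targeted_support`, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from collections import Counter

# Toy, made-up sentences, each already reduced to a set of word items.
# These are illustrative only and are not data from the thesis.
SENTENCES = [
    {"protein", "is", "involved", "in", "apoptosis"},
    {"protein", "is", "strongly", "involved", "in", "transport"},
    {"protein", "localizes", "to", "nucleus"},
    {"protein", "is", "involved", "in", "signaling"},
]


def frequent_items(sentences, min_support):
    """Return items whose relative support is at least min_support.

    This mimics a generic frequency-threshold miner: anything rarer
    than the threshold is silently dropped.
    """
    counts = Counter(item for sentence in sentences for item in sentence)
    total = len(sentences)
    return {
        item: count / total
        for item, count in counts.items()
        if count / total >= min_support
    }


def targeted_support(sentences, itemset):
    """Return the relative support of one chosen item set.

    A plain linear scan stands in here for the targeted query that an
    item-set tree is designed to answer more efficiently.
    """
    wanted = set(itemset)
    matches = sum(1 for sentence in sentences if wanted <= sentence)
    return matches / len(sentences)


if __name__ == "__main__":
    print(sorted(frequent_items(SENTENCES, 0.5)))
    print(targeted_support(SENTENCES, {"localizes", "to"}))
```

With a threshold of 0.5, the filter keeps only the common items and drops "localizes", even though it could mark a cellular-component sentence. The targeted query still reports a non-zero support for the item set {"localizes", "to"}, so a low-frequency feature can be examined rather than lost.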