
Author: Wang, Jian-Fu (王建富)
Thesis title: Using Evolutional Dependency Parse Trees for Biological Relation Extraction (利用一種演化的語法剖析樹為基礎之生物關聯擷取系統)
Advisor: Kao, Hung-Yu (高宏宇)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2010
Graduation academic year: 98 (ROC calendar)
Language: English
Pages: 76
Keywords (Chinese): 基因, 蛋白質, 交互作用, 調控, 關聯, 擷取, 路徑, 子樹, 演化樹
Keywords (English): gene, protein, interaction, regulatory, relation, extraction, dependency path, dependency subtree, Evolutional Tree
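  • In recent years, the rapid development of biomedical informatics has produced a large volume of research results and information. For biologists, keeping up with the latest or most useful information in this large body of biomedical literature is difficult and time-consuming. Developing a high-performance information extraction system that helps biologists collect, maintain, and mine this information has therefore become an important research topic, and it has remained a significant challenge in recent years.
    Broadly speaking, there are many approaches to biological relation extraction. A commonly used one in recent years is based on natural language processing: a dependency parse tree is used to analyze the semantics, parts of speech, and syntactic structure of a sentence. Many studies that take this approach suggest learning from the path between two genes or proteins in the parse tree rather than from the whole tree, in order to avoid learning unnecessary information, since the path usually contains the information needed to decide whether the sentence describes a biological relation. However, the information on the path alone is not enough: in some sentences the path contains only the two entities and nothing else. In addition, some syntactic structures do not occur in the original training set, so the parse trees in that set cannot be used to decide whether such sentences describe a biological relation. To address these problems, this study proposes evolutional dependency parse trees. We enlarge the path between the two entities into a subtree that also includes the nodes beneath the two entities, and we expand and prune these subtrees into a variety of extended subtrees. We refer to the subtrees and extended subtrees collectively as Evolutional Trees and use them to extract sentences that describe biological relations. In our experiments, the method improves on other approaches by about 10% on gene regulation relation extraction and achieves 88% precision on the LLL protein interaction data set.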

    A large amount of new biological research results and information has been published as a result of the rapid growth of biological technology. However, it is difficult and time-consuming for biologists to acquire new or useful information from this vast literature. The development of high-quality information extraction systems that help biologists collect, maintain, and discover the knowledge needed for research has therefore become a significant research topic, and it remains a challenge.
    Recently proposed approaches to relation extraction are based on Natural Language Processing (NLP) techniques. Most NLP-based approaches to biological relation extraction suggest using the dependency path between two genes/proteins instead of the whole dependency parse tree, because the path strips out many unnecessary words that have no effect on relation learning. However, the dependency path may contain no nodes other than the two entities. Moreover, when a limited annotated corpus is used to learn the tree information of biological relationships, the corpus is incomplete: it lacks some sentence structures, so sentences with those structures cannot be classified as expressing a biological relationship or not. To address these problems, we developed a biological relation extraction system called the “Evolutional Tree Extraction System” (ETree). We extend the dependency path to a dependency subtree, and we propose a method that automatically expands and prunes the existing dependency subtrees into various new dependency subtrees. We call all of these dependency subtrees “Evolutional Trees” and use them to predict whether a sentence expresses a biological relationship. In our experiments, the method achieves 88% precision on the LLL data set.
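
    The difference between a dependency path and the enlarged dependency subtree described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the toy tree is hand-built rather than parser output, the function names (`dependency_path`, `dependency_subtree`) are hypothetical, and the expansion and pruning steps are not shown because the abstract does not specify them. The only assumption taken from the abstract is that the subtree adds the nodes beneath the two entities to the path between them.

```python
from collections import defaultdict

def path_to_root(node, head):
    """Return the nodes from `node` up to the root, following head links."""
    path = [node]
    while head[path[-1]] is not None:
        path.append(head[path[-1]])
    return path

def dependency_path(a, b, head):
    """Nodes on the tree path between a and b, both ends included."""
    up_a = path_to_root(a, head)
    up_b = path_to_root(b, head)
    ancestors_a = set(up_a)
    # Lowest common ancestor: first node above b that also lies above a.
    lca = next(n for n in up_b if n in ancestors_a)
    left = up_a[:up_a.index(lca) + 1]   # a ... lca
    right = up_b[:up_b.index(lca)]      # b ... child of lca
    return left + right[::-1]

def dependency_subtree(a, b, head):
    """Path between a and b plus every node beneath either entity."""
    children = defaultdict(list)
    for child, parent in head.items():
        if parent is not None:
            children[parent].append(child)
    nodes = set(dependency_path(a, b, head))
    for entity in (a, b):
        stack = list(children[entity])
        while stack:                    # collect all descendants of the entity
            n = stack.pop()
            nodes.add(n)
            stack.extend(children[n])
    return nodes

# Hand-built toy tree (token -> head token) for the made-up sentence
# "phosphorylated GENE_A activates GENE_B in vitro".
head = {
    "activates": None,
    "GENE_A": "activates",
    "phosphorylated": "GENE_A",
    "GENE_B": "activates",
    "vitro": "activates",
    "in": "vitro",
}

print(dependency_path("GENE_A", "GENE_B", head))
# ['GENE_A', 'activates', 'GENE_B']
print(sorted(dependency_subtree("GENE_A", "GENE_B", head)))
# ['GENE_A', 'GENE_B', 'activates', 'phosphorylated']
```

    In this toy case the path keeps only the verb between the two entities, while the subtree also keeps the modifier "phosphorylated" attached to GENE_A. The phrase "in vitro" is excluded from both because it hangs from the verb, not from an entity.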

    CHINESE ABSTRACT
    ABSTRACT
    CONTENT
    FIGURE LISTING
    TABLE LISTING
    1. INTRODUCTION
        1.1 Background
        1.2 Motivation
    2. RELATED WORK
        2.1 Related Resources and Tools
            2.1.1 PubMed
            2.1.2 AIIA-GMT
            2.1.3 Stanford Parser
        2.2 Related Biological Relation Extraction Systems
            2.2.1 Co-occurrence-based Approach
            2.2.2 Pattern-based Approach
            2.2.3 NLP-based Approach
    3. OUR PROPOSED METHOD
        3.1 Preliminary Analysis
        3.2 System Overview
            3.2.1 Preprocessing
        3.3 Evolutional Tree Learning Module
            3.3.1 Establishment of Seed Tree Set
            3.3.2 Dependency Subtree Expansion
            3.3.3 Dependency Subtree Pruning
            3.3.4 Dependency Subtree Weighting Strategy
        3.4 Relation Extraction Module
    4. EXPERIMENT
        4.1 Data Sets
        4.2 Performance Metrics
        4.3 Performance Evaluation of Our Method
            4.3.1 Evaluation with Dependency Subtree Expansion
            4.3.2 Evaluation with Dependency Subtree Pruning
            4.3.3 Evaluation with Different Training Sets
        4.4 Experiments on the Relation Extraction Module
            4.4.1 Applying a Weighting Strategy to Our Method
            4.4.2 Evaluation with Different Similarity Computing Methods (Part I)
            4.4.3 Evaluation with Different Similarity Computing Methods (Part II)
        4.5 Performance Comparison with Other Systems
            4.5.1 Evaluation with Our Three PubMed Data Sets
            4.5.2 Evaluation with LLL Data Set
    5. CONCLUSION AND FUTURE WORK
        5.1 Conclusion
        5.2 Future Work
    6. REFERENCES


    Full text availability: on campus from 2011-08-13; off campus from 2012-08-13