
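Author: Liu, Yong-Xi (劉詠熙)
Thesis title: Automated PageRank-based Sentences Ranking to Identify Protein Relations from Literature
Original title (Chinese): 自動化應用PageRank由生醫文件中辨識蛋白質交互作用之句子
Advisor: Chiang, Jung-Hsien (蔣榮先)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: Chinese
Number of pages: 62
Keywords (translated from Chinese): protein interaction text mining
Keywords (English): Protein-Protein Interaction, PageRank, Text Mining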
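  • With advances in information science, a large body of biomedical research results is recorded in the literature. For protein-protein interactions, related databases exist, but the data they provide are limited, and their interaction information is extracted from the biomedical literature and verified manually, which is time-consuming and expensive. In this study we propose an automated process for extracting protein-protein interactions. It uses hierarchical template matching to find sentences that describe protein interactions, builds links among those sentences, and ranks them with a modified PageRank algorithm to identify the interaction relations between proteins. We implemented an automated text-mining system that not only verifies important protein relations recorded in KEGG pathways but also finds a large number of potential protein relations. Compared with existing similar systems and databases, the system provides richer and more plentiful information, such as evidence sentences for protein interactions and the interaction relations themselves. The final experimental results also show that the system performs well at finding protein interaction information.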

    With advances in information and computational techniques, a growing body of biomedical research is reported in public databases such as PubMed. For protein-protein interactions, several databases exist whose curators manually extract and verify interaction data from the biomedical literature, but these databases offer limited information, and manual curation is time-consuming and labor-intensive. To improve the extraction of protein-protein interactions, we implemented an automated framework that combines hierarchical template-based sentence matching with PageRank-based sentence ranking. Using this framework, we extract interaction evidence sentences and the interaction relations they describe. In this research, we implement a text-mining system that identifies many important relations in the KEGG pathway database and discovers a large number of novel relations that could potentially extend existing protein interaction and pathway databases.
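
    The abstract does not specify how the sentence links or the modified PageRank are defined, so the following is only a minimal sketch of the general idea of ranking sentences on a link graph. It uses standard weighted PageRank, not the thesis's modification. The Jaccard word-overlap link weight, the function names `jaccard` and `rank_sentences`, the damping factor of 0.85, and the example sentences are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Word-overlap similarity between two token sets (an assumed link weight)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Rank sentences with standard weighted PageRank on a similarity graph.

    Hypothetical helper: the thesis's own link definition and PageRank
    modification are not given in this record.
    """
    n = len(sentences)
    if n == 0:
        return []
    tokens = [set(s.lower().split()) for s in sentences]

    # Symmetric weight matrix: weights[i][j] is the link strength between i and j.
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weights[i][j] = weights[j][i] = jaccard(tokens[i], tokens[j])
    strength = [sum(row) for row in weights]

    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each linked sentence j passes on a share of its score
            # proportional to the weight of its link to sentence i.
            incoming = sum(
                scores[j] * weights[j][i] / strength[j]
                for j in range(n)
                if strength[j] > 0
            )
            new_scores.append((1.0 - damping) / n + damping * incoming)
        scores = new_scores

    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [(sentences[i], scores[i]) for i in order]


if __name__ == "__main__":
    # Made-up example sentences, not taken from the thesis data.
    examples = [
        "RAF1 phosphorylates MEK1",
        "RAF1 activates MEK1",
        "MEK1 phosphorylates ERK2 kinase",
        "The weather was cold",
    ]
    for sentence, score in rank_sentences(examples):
        print(f"{score:.4f}  {sentence}")
```

    In this sketch a sentence that shares no words with any other receives only the baseline score, so sentences supported by many similar sentences rise to the top. A real system would presumably link only sentences that mention the same protein pair and would weight links by something more informative than word overlap.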

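    Chapter 1 Introduction
      1.1 Preface
      1.2 Research Motivation
      1.3 Solution Approach
      1.4 Thesis Organization
    Chapter 2 Related Work
      2.1 Bioinformatics and Related Resources
        2.1.1 NCBI PubMed
        2.1.2 KEGG
      2.2 Document Analysis and Related Techniques
        2.2.1 Information Extraction
        2.2.2 Natural Language Processing
        2.2.3 Finding Protein Interactions
      2.3 Analysis of Protein Relations
        2.3.1 Social Network Analysis
        2.3.2 The PageRank Algorithm
    Chapter 3 Analysis of Protein-Pair Relations
      3.1 The Basic Unit of Document Analysis
      3.2 System Architecture
      3.3 Document Preprocessing
      3.4 The Concept of Co-occurrence
        3.4.1 Interaction Keywords
        3.4.2 Protein Name Recognition (Named Entity Recognition)
        3.4.3 Protein Name Resolution
      3.5 Natural Language Processing
        3.5.1 Part-of-Speech Tagging and Chunk Parsing (POS Tagger and Chunk Parser)
        3.5.2 Sentence Structure Processing
      3.6 Template Matching
      3.7 Finding Protein Interaction Relations
        3.7.1 Links Between Sentences
        3.7.2 Computing Link Strength
        3.7.3 Ranking Sentences with PageRank
    Chapter 4 System Application Combined with Pathways
      4.1 Data Sources and Collection
      4.2 System Input
      4.3 Protein-Pair-Centered Analysis Targets
        4.3.1 Protein Information
        4.3.2 Articles Related to a Protein Pair
        4.3.3 Sentences Containing Interaction Information
        4.3.4 Annotated Relations Verified in KEGG
        4.3.5 Protein Relation Analysis
    Chapter 5 Experimental Design and Analysis
      5.1 Datasets
        5.1.1 Pathway Data
        5.1.2 Literature Data
        5.1.3 Protein Interaction Data
      5.2 Experimental Design and Discussion of Results
        5.2.1 Performance of Protein Interaction Identification
        5.2.2 Evaluation of Protein Interaction Relation Discovery
        5.2.3 Effect of Sentence-to-Sentence Links on Relation Ranking
        5.2.4 Rank Distribution of Verified Relations in the Overall System
    Chapter 6 Conclusions and Future Work
      6.1 Conclusions
      6.2 Future Work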


    Full text available on campus from 2008-07-12 and off campus from 2009-07-12.