| Graduate student: | 柯兆軒 Ke, Chao-Hsuan |
|---|---|
| Thesis title: | 生醫文獻訊息之擷取與應用 Effective Acquisition of Biomedical Message from Literature |
| Advisor: | 蔣榮先 Chiang, Jung-Hsien |
| Degree: | Doctor |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2013 |
| Academic year of graduation: | 101 |
| Language: | English |
| Pages: | 90 |
| Chinese keywords: | 資訊擷取, 蛋白質親和性, 生醫文獻, 生物資訊, 蛋白質, 交互作用 |
| English keywords: | information retrieval, protein binding affinity, biomedical literature, bioinformatics, protein, interaction |
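The number of biological publications indexed each year in open-access databases keeps growing. This lets researchers search the literature directly online and gives them more opportunities to find research topics or content of interest. Conversely, as more and more of the literature becomes open access, most users find it increasingly difficult to locate and use information effectively in a short time.
This study develops automated extraction algorithms for protein-binding affinity and protein-protein interaction information in biomedical literature, and stores the extracted information in a public database that researchers worldwide can access freely. By examining full-text articles related to protein-binding affinity, we first developed four sentence patterns and designed a scoring function that ranks candidate sentences to identify those that correctly describe protein-binding affinities. Experiments show that sentence-pattern detection combined with the scoring function effectively identifies the protein-binding affinities recorded in the full text of biomedical articles, saving the time and cost previously spent on manual reading to find important information.
We also propose a detection method that integrates machine learning with template patterns to extract biological pathway evidence from the literature. It helps researchers explore and extend biological pathways directly in a web browser and thereby discover new pathway associations. In addition, we propose an efficient article-ranking system that helps literature curators quickly find articles describing specific chemicals and diseases, increasing the efficiency of annotating biological events.
The main goal of this dissertation is to help researchers quickly and effectively identify important data in biomedical articles, so that they can discover the important content and information in the literature within a short time.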
With the increasing availability of full-text literature, researchers can easily and effectively keep track of interesting research topics by searching the literature online. At the same time, because of the sheer volume of available biological literature, researchers are finding it difficult to locate the information they need. In this study, we focused on two major topics: the extraction of data deposition statements and literature triage over large collections of biomedical articles.
For the extraction of deposition data, we focused on protein-binding affinities and protein-protein interactions. We developed four sentence patterns that were used to scan full-text articles, together with a scoring function to rank the sentences that match our patterns. The proposed sentence patterns can effectively identify protein-binding affinities in full-text articles. Furthermore, we proposed an integrated machine-learning and rule-based algorithm to extract biological evidence of protein interactions occurring in MEDLINE full-text articles. In addition, an interactive web-based platform was constructed for researchers to visualize and manipulate existing KEGG pathways using the Cytoscape Web API. Lastly, a flexible ranking system was developed to assist curators in triaging articles relevant to chemical-gene-disease relationships.
The major goal of this study is to help researchers identify the most important information in biomedical articles and keep up to date on progress and discoveries in a short time.
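The pattern-plus-scoring idea summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the four sentence patterns or the scoring function, so the three cue patterns, their weights, and the function names `score_sentence` and `rank_sentences` below are hypothetical stand-ins chosen only for illustration, not the method used in the dissertation.

```python
import re
from typing import List, Tuple

# Hypothetical cues (assumptions, not the dissertation's actual patterns).
AFFINITY_TERM = re.compile(
    r"\b(?:Kd|Ki|IC50|EC50|binding affinity|dissociation constant)\b",
    re.IGNORECASE,
)
VALUE_WITH_UNIT = re.compile(r"\b\d+(?:\.\d+)?\s*(?:pM|nM|uM|µM|mM)\b")
BINDING_VERB = re.compile(r"\b(?:binds?|bound|inhibits?|inhibited)\b", re.IGNORECASE)

# Hypothetical weights: a numeric value with a concentration unit is
# treated as the strongest evidence of an affinity statement.
CUES = [
    (AFFINITY_TERM, 1.0),
    (VALUE_WITH_UNIT, 2.0),
    (BINDING_VERB, 0.5),
]


def score_sentence(sentence: str) -> float:
    """Sum the weights of all cue patterns found in the sentence."""
    return sum(weight for pattern, weight in CUES if pattern.search(sentence))


def rank_sentences(sentences: List[str]) -> List[Tuple[float, str]]:
    """Return (score, sentence) pairs with score > 0, highest score first."""
    scored = [(score_sentence(s), s) for s in sentences]
    matched = [pair for pair in scored if pair[0] > 0]
    return sorted(matched, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    examples = [
        "Compound A binds thrombin with a Kd of 12.5 nM.",
        "The protein was purified by gel filtration.",
        "The IC50 was measured in triplicate.",
    ]
    for score, sentence in rank_sentences(examples):
        print(f"{score:.1f}\t{sentence}")
```

In this sketch a sentence that names an affinity measure, reports a value with a unit, and uses a binding verb ranks above one that only mentions the measure, and sentences with no cue are dropped. A real system would need patterns and weights tuned on annotated full-text articles.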