
Graduate student: 劉恒惠
Liu, Heng-Hui
Thesis title: 智慧型生醫文獻摘要系統之研究
Intelligent Biomedical Information Summarization for Omic Study
Advisor: 蔣榮先
Chiang, Jung-Hsien
Degree: Doctor
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2011
Academic year of graduation: 99
Language: English
Number of pages: 66
Keywords (Chinese): 生物資訊, 文件探勘, 基因名稱正規化
Keywords (English): Bioinformatics, text mining, gene normalization
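  • With advances in detection technology, current research has adopted the so-called "omics" concept, whose defining feature is high-throughput analysis of the relationships among a group of genes. This makes literature review, which formerly targeted a single gene or a few genes, considerably heavier. For the omics era, new literature mining tools must be able to build correct indexes of gene and protein names across a massive body of literature so that users can accurately retrieve gene-related articles, and must use gene annotation information and article content to determine relationships among genes. They must also make effective use of different types of literature resources, including gene annotations, abstracts, and full texts, to improve the efficiency of literature use.
    In this dissertation, a novel algorithm is proposed for gene name normalization during indexing, to mitigate the problem of ambiguous gene names (one name with multiple meanings). In addition, the study proposes a method for measuring the semantic similarity between gene functional annotation terms; measuring the functional similarity of different genes helps users quickly find associations among them. From the relevant abstracts, techniques such as natural language processing and finite-state machines are used to extract sentences that describe relationships between genes and other genes, functions, and diseases for the user's reference. Finally, using information retrieval methods and document summarization techniques, the study proposes a paragraph ranking algorithm that takes the information in the abstract as query terms, evaluates each paragraph of the full text for its relevance to the abstract and its importance, and recommends the more important paragraphs to users for browsing or use, to improve the efficiency of reading full-text articles.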

    Biomedical research has entered the omics era. Omic studies feature high-throughput analysis and the investigation of relationships among a group of genes. This increases the burden of literature surveys covering such large numbers of genes and creates a pressing need for suitable text mining tools. A text mining tool for omic study should index genes and proteins reliably and be able to summarize information from various types of resources.
    In this dissertation, a novel gene normalization algorithm is proposed to improve indexing accuracy. The approach integrates a maximum entropy model with a fuzzy approach to deal with ambiguous gene mentions, and achieved improved performance. To group related genes, a novel semantic similarity measure for Gene Ontology terms was implemented so that genes sharing similar annotations can be grouped. In abstracts, sentences describing the relationships between a gene and other genes, functions, and diseases provide essential information; therefore, a practical information extraction method is proposed to extract such sentences from abstracts. Beyond annotations and abstracts, full-text articles carry more detailed information, but their length raises the time cost of reading. The proposed paragraph ranking approach uses keywords from the abstract to recommend important paragraphs of the full text for efficient reading.
    To meet the requirements of a text mining tool for omic study, this study proposes novel and practical approaches and evaluates their performance.
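
    As a rough illustration of the paragraph-ranking idea mentioned in the abstract (using abstract content as the query and scoring full-text paragraphs against it), the following sketch ranks paragraphs by TF-IDF cosine similarity. It is a toy under stated assumptions: the names `tokenize` and `rank_paragraphs`, the tokenizer, and the TF-IDF weighting are illustrative choices and are not the ranking algorithm developed in the dissertation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (simplistic on purpose)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_paragraphs(abstract, paragraphs):
    """Return (paragraph_index, score) pairs, highest cosine similarity first.

    Hypothetical helper: plain TF-IDF weighting is assumed here purely for
    illustration; it is not the method evaluated in the dissertation.
    """
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n = len(docs)
    # Document frequency: in how many paragraphs each term occurs.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed inverse document frequency, always positive.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

    def weigh(counts):
        # Terms never seen in any paragraph cannot match, so they are dropped.
        return {t: tf * idf[t] for t, tf in counts.items() if t in idf}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        if norm_u == 0.0 or norm_v == 0.0:
            return 0.0
        return dot / (norm_u * norm_v)

    query = weigh(Counter(tokenize(abstract)))
    scores = [(i, cosine(query, weigh(doc))) for i, doc in enumerate(docs)]
    # sorted() is stable, so paragraphs with equal scores keep document order.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    abstract = "BRCA1 mutation increases breast cancer risk"
    paragraphs = [
        "We thank the funding agency for support.",
        "Carriers of a BRCA1 mutation showed elevated breast cancer risk.",
        "Samples were stored at low temperature.",
    ]
    for index, score in rank_paragraphs(abstract, paragraphs):
        print(index, round(score, 3))
```

    In this toy input only the second paragraph shares vocabulary with the abstract, so it is ranked first and the other two receive a score of zero. A realistic system would also need gene name normalization before matching, since the same gene can appear under several names.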

    Abstract (in Chinese)
    ABSTRACT
    ACKNOWLEDGEMENT
    TABLE OF CONTENTS
    LIST OF TABLES
    LIST OF FIGURES
    Chapter 1. INTRODUCTION
        1.1 Motivation
        1.2 Objective
        1.3 Algorithms for Intelligent Biomedical Summarization
            1.3.1 Gene Normalization
            1.3.2 Gene Relation Inquiry from Annotation and Abstracts
            1.3.3 Full-Text Summarization Using Paragraph Ranking
        1.4 Organization of Dissertation
    Chapter 2. RELATED WORKS
        2.1 Gene Ontology
        2.2 Performance Measures
        2.3 Natural Language Processing Tools
        2.4 Maximum Entropy Model
        2.5 BioCreAtIvE
    Chapter 3. GENE NORMALIZATION
        3.1 Background
        3.2 Methods
            3.2.1 Gene mention recognition
            3.2.2 Matching gene mentions to corresponding identifiers
            3.2.3 Fuzzy set representation of ambiguous mention
            3.2.4 Maximum entropy classifiers as membership functions
            3.2.5 Information fusion for handling ambiguous mentions
        3.3 Experiment and Results
            3.3.1 Materials
            3.3.2 Selection of gene mention recognition tools
            3.3.3 Evaluation of morphological rules
            3.3.4 Performance of classifiers
        3.4 Summary
    Chapter 4. GENE RELATION SUMMARIZATION
        4.1 Background
        4.2 Methods
            4.2.1 GeneCluster: Measuring semantic similarity of GO term
            4.2.2 GeneSum: Extracting relations of genes from abstract
        4.3 Experiments and results
            4.3.1 Evaluation of GeneCluster
            4.3.2 Evaluation of GeneSum
        4.4 Summary
    Chapter 5. FULL-TEXT SUMMARIZATION
        5.1 Background
        5.2 Methods
            5.2.1 Pre-processing
            5.2.2 Paragraph Relevance
            5.2.3 PR-ISR
            5.2.4 Abstract-related condensed text
        5.3 Experiments
            5.3.1 Experimental settings
            5.3.2 Materials
            5.3.3 Algorithms for comparison
            5.3.4 Information overlapping and paragraph importance
            5.3.5 Assessing agreement with human opinion
            5.3.6 Evaluation qualities of condensed text
        5.4 Results and Discussion
            5.4.1 Analysis of annotation of paragraphs
            5.4.2 Importance of retrieved paragraphs
            5.4.3 Information coverage of condensed text
        5.5 Summary
    Chapter 6. CONCLUSION AND FUTURE WORKS
    REFERENCES


    Download availability, on campus: public from 2012-08-31
    Download availability, off campus: public from 2012-08-31