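Graduate student: | 吳傳揚 Wu, Chuan-Yang |
---|---|
Thesis title: | 由生物醫學文件中淬取基因功能註解 Extracting Gene Function from Biomedical Articles |
Advisor: | 蔣榮先 Chiang, Jung-Hsien |
Degree: | 碩士 Master |
Department: | 電機資訊學院 - 資訊工程學系 Department of Computer Science and Information Engineering |
Year of publication: | 2004 |
Academic year of graduation: | 92 (ROC calendar) |
Language: | Chinese |
Number of pages: | 102 |
Keywords (Chinese): | 分類器 (classifier), 資訊擷取 (information extraction), 基因功能 (gene function), 自然語言處理 (natural language processing) |
Keywords (English): | classifier, natural language processing, information extraction, gene function |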
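With the spread of the Internet, many biological research institutions around the world can publish their results online, which has driven rapid growth in biomedical electronic journals. Because this literature grows so quickly, researchers in the biological sciences often cannot obtain important information from it quickly by reading, such as gene functions, gene-gene interactions, and biological pathways. To address this problem, this thesis combines information extraction techniques with a classifier to filter and extract the important “gene/protein function annotation” information from documents.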
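This thesis consists of two main parts: classification and information extraction. The first part uses a classifier and includes sentence filtering: gene-related terms from a biological dictionary are string-matched against the sentences of an article, which selects the sentences in which a gene name has been tagged. The trained classifier then identifies functional information in those tagged sentences. The classifier has two classes, “accepted” and “rejected”: a sentence that describes a relationship between a gene and a function is classified as “accepted”, and a sentence that does not is classified as “rejected”. This mechanism finds the important gene and function information in an article.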
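The sentence-filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the `GENE_TERMS` list, the naive sentence splitter, and the function names are all assumptions introduced here, since the record does not specify the dictionary or the matching rules.

```python
import re

# Hypothetical mini-dictionary of gene names. The thesis uses a biological
# dictionary whose contents are not given in this record.
GENE_TERMS = ["BRCA1", "TP53", "p53"]

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tag_gene_sentences(text, gene_terms=GENE_TERMS):
    """Return (sentence, matched_terms) for sentences that mention a dictionary term."""
    # Match whole tokens only, so that "p53" does not fire inside "TP53".
    patterns = {
        term: re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        for term in gene_terms
    }
    tagged = []
    for sentence in split_sentences(text):
        hits = [term for term, pattern in patterns.items() if pattern.search(sentence)]
        if hits:
            tagged.append((sentence, hits))
    return tagged

article = (
    "BRCA1 is involved in DNA repair. The samples were stored at 4 degrees. "
    "Loss of p53 impairs apoptosis."
)
for sentence, genes in tag_gene_sentences(article):
    print(genes, "->", sentence)
```

Only the sentences that mention a dictionary term survive this filter; they are the candidates that the classifier would then label as “accepted” or “rejected”.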
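The second part is information extraction, which includes natural language processing and knowledge base integration. The sentences found in the first part are analyzed and integrated: sentences that carry the same functional annotation are merged into a single piece of information, while a functional annotation that differs from the existing ones is treated as new information. The result presents richer related information to researchers in the biological sciences.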
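The merge-or-add behaviour of the knowledge base can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `KnowledgeBase` class, its `add` method, and the use of simple string normalization as a crude stand-in for the natural language processing that the thesis applies.

```python
import re
from collections import defaultdict

def normalize(phrase):
    """Lowercase, drop punctuation, collapse whitespace (a stand-in for real NLP analysis)."""
    phrase = re.sub(r"[^\w\s]", " ", phrase.lower())
    return " ".join(phrase.split())

class KnowledgeBase:
    """Hypothetical store: gene -> {normalized function -> supporting sentences}."""

    def __init__(self):
        self.entries = defaultdict(dict)

    def add(self, gene, function, sentence):
        """Merge a repeated annotation into the existing entry; return True if it is new."""
        key = normalize(function)
        known = self.entries[gene.upper()]
        is_new = key not in known
        known.setdefault(key, []).append(sentence)
        return is_new

kb = KnowledgeBase()
print(kb.add("BRCA1", "DNA repair", "BRCA1 is involved in DNA repair."))
print(kb.add("brca1", "DNA-repair.", "We confirm that BRCA1 acts in DNA-repair."))
print(kb.add("BRCA1", "transcriptional regulation", "BRCA1 takes part in transcriptional regulation."))
```

In this sketch the second call is merged into the first entry because both function phrases normalize to the same key, while the third call creates a new entry for the same gene.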
With the increasing popularity of the Internet, biological research institutions worldwide are able to publish their work electronically, resulting in the fast growth of online biomedical documents. Yet the vast amount of information available has hindered scientists and researchers from efficiently discovering significant knowledge, such as gene functions, protein-protein interactions, and biological pathways, in the biomedical literature. In this thesis, we propose a methodology that combines Information Extraction (IE) with a classifier to identify important gene function information by filtering and extracting gene and/or protein function annotations from unstructured biomedical documents.
The strategy proposed in this thesis comprises two independent components: classification and information extraction. The Naïve Bayes method was adopted to identify function sentences according to the feature list created in the previous phase, and it classifies every candidate sentence as “accepted” or “rejected”. Only “accepted” candidates are treated as function annotations. The key mission of the information extraction component is to merge repeated function information into a single entry and to identify new function information by means of natural language processing and a knowledge base.
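A two-class Naïve Bayes sentence classifier of the kind mentioned above might look like the following sketch. It assumes a multinomial model over bag-of-words features with add-one (Laplace) smoothing; the feature list used in the thesis is not given in this record, and the training sentences below are invented purely for illustration.

```python
import math
from collections import Counter

def tokenize(sentence):
    """Lowercase whitespace tokenizer (an assumption; the thesis features are not specified)."""
    return sentence.lower().split()

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, sentences, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(tokenize(sentence))
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def log_score(self, sentence, c):
        """Log prior plus summed log likelihoods of the known words under class c."""
        total = sum(self.word_counts[c].values())
        score = math.log(self.doc_counts[c] / sum(self.doc_counts.values()))
        for word in tokenize(sentence):
            if word in self.vocab:  # words never seen in training are ignored
                score += math.log((self.word_counts[c][word] + 1) / (total + len(self.vocab)))
        return score

    def predict(self, sentence):
        return max(self.classes, key=lambda c: self.log_score(sentence, c))

# Invented toy training data, not taken from the thesis.
train = [
    ("brca1 is involved in dna repair", "accepted"),
    ("p53 regulates apoptosis", "accepted"),
    ("tp53 is required for cell cycle arrest", "accepted"),
    ("brca1 was cloned in 1994", "rejected"),
    ("p53 samples were stored overnight", "rejected"),
    ("tp53 primers were purchased commercially", "rejected"),
]
model = NaiveBayes().fit([s for s, _ in train], [y for _, y in train])
print(model.predict("myc regulates dna repair"))
print(model.predict("samples were stored in 1994"))
```

Scores are summed in log space to avoid numerical underflow on longer sentences, and smoothing keeps a word that is missing from one class from driving that class's probability to zero.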