Author: Wu, Chuan-Yang (吳傳揚)
Thesis title: Extracting Gene Function from Biomedical Articles (由生物醫學文件中淬取基因功能註解)
Advisor: Chiang, Jung-Hsien (蔣榮先)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2004
Academic year of graduation: 92 (ROC calendar)
Language: Chinese
Number of pages: 102
Keywords (Chinese): 分類器, 資訊擷取, 基因功能, 自然語言處理
Keywords (English): classifier, natural language processing, information extraction, gene function
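  •   With the spread of the Internet, biological research institutions around the world can publish their results online, which has driven rapid growth in electronic biomedical journals. The sheer volume of these fast-growing journals makes it hard for researchers in biology to quickly find important information such as gene functions, gene-gene interactions, and biological pathways. To address this problem, this thesis combines information extraction techniques with a classifier to filter and extract the important "gene/protein function annotation" information contained in documents.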
     
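      This thesis consists of two main parts: classification and information extraction. The first part uses a classifier and includes sentence filtering: gene-related terms from a biological dictionary are string-matched against the sentences of an article, which selects the sentences in which a gene name has been tagged. A trained classifier then identifies function information in these gene-tagged sentences. The classifier has two classes, "accepted" and "rejected": a sentence that describes the relationship between a gene and a function is classified as "accepted", and a sentence that does not is classified as "rejected". This mechanism locates the important gene and function information in an article.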
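
      The sentence-filtering step described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the two-entry dictionary, the naive sentence splitter, and the function names `split_sentences` and `tag_gene_sentences` are all assumptions made for illustration, and the actual biological dictionary is not reproduced in this record.

```python
import re

# Hypothetical mini-dictionary for illustration only; the thesis's
# biological dictionary is not reproduced in this record.
GENE_DICTIONARY = {"BRCA1", "TP53"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_gene_sentences(text, dictionary=GENE_DICTIONARY):
    """Return (sentence, matched gene names) for sentences that name a gene."""
    tagged = []
    for sentence in split_sentences(text):
        # Exact, case-sensitive token match against the dictionary.
        tokens = set(re.findall(r"[A-Za-z0-9-]+", sentence))
        genes = sorted(tokens & dictionary)
        if genes:
            tagged.append((sentence, genes))
    return tagged


example = (
    "BRCA1 is involved in DNA repair. "
    "Samples were stored overnight. "
    "TP53 regulates the cell cycle."
)
for sentence, genes in tag_gene_sentences(example):
    print(genes, sentence)
```

      Only the sentences that name a dictionary entry are kept. Exact token matching is the simplest possible choice here; a real dictionary would also need to handle synonyms, multi-word names, and case variants.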

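      The second part is information extraction, which includes natural language processing and knowledge base integration. The sentences found in the first part are analyzed and integrated: sentences carrying the same functional annotation are merged into a single piece of information, while a functional annotation that differs from those already found is treated as new information. This presents richer related information to researchers in biology.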
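
      The merge-or-add behaviour described above can be sketched in a few lines of Python. This is a deliberately simplified stand-in: the thesis relies on natural language processing and a knowledge base to decide whether two annotations are the same, whereas this sketch only compares normalized text. The names `normalize` and `merge_annotation`, the dictionary-based knowledge base, and the `doc-1` style source labels are hypothetical.

```python
import re


def normalize(text):
    """Lowercase and drop punctuation so trivially different wordings match."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def merge_annotation(knowledge_base, gene, function_text, source):
    """Merge a repeated annotation or record a new one.

    Returns True when the annotation is new, False when it was merged
    into an existing entry.
    """
    key = (gene, normalize(function_text))
    if key in knowledge_base:
        knowledge_base[key].append(source)  # same annotation: keep one entry
        return False
    knowledge_base[key] = [source]  # different annotation: new information
    return True


kb = {}
merge_annotation(kb, "BRCA1", "Involved in DNA repair.", "doc-1")
merge_annotation(kb, "BRCA1", "involved in DNA repair", "doc-2")
merge_annotation(kb, "BRCA1", "Regulates transcription.", "doc-3")
print(len(kb))  # two distinct annotations remain for BRCA1
```

      Keeping the list of sources per entry means that merging does not discard where each repeated annotation was found.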

      With the increasing popularity of the Internet, biological research institutions worldwide are able to publish their work electronically, resulting in the rapid growth of online biomedical documents. Yet the vast amount of available information has hindered scientists and researchers from efficiently discovering significant knowledge, such as gene functions, protein-protein interactions, and biological pathways, in the biomedical literature. In this thesis, we propose a methodology that combines Information Extraction (IE) with a classifier to identify important gene function information by filtering and extracting gene and/or protein function annotations from unstructured biomedical documents.

      The strategy proposed in this thesis comprises two independent components: classification and information extraction. The Naïve Bayes method is adopted to identify function sentences according to the feature list created in the previous phase, and it classifies every candidate sentence as either "accepted" or "rejected". Only "accepted" candidates are treated as function annotations. The key mission of the information extraction component is to merge repeated function information into a single entry and to identify new function information by means of natural language processing and a knowledge base.
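
      As a rough illustration of the accepted/rejected classification step, the following Python sketch trains a multinomial Naïve Bayes model with add-one (Laplace) smoothing on a handful of toy sentences. Everything in it is an assumption for illustration: the class name, the whitespace tokenizer, the use of every word as a feature (the thesis uses a feature list from an earlier phase), and the invented training sentences.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lowercase whitespace tokenizer (an assumption, not the thesis's)."""
    return sentence.lower().split()


class NaiveBayesSentenceClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, sentences, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(tokenize(sentence))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, sentence):
        total = sum(self.class_counts.values())
        scores = {}
        for c in self.classes:
            # log P(class) + sum of log P(word | class)
            score = math.log(self.class_counts[c] / total)
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for word in tokenize(sentence):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((self.word_counts[c][word] + 1) / denom)
            scores[c] = score
        return max(scores, key=scores.get)


# Toy training data, invented for this sketch.
train_sentences = [
    "brca1 regulates dna repair",
    "tp53 activates apoptosis",
    "brca1 was cloned in 1994",
    "tp53 samples were stored overnight",
]
train_labels = ["accepted", "accepted", "rejected", "rejected"]

clf = NaiveBayesSentenceClassifier().fit(train_sentences, train_labels)
print(clf.predict("myc regulates apoptosis"))
print(clf.predict("samples were stored"))
```

      Working in log space avoids numerical underflow when many word probabilities are multiplied, and the add-one smoothing keeps a word seen in only one class from driving the other class's probability to zero.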

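    Chapter 1 Introduction
      1.1 System Overview
      1.2 Motivation
      1.3 Approach
      1.4 Thesis Organization
    Chapter 2 Literature Review
      2.1 Information Classification Methods
        2.1.1 Decision Tree Models
        2.1.2 Probabilistic Models
      2.2 Natural Language Processing
      2.3 Information Systems That Analyze Medical Documents
    Chapter 3 System Overview of Gene Function Annotation
      3.1 Relationship Between Function Annotation Information and Biomedical Documents
      3.2 System Architecture
      3.3 Identifying Function Annotation Information in Biomedical Documents
      3.4 Integrating Function Annotation Information
    Chapter 4 Identification of Gene Function Annotations
      4.1 Feature Selection for Biomedical Documents
        4.1.1 Document Preprocessing
        4.1.2 Feature Selection
      4.2 Training the Function Annotation Classifier
        4.2.1 Sample Collection
        4.2.2 Classifier Training
        4.2.3 Sample Prediction
      4.3 Exact Function Annotation Information and Related Function Annotation Information
    Chapter 5 Analysis and Integration of Gene Function Annotations
      5.1 Natural Language Processing
        5.1.2 Part-of-Speech Integration
      5.2 Part-of-Speech-Based Sentence Segmentation
        5.2.1 Segmenting Ordinary Sentences
        5.2.2 Segmenting Complex Sentences
      5.3 Integrating Similar Function Annotation Information
    Chapter 6 Experimental Results and Analysis
      6.1 Data Set and Document Preprocessing
        6.1.1 Data Sources
        6.1.2 Query Syntax
        6.1.3 Document Format
        6.1.4 Data Preprocessing
      6.2 Comparison of the System with "LocusLink" on Gene Function Annotation
        6.2.1 Classifier Experiments and Comparison of Results
      6.3 Performance in Distinguishing Exact from Related Function Annotations
    Chapter 7 Conclusions and Future Work
      7.1 Conclusions
      7.2 Future Work
    References
    Appendix A
    Appendix B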


    Full-text availability:
    On campus: available immediately
    Off campus: available from 2004-07-29