
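Graduate student: 甘昇玄
Gan, Sheng-Xuan
Thesis title: 生物醫學文件中基因關係摘要系統
A Summarization System for Gene Relations in Biomedical Literatures
Advisor: 蔣榮先
Chiang, Jung-Hsien
Degree: Master
College and department: College of Electrical Engineering and Computer Science
Department of Computer Science and Information Engineering
Year of publication: 2004
Academic year of graduation: 92 (ROC calendar)
Language: Chinese
Number of pages: 42
Keywords (Chinese): 生物資訊, 資訊淬取, 自動機
Keywords (English): bioinformatics, information extraction, automata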
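      Now that scientists hold a vast biological database of three billion DNA sequences, biologists can no longer interpret these data with traditional paper-based methods; organizing and analyzing data on this scale requires high-speed computation. This thesis combines data mining, natural language processing, and finite state machines to summarize information related to a query gene from a large collection of medical documents. The processing flow has three parts. (1) Using the writing conventions with which medical documents describe gene-related information, together with data mining methods, the system makes a first-pass selection of genes, functions, and diseases that may be related to the query gene. (2) It simplifies the structure of each sentence and converts it into the input format required by the next module. (3) It traces the selected candidates back to the original text and uses a finite state machine to decide whether the pattern of the original sentence describes a relation with the query gene; the related genes, functions, and diseases that match the sentence-pattern rules are extracted together with their source sentences. Finally, the summary information is organized into (1) information about related genes, (2) information about related functions, and (3) information about related diseases. This gives medical researchers information on biological pathways, protein interaction prediction, human disease genes, and protein functions, and speeds up medical research.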

      Now that biologists have built databases holding some three billion DNA sequences, analyzing data on this scale requires high-speed computation. This thesis proposes a summarization system that summarizes information related to a query gene in biomedical literature. The system combines three techniques (data mining, natural language processing, and finite state machines) and consists of three processing steps. First, it uses authors' writing habits in biomedical literature, together with data mining methods, to extract candidate related genes, functions, and diseases. Second, it tags the part of speech of each sentence token and simplifies the sentence pattern. Third, it uses a finite state machine to extract summary sentences that describe the relation between the query gene and other genes, functions, or diseases.
      Finally, the summary information is organized into three groups: (1) information about related genes, (2) information about related functions, and (3) information about related diseases. The system provides information to support research on biological pathways, protein-protein interactions, human cancer genes, and gene functions, and to speed up medical research.
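
      The second and third steps described above can be illustrated with a minimal Python sketch. This record does not give the thesis's actual tag set, sentence-pattern rules, or automaton, so everything named here is a hypothetical stand-in: the tags `QUERY`, `ENTITY`, `VERB`, and `OTHER`, the transition table, and the functions `simplify` and `matches_relation`. The sketch only shows how a finite state machine might accept a simplified sentence in which a relation verb links the query gene to a candidate gene, function, or disease.

```python
# Hypothetical simplified tags; the thesis's real tag set is not given in this record.
QUERY, ENTITY, VERB, OTHER = "QUERY", "ENTITY", "VERB", "OTHER"

# Transition table: state -> {tag: next state}.
# A tag with no entry for the current state leaves the state unchanged.
TRANSITIONS = {
    "start": {QUERY: "seen_query", ENTITY: "seen_entity"},
    "seen_query": {VERB: "query_verb"},
    "seen_entity": {VERB: "entity_verb"},
    "query_verb": {ENTITY: "accept"},
    "entity_verb": {QUERY: "accept"},
    "accept": {},
}


def simplify(tokens, query, candidates, relation_verbs):
    """Reduce a tokenized sentence to the coarse tags the automaton reads.

    `candidates` and `relation_verbs` are sets of lowercase strings.
    """
    tags = []
    for token in tokens:
        word = token.lower()
        if word == query.lower():
            tags.append(QUERY)
        elif word in candidates:
            tags.append(ENTITY)
        elif word in relation_verbs:
            tags.append(VERB)
        else:
            tags.append(OTHER)
    return tags


def matches_relation(tags):
    """Return True if a relation verb links the query gene and a candidate."""
    state = "start"
    for tag in tags:
        state = TRANSITIONS[state].get(tag, state)
    return state == "accept"


# Made-up sentence and gene names, for illustration only.
sentence = "GENEA activates GENEB in tumor cells".split()
tags = simplify(sentence, "GENEA", {"geneb"}, {"activates"})
print(matches_relation(tags))  # True
```

      A table-driven automaton keeps the sentence-pattern rules as data, so patterns could be added or changed without touching the matching loop. A real system would need a much richer tag set and rule base than this four-tag example.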

    Chapter 1 Introduction
    1.1 Overview
    1.2 Research Motivation
    1.3 Approach
    1.4 Thesis Organization
    Chapter 2 Related Work
    2.1 Bioinformatics
    2.1.1 HPRD Human Protein Database
    2.1.2 GeneCard Database
    2.1.3 Suiseki System: Gene Interaction Extraction System
    2.1.4 PubGene System: Value-Added Gene Literature Database
    2.2 Data Mining and Association Rules
    2.3 Finite State Machines
    Chapter 3 System Design for Automatically Summarizing Gene-Related Information
    3.1 System Overview
    3.1.1 System Concept Diagram
    3.1.2 System Architecture Diagram
    3.2 Data Mining and Extraction of Candidate Related Items
    3.3 Structuring Sentences with Natural Language Processing Techniques
    3.4 Extracting Summary Sentences with Finite State Machines
    3.5 Exception Handling for Function-Related Information
    3.5.1 Variant Forms of Function Names
    3.5.2 Expansion of Function Names
    Chapter 4 Experimental Design and Result Analysis
    4.1 Experimental Data Set and Document Preprocessing
    4.1.1 Data Source
    4.1.2 Document Format
    4.1.3 Data Preprocessing
    4.2 Experimental Results
    4.3 Comparison with Related Research Results
    4.4 Comparison with Database Annotation Information
    4.5 Extended Applications of the System
    Chapter 5 Conclusion and Future Work
    5.1 Conclusion
    5.2 Future Work
    References
    Appendix A


    Full text availability, on campus: public from 2005-07-30
    Full text availability, off campus: public from 2005-07-30