
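Graduate student: 楊曜瑋 (Yang, Yao-wei)
Thesis title: 利用文字探勘技術擷取出蛋白質間交互作用反應 (Using text mining to extract protein-protein interaction)
Advisor: 王惠嘉 (Wang, Hei-Chia)
Degree: Master
Department: College of Management - Institute of Information Management
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar, i.e. 2006-2007)
Language: Chinese
Number of pages: 65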
Keywords: biological pathway, protein-protein interaction, machine learning, natural language processing, text mining
    In biology, understanding the function of a gene often starts with the protein-protein interactions of the proteins associated with that gene. Once this information is available, the related biological pathways and other relevant information can be studied further. Protein-protein interaction research has therefore attracted wide discussion and interest among biologists. However, most protein-protein interaction data exist only as statements in the scientific literature. If information technology can process this literature and extract that knowledge, biologists can save considerable time and effort when facing such a large body of publications.
    This thesis therefore proposes a text-mining approach that combines natural language processing and machine learning. First, a protein-name dictionary and a list of interaction key terms are obtained and built. Then BPS (Bio Proteins interaction System), which is built on a semi-supervised machine learning method, is used to discover the protein-protein interactions mentioned in the literature, so that biologists can analyze them further.
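The first step described in the abstract, tagging protein names from a dictionary and looking for interaction key terms between them, can be illustrated with a minimal sketch. This is not the BPS algorithm from the thesis: it is a plain co-occurrence baseline, and the dictionary entries, key terms, and the function name `extract_candidates` are hypothetical placeholders chosen only for illustration.

```python
import re
from itertools import combinations

# Hypothetical toy resources; the thesis builds its own protein dictionary
# and interaction key-term list from other sources.
PROTEIN_DICTIONARY = {"RAD51", "BRCA2", "p53", "MDM2"}
INTERACTION_KEYWORDS = {"binds", "interacts", "phosphorylates", "inhibits"}

# Tokens are runs of letters, digits, and hyphens.
TOKEN_RE = re.compile(r"[A-Za-z0-9\-]+")


def extract_candidates(sentence):
    """Return (protein_a, keyword, protein_b) triples for one sentence.

    A triple is emitted when an interaction key term appears between two
    distinct dictionary proteins. This is a simple baseline, not the
    semi-supervised pattern learning described in the thesis.
    """
    tokens = TOKEN_RE.findall(sentence)
    protein_positions = [
        (i, tok) for i, tok in enumerate(tokens) if tok in PROTEIN_DICTIONARY
    ]
    triples = []
    for (i, a), (j, b) in combinations(protein_positions, 2):
        if a == b:
            continue  # skip repeated mentions of the same protein
        for k in range(i + 1, j):
            keyword = tokens[k].lower()
            if keyword in INTERACTION_KEYWORDS:
                triples.append((a, keyword, b))
    return triples


if __name__ == "__main__":
    text = "BRCA2 binds RAD51 in vivo. MDM2 inhibits p53. p53 is a tumor suppressor."
    # Naive sentence split on end punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        print(sentence, "->", extract_candidates(sentence))
```

Candidate sentences found this way could serve as input to a pattern-learning stage; exact dictionary matching is used here only to keep the sketch short, and real protein names would need more careful recognition.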

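    Chinese Abstract
    ABSTRACT
    Acknowledgments
    List of Tables
    List of Figures
    Chapter 1 Introduction
        Section 1 Research Background and Motivation
        Section 2 Research Objectives
        Section 3 Research Scope and Limitations
        Section 4 Thesis Outline
    Chapter 2 Literature Review
        Section 1 Bioinformatics Resources
            2.1.1 NCBI
            2.1.2 Protein-Protein Interaction
            2.1.3 Biomedical Dictionaries
                2.1.3.1 MeSH
                2.1.3.2 UMLS
        Section 2 Text Mining
            2.2.1 Natural Language Processing
            2.2.2 Information Retrieval
            2.2.3 Named Entity Recognition
        Section 3 Machine Learning
        Section 4 Related Work on Protein-Protein Interaction Information Extraction Systems
            2.4.1 Parsing-Based Approaches
                2.4.1.1 Full Parsing
                2.4.1.2 Partial Parsing
            2.4.2 Pattern Matching
        Section 5 Chapter Summary
    Chapter 3 Research Method
        Section 1 Research Framework
        Section 2 Literature Acquisition and Processing Module
            3.2.1 Literature Preprocessing
        Section 3 Entity Name Tagging Module
            3.3.1 Protein Name Extraction and Recognizer
            3.3.2 Protein Dictionary Construction
            3.3.3 Interaction Keyword List
                3.3.3.1 Purpose and Acquisition of the Keyword List
        Section 4 BPS Module
            3.4.1 System Algorithm
            3.4.2 How the Algorithm Works
                3.4.2.1 Clustering of Occurrence Sentences
    Chapter 4 System Implementation and Validation
        Section 1 System Implementation
            4.1.1 Implementation Environment
            4.1.2 Packages and Modules Used
            4.1.3 System Processing Flow
        Section 2 Experimental Method and Evaluation Items
            4.2.1 Parameter Settings
            4.2.2 Data Sources
                4.2.2.1 BioCreAtIvE-PPI
            4.2.3 Experimental Design and Evaluation Items
        Section 3 Experimental Results and Analysis
        Section 4 Summary
    Chapter 5 Conclusions and Future Research Directions
        Section 1 Research Results and Contributions
        Section 2 Future Research Directions
    References
    Appendix 1: Generated Patterns
    Appendix 2: Extracted Sentences
    Appendix 3: Clustering Program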


    Full-text availability: on campus, public from 2010-07-23
    Off campus, public from 2010-07-23