| Graduate student: | 楊曜瑋 Yang, Yao-wei |
|---|---|
| Thesis title: | 利用文字探勘技術擷取出蛋白質間交互作用反應 Using text mining to extract protein-protein interaction |
| Advisor: | 王惠嘉 Wang, Hei-Chia |
| Degree: | Master |
| Department: | College of Management, Institute of Information Management |
| Year of publication: | 2007 |
| Academic year of graduation: | 95 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 65 |
| Keywords (Chinese): | 生化代謝途徑, 蛋白質間交互作用反應, 機器學習, 自然語言處理, 文字探勘 |
| Keywords (English): | protein-protein interaction, machine learning, biological pathway, text mining, natural language processing |
In biology, a common first step toward understanding a gene's function is to examine the protein-protein interactions associated with that gene. With this information in hand, the relevant biological pathways and related information can then be studied further. Protein-protein interaction has therefore been widely discussed and studied by biologists. However, most protein-protein interaction data exist only in the scientific literature. If this literature could be processed with information technology to extract that knowledge, biologists would save considerable time and effort when facing such a large body of publications.

This thesis therefore proposes a text-mining approach that combines natural language processing and machine learning. First, we build dictionaries of protein names and a list of key terms for interactions between proteins, drawing on terms proposed in earlier work. Then, we use the semi-supervised machine learning method underlying BPS (Bio Proteins interaction System) to find the protein-protein interactions mentioned in the literature, so that biologists can analyze them further.
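The first step described above, matching sentences against a protein dictionary and an interaction key-term list, can be illustrated with a minimal sketch. It is not the BPS implementation and does not cover the semi-supervised pattern learning. The dictionary entries, the key terms, and the function names `split_sentences` and `extract_candidates` are hypothetical choices made only for this example.

```python
import re
from itertools import combinations

# Hypothetical toy resources. The thesis builds its protein dictionary and
# key-term list from real sources; these entries are placeholders.
PROTEIN_DICT = {"RAD51", "BRCA2", "TP53", "MDM2"}
INTERACTION_KEYWORDS = {"binds", "interacts", "phosphorylates", "inhibits"}

# A token is a run of letters, digits, or hyphens (assumed tokenization).
TOKEN_RE = re.compile(r"[A-Za-z0-9\-]+")


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_candidates(text):
    """Return (protein_a, key_term, protein_b) triples.

    A triple is emitted when a key term appears between two dictionary
    proteins inside the same sentence. These are candidates only; no
    learning or filtering is applied.
    """
    triples = []
    for sentence in split_sentences(text):
        tokens = TOKEN_RE.findall(sentence)
        protein_pos = [i for i, tok in enumerate(tokens) if tok in PROTEIN_DICT]
        for i, j in combinations(protein_pos, 2):
            for k in range(i + 1, j):
                if tokens[k].lower() in INTERACTION_KEYWORDS:
                    triples.append((tokens[i], tokens[k].lower(), tokens[j]))
    return triples


if __name__ == "__main__":
    sample = "BRCA2 binds RAD51 in the nucleus. MDM2 inhibits TP53."
    for triple in extract_candidates(sample):
        print(triple)
```

Exact dictionary lookup like this misses spelling variants and synonyms of protein names, which is one reason a learned, pattern-based stage is usually placed after it.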