
Author: 何承威
Ho, Cheng-Wei
Thesis title: 以樣板品質改善自動學習之資訊擷取方法
A Method of Information Extraction Using Pattern Quality to Improve Automatic Learning
Advisor: 王惠嘉
Wang, Hei-Chia
Degree: Master
Department: College of Management - Institute of Information Management
Year of publication: 2011
Academic year of graduation: 99 (ROC calendar)
Language: Chinese
Number of pages: 62
Keywords (Chinese): text mining, pattern ranking, bootstrapping, transcription factor, target gene
Keywords (English): Text mining, Bootstrapping, Transcription factor, Target gene
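  • Human gene expression is controlled mainly at the transcription step, and transcriptional regulation is a critically important initial step. In essence it consists of protein-DNA and protein-protein interactions. The mutual regulation between a transcription factor (TF) and a target gene (TGene) determines the final expression of the gene. Much related research has already been recorded in the biomedical literature and stored in databases. However, because the literature is growing at a remarkable rate, it is increasingly difficult for biologists to obtain the information they need from it, and they must spend a great deal of time and manpower filtering data. Finding the regulatory relationships between TFs and TGenes is an important problem, so we use information technology to help biomedical researchers find useful information in this large body of literature.
    To extract the needed information from the biomedical literature, many researchers have proposed methods that improve on ineffective document search, but these methods still have potential drawbacks. Manually generated patterns, for example, yield high precision but low recall, whereas relying on statistics of common biomedical keywords yields low precision but high recall. Among the many pattern-learning methods now available, bootstrapping is one that can learn automatically: by repeatedly expanding what it has learned, it removes the need for the manual inspection required in the past. However, because the new patterns learned by current methods are not evaluated, the results finally returned do not necessarily represent an interaction or relationship between the two entities.
      To make the sentences matched by automatically generated bootstrapping patterns better represent the interaction or relationship between two entities, this study adds an evaluation of pattern quality after bootstrapping generates new patterns. The aim is to ensure that patterns do not lose their significance as they are expanded over multiple iterations. Finally, pattern matching is used to extract the regulatory relationships between TFs and TGenes. We hope this study lets biomedical researchers obtain the information they need quickly and accurately, saving manpower and time.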

    Human gene expression is controlled mainly at the transcription step, and transcriptional regulation is a very important initial step. It is based on protein-DNA and protein-protein interactions. The regulation between a transcription factor (TF) and its target gene (TGene) determines the final gene expression. Many related studies have been recorded in the biomedical literature and stored in databases. Biologists must spend a lot of time obtaining useful information from this large body of literature. Finding the regulatory relationships between TFs and TGenes automatically is therefore an important issue.
    To extract information from the biomedical literature, many scholars have proposed ways to improve search quality. One way to extract gene regulation information is pattern learning. Among the many pattern-learning methods, bootstrapping learns by automatic expansion and removes the need for manual inspection. However, existing methods may not extract precise regulatory relationships between the two entities, because the bootstrapping process can learn poor-quality patterns.
    To avoid this problem, this thesis proposes a pattern evaluation process that filters out inappropriate patterns. The aim is to find patterns, learned by bootstrapping, that are representative of regulatory relationships. High-ranked patterns are then matched against sentences to find the regulatory relationships between TFs and TGenes. In the evaluation, the proposed method finds more precise patterns, so biologists spend less labor and time collecting useful information.
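
    The evaluation step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's scoring formula, so the score used here (the fraction of a pattern's matches that are already-known TF/TGene pairs) is an assumption, as are the names `Sentence`, `bootstrap`, `min_score`, and the toy sentences. Patterns are reduced to the literal text between the two entity mentions, which is likely simpler than the patterns the thesis learns.

```python
from collections import namedtuple

# Hypothetical toy input: each sentence is already reduced to a tagged TF,
# the text between the two entities, and a tagged target gene.
Sentence = namedtuple("Sentence", ["tf", "context", "tgene"])

SENTENCES = [
    Sentence("p53", "activates", "p21"),
    Sentence("NF-kB", "activates", "IL6"),
    Sentence("MYC", "activates", "CCND1"),
    Sentence("p53", "was studied alongside", "p21"),
    Sentence("GATA1", "was studied alongside", "BRCA1"),
    Sentence("SP1", "was studied alongside", "TP73"),
    Sentence("E2F1", "was studied alongside", "ACTB"),
    Sentence("MYC", "binds the promoter of", "CCND1"),
    Sentence("STAT3", "binds the promoter of", "BCL2"),
]
SEEDS = {("p53", "p21"), ("NF-kB", "IL6")}


def bootstrap(sentences, seed_pairs, rounds=3, min_score=0.5):
    """Bootstrapping with a pattern-quality filter (illustrative only)."""
    known = set(seed_pairs)
    accepted = {}  # pattern -> score from the round it was last evaluated
    for _ in range(rounds):
        # 1. Candidate patterns: contexts in which a known pair occurs.
        candidates = {s.context for s in sentences if (s.tf, s.tgene) in known}
        # 2. Evaluate each candidate before it is allowed to extract anything.
        #    Assumed score: share of the pattern's matches that are known pairs.
        for pattern in candidates:
            matches = [s for s in sentences if s.context == pattern]
            positive = sum((s.tf, s.tgene) in known for s in matches)
            score = positive / len(matches)
            if score >= min_score:
                accepted[pattern] = score
        # 3. Only accepted patterns contribute new pairs for the next round.
        new_pairs = {(s.tf, s.tgene) for s in sentences if s.context in accepted}
        if new_pairs <= known:
            break  # nothing new was learned, so stop early
        known |= new_pairs
    return accepted, known


patterns, pairs = bootstrap(SENTENCES, SEEDS)
print(sorted(patterns))
print(sorted(pairs - SEEDS))
```

    In this toy run the noisy context "was studied alongside" co-occurs with one seed pair but is rejected by the filter, so the unrelated pairs it matches never enter the known set. Without step 2, every context that touches a seed pair would be trusted, which is the drift problem the abstract attributes to unevaluated patterns.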

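    Abstract
    1. Introduction
    1.1. Research Background
    1.2. Research Motivation and Objectives
    1.3. Research Scope and Limitations
    1.4. Research Process
    1.5. Thesis Outline
    2. Literature Review
    2.1. Digital Bioinformatics Resources
    2.1.1. PubMed (Literature Resource)
    2.1.2. Sequence Search Systems
    2.1.3. HUGO
    2.2. Text Mining
    2.2.1. Natural Language Processing
    2.2.1.1. Word Segmentation
    2.2.1.2. Part-of-Speech Tagging
    2.2.1.3. Stemming
    2.3. Machine Learning
    2.3.1. Supervised Machine Learning
    2.3.2. Unsupervised Machine Learning
    2.3.3. Semi-Supervised Machine Learning
    2.4. Bootstrapping
    2.5. Pattern Ranking
    2.5.1. Document-Based
    2.5.2. Similarity-Based
    2.6. Related Work
    2.7. Summary
    3. Research Method
    3.1. Research Framework
    3.2. Preprocessing Stage
    3.2.1. PubMed Query
    3.2.2. Sentence Preprocessing
    3.3. Initial Training Stage
    3.3.1. Training of Correct Patterns
    3.4. Pattern Ranking Stage
    3.4.1. Pattern Score
    3.4.2. Sentence-Pattern Relevance Score
    3.4.3. Pattern Ranking
    3.5. Bootstrapping Stage
    3.5.1. Sentence Filtering
    3.5.2. Finding New Tuples
    3.5.3. Bootstrapping
    3.6. Pattern Matching Stage
    4. System Implementation and Validation
    4.1. System Implementation
    4.1.1. Experimental Environment
    4.1.2. Packages and Modules Used
    4.1.3. System Processing Flow
    4.2. Experimental Method
    4.2.1. Data Sources
    4.2.2. Baselines for Comparison
    4.2.3. Evaluation Metrics
    4.3. Experimental Results and Analysis
    4.3.1. Experiment 1: Initial Dataset Selected by the System
    Experiment 2: Only Manually Labeled Correct Sentences from a Given Number of Articles as the Initial Dataset
    4.3.2. Experiment 3: Different Numbers of Manually Labeled Correct Sentences as the Initial Dataset
    4.4. System Screenshots
    5. Conclusions and Future Research Directions
    5.1. Research Results
    5.2. Future Research Directions
    References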


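    On campus: open access from 2016-07-19
    Off campus: not available
    The electronic thesis has not been authorized for public release; for the print copy, please check the library catalog.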