
Graduate Student: Ngo, Gia-Han (吳嘉欣)
Thesis Title: Leverage Large Language Models for Implicit Constrained Data Augmentation with Semantic Enrichment in Biomedical Relation Extraction
Chinese Title: 基於大型語言模型的隱式約束資料增強與語義擴展在生物醫學關係萃取之應用
Advisor: Chiang, Jung-Hsien (蔣榮先)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2024
Graduation Academic Year: 112
Language: English
Number of Pages: 70
Keywords (Chinese): 數據增強、生物醫學關係萃取、隱含約束資料、大型語言模型
Keywords (English): Data Augmentation, Biomedical Relation Extraction, Implicit Constrained Data, Large Language Models
Access Count: Views: 38; Downloads: 1
    In the field of artificial intelligence, the quality and quantity of training data have a profound impact on model performance. This is especially true in domains such as biomedical text mining and code completion, which often involve information bound by specific rules or restrictions that must be strictly followed to preserve its integrity and intended meaning. Data augmentation is essential for training robust AI models; however, owing to the complexity of such data and the precision required in annotation, both traditional methods and existing approaches that generate training data with large language models face severe challenges in building high-quality datasets for these tasks.
    To address these challenges, we introduce a new method designed specifically for datasets governed by particular rules or constraints. The method leverages large language models to handle the complexity of constrained data, ensuring semantic fidelity while adhering to the specified rules. In addition, we establish a self-evaluation mechanism based on the large language model to assess the quality of the generated data and to filter out low-quality training examples caused by noise.
    We conducted a comprehensive evaluation of the method across multiple domains, including biomedical relation extraction, code completion, and information retrieval. The experimental results show that the method significantly improves model performance on a wide range of natural language processing tasks. In particular, it achieves gains of up to 8.8% and 7.9% on biomedical relation extraction and code completion, respectively. Even on information retrieval, where the improvement is a more modest 1.8%, the method still demonstrates its flexibility and effectiveness across domains and contributes to overall task improvement. These results highlight the potential of our approach to advance natural language processing applications, especially in complex areas such as biomedical text mining and software development, by improving model performance and supporting broader research and application progress.

    In the field of artificial intelligence, the quality and quantity of training data significantly impact the performance of models, particularly in complex domains like biomedical text mining and code completion, which often involve implicit constrained data. Implicit constrained data denotes information that is bound by specific rules or limitations, necessitating strict adherence to preserve its integrity and intended significance. Data augmentation is crucial for training robust AI models, but both traditional methods and recent approaches using large language models face challenges in creating high-quality datasets for these constrained tasks due to the intricate nature of the data and the necessity for precise annotation.
    To address these challenges, we introduce CAS (Constrained Augmentation and Semantic-Quality), a novel approach tailored for constrained datasets. CAS employs large language models to navigate the complexities of constrained data, ensuring semantic fidelity while adhering to specified rules. Notably, CAS integrates a self-evaluation mechanism to assess the quality of generated data, filtering out noise and low-quality instances.
    CAS underwent comprehensive evaluation across multiple domains including biomedical relation extraction, code completion, and information retrieval tasks. Our empirical findings illustrate CAS's efficacy in significantly improving model performance across diverse natural language processing (NLP) tasks. Notably, CAS achieved substantial enhancements of up to 8.8% and 7.9% in biomedical relation extraction and code completion tasks, respectively. Even in information retrieval, where gains were more modest at 1.8%, our approach demonstrates its versatility and effectiveness across various domains, contributing to overall task enhancement. These results highlight CAS's potential to advance NLP applications, particularly in complex domains such as biomedical text mining and software development, by enhancing performance and supporting broader research and application advancements.
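
    The abstract describes a two-stage pipeline: a large language model first generates augmented examples that must respect the constraints of the source data, and the same model then scores each candidate so that noisy, low-quality generations can be filtered out before training. The following is a minimal Python sketch of that loop for the relation-extraction case; the Example record, the generic llm callable, the helper names, and the 0.8 acceptance threshold are illustrative assumptions, not the interface used in the thesis.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    text: str            # source sentence or passage
    entities: List[str]  # constrained spans that must survive augmentation
    relation: str        # gold relation label, e.g. "chemical-induces-disease"

def augment(example: Example, llm: Callable[[str], str]) -> str:
    """Ask the model to paraphrase while keeping every constrained span intact."""
    prompt = (
        "Paraphrase the sentence below. Keep every listed entity verbatim and "
        "do not change the stated relation.\n"
        f"Entities: {', '.join(example.entities)}\n"
        f"Relation: {example.relation}\n"
        f"Sentence: {example.text}\n"
        "Paraphrase:"
    )
    return llm(prompt).strip()

def self_evaluate(original: Example, candidate: str, llm: Callable[[str], str]) -> float:
    """Have the model rate the semantic fidelity of the candidate on a 0-1 scale."""
    prompt = (
        "Rate from 0 to 1 how faithfully the candidate preserves the meaning, "
        "entities, and relation of the original sentence. Answer with a number only.\n"
        f"Original: {original.text}\n"
        f"Candidate: {candidate}\n"
        "Score:"
    )
    try:
        return float(llm(prompt).strip())
    except ValueError:
        return 0.0  # an unparsable score is treated as low quality

def build_augmented_set(data: List[Example], llm: Callable[[str], str],
                        threshold: float = 0.8) -> List[Example]:
    """Keep only candidates whose entities survive and whose quality score passes."""
    kept: List[Example] = []
    for ex in data:
        candidate = augment(ex, llm)
        entities_intact = all(entity in candidate for entity in ex.entities)
        if entities_intact and self_evaluate(ex, candidate, llm) >= threshold:
            kept.append(Example(candidate, ex.entities, ex.relation))
    return kept

    In practice, llm could wrap any chat or completion client, and the examples returned by build_augmented_set would simply be appended to the original training set.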

    Chinese Abstract  i
    Abstract  iii
    Acknowledgements  v
    Contents  vii
    List of Tables  x
    List of Figures  xi
    1 Introduction  1
      1.1 Background  1
      1.2 Motivation  2
      1.3 Research Objectives  3
      1.4 Thesis Organization  7
    2 Literature Review  8
      2.1 Data Augmentation  8
      2.2 Biomedical Relation Extraction  9
      2.3 Information Retrieval with Retrieval Augmented Generation (RAG)  13
      2.4 Large Language Models Applied in Data Augmentation  15
      2.5 In-Context Learning with Large Language Models  16
      2.6 Summary  16
    3 Development and Evaluation of Implicit Constrained Data Augmentation Framework  18
      3.1 Overall Flow  18
      3.2 Generate Implicit Constrained Data Augmentation  18
      3.3 Evaluate Constrained Data Augmentation Semantic Quality  19
    4 Low-Rank Adaptation for Fine-Tuning Large Language Models  22
      4.1 Introduction to Low-Rank Adaptation (LoRA)  22
      4.2 Fine-Tuning with LoRA Weights  23
      4.3 Integration of LoRA Weights into Base Models  24
      4.4 Summary  25
    5 Experiments  26
      5.1 Experimental Design  26
      5.2 Dataset  26
      5.3 Experimental Settings  28
        5.3.1 Generate and Evaluate ICData Augmentation  28
        5.3.2 Biomedical Relation Extraction Task  28
        5.3.3 Code Completion Task  28
        5.3.4 Mathematics Reasoning Task  29
        5.3.5 Information Retrieval Task  29
      5.4 Evaluation Metrics  29
    6 Results and Discussion  32
      6.1 Generated ICData Augmentation  32
      6.2 Results of Biomedical Relation Extraction Task  36
      6.3 Results of Code Completion Task  38
      6.4 Results of Mathematics Reasoning Task  39
      6.5 Results of Information Retrieval Task  40
    7 Conclusions and Future Work  42
      7.1 Conclusion  42
      7.2 Future Work  45
    References  47


    Full-Text Access: On campus: available immediately; Off campus: available immediately