
Author: Chen, Che-Wei (陳哲緯)
Thesis Title: Breaking Boundaries in Retrieval Systems: Unsupervised Domain Adaptation with Denoise-Finetuning
(打破檢索系統的界線:使用降噪微調進行無監督領域適應)
Advisor: Kao, Hung-Yu (高宏宇)
Degree: Master
Department: Institute of Medical Informatics, College of Electrical Engineering and Computer Science
Publication Year: 2023
Graduation Academic Year: 111 (ROC calendar)
Language: English
Number of Pages: 51
Keywords: Information Retrieval, Domain Adaptation, Natural Language Processing
Abstract:

Dense retrieval models have exhibited remarkable effectiveness, but they rely on abundant labeled data and face challenges when applied to different domains. Previous domain adaptation methods have employed generative models to generate pseudo queries, creating pseudo datasets to enhance the performance of dense retrieval models. However, these approaches typically use an unadapted rerank model for labeling, leading to potentially imprecise labels. In this paper, we demonstrate the importance of adapting the rerank model to the target domain before using it for label generation. This adaptation step yields more accurate labels and thereby improves the overall performance of the dense retrieval model. Additionally, by combining the adapted retrieval model with the adapted rerank model, we achieve significantly better domain adaptation results across three retrieval datasets.
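
As a rough illustration of the pipeline the abstract describes, the sketch below strings together pseudo-query generation and rerank-model labeling. It assumes the GPL-style tooling of the cited prior work (Hugging Face transformers plus sentence-transformers); the checkpoint names, sampling parameters, and the margin_label helper are illustrative placeholders, not the thesis's actual configuration. In the thesis's method, the cross-encoder would first be adapted to the target domain (via denoise-finetuning) before it produces any labels.

```python
# Minimal sketch of the pseudo-labeling pipeline outlined in the abstract:
# generate pseudo queries for target-domain passages, then label
# (query, positive, negative) triples with a rerank model. Checkpoints
# below are assumed public placeholders, not the thesis's exact setup.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from sentence_transformers import CrossEncoder

# Step 1: doc2query-style pseudo-query generation.
qgen_name = "BeIR/query-gen-msmarco-t5-base-v1"  # assumed checkpoint
tokenizer = T5Tokenizer.from_pretrained(qgen_name)
qgen = T5ForConditionalGeneration.from_pretrained(qgen_name)

def generate_queries(passage: str, n: int = 3) -> list[str]:
    """Sample n pseudo queries for one target-domain passage."""
    inputs = tokenizer(passage, return_tensors="pt",
                       truncation=True, max_length=350)
    with torch.no_grad():
        outputs = qgen.generate(
            **inputs, do_sample=True, top_p=0.95,
            num_return_sequences=n, max_length=64,
        )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Step 2: label triples with the rerank model. The thesis's key point is
# that this cross-encoder should already be domain-adapted at this stage.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder

def margin_label(query: str, pos: str, neg: str) -> float:
    """Score margin between positive and negative passage; serves as the
    regression target when distilling into the bi-encoder."""
    s_pos, s_neg = reranker.predict([(query, pos), (query, neg)])
    return float(s_pos - s_neg)
```

The resulting margins would then serve as targets for training the bi-encoder with Margin-MSE, the distillation objective listed in Section 2.4 of the table of contents below.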

Table of Contents:

Abstract (Chinese) i
Abstract (English) ii
Acknowledgements iii
Table of Contents iv
List of Tables vii
List of Figures ix
Chapter 1. Introduction 1
1.1 Information Retrieval 1
1.2 Motivation 2
1.3 Our Works 3
Chapter 2. Related Work 5
2.1 Domain Adaptation 5
2.1.1 Domain Adaptation in Information Retrieval 5
2.2 Two-Stage Retrieval 6
2.2.1 BERT 8
2.2.2 Bi-Encoder 9
2.2.3 Cross-Encoder 10
2.3 Query Generation 11
2.3.1 T5 11
2.3.2 T5 Question Generator 11
2.4 Margin-MSE 12
2.5 BM25 13
Chapter 3. Methodology 16
3.1 Architecture 16
3.2 Generation of Pseudo Training Dataset 18
3.3 Cross-Encoder (Rerank Model) Adaptation 18
3.4 Denoise-Finetuning 21
3.5 Bi-Encoder (Retrieval Model) Adaptation 24
Chapter 4. Experiment 25
4.1 Dataset 25
4.1.1 BEIR Benchmark 26
4.1.2 Experiment Dataset 26
4.2 Baseline 26
4.2.1 Zero-Shot Sparse Retrieval Model 26
4.2.2 Zero-Shot Dense Retrieval Model 27
4.2.3 Pre-Training for Domain Adaptation 27
4.2.4 Query Generation Domain Adaptation 27
4.3 Hyperparameters 27
4.3.1 Query Generation 27
4.3.2 Cross-Encoder (Rerank Model) Adaptation 28
4.3.3 Denoise-Finetuning 28
4.3.4 Bi-Encoder (Retrieval Model) Adaptation 28
4.4 Implementation 28
4.5 Overall Performance 29
4.5.1 Recall Performance 31
Chapter 5. Analysis 33
5.1 Impact of Denoise-Finetuning 33
5.2 Random Seed 35
5.3 Model Size 36
5.4 Sequence Length 37
5.5 Impact of Teacher Models in Distillation 38
5.6 Noisy Dataset Simulation 40
5.7 The Impact of γ 41
5.8 Margin from Different Teacher Models 43
5.9 Cross-Domain Zero-Shot Performance 43
5.10 General Dense Retrieval Model 44
Chapter 6. Discussion and Conclusion 46
References 47

