
Author: 陳宥丞 (Chen, Yu-Cheng)
Thesis Title: A Wikipedia-based Retrieval-Interleaved Generation Framework: Enhancing the Credibility of Generative Model Responses (基於維基百科的檢索交錯生成框架:提升生成式模型回應可信度的方法)
Advisor: 陳牧言 (Chen, Mu-Yen)
Degree: Master
Department: Department of Engineering Science, College of Engineering
Year of Publication: 2025
Graduation Academic Year: 113
Language: Chinese
Number of Pages: 69
Chinese Keywords: 檢索交錯生成 (Retrieval-Interleaved Generation), 大型語言模型 (Large Language Model), 維基百科應用 (Application of Wikipedia), 開放領域問答 (Open-Domain Question Answering), 知識增強生成 (Knowledge-Augmented Generation)
English Keywords: Retrieval-Interleaved Generation, Large Language Model, Application of Wikipedia in NLP, Open-Domain Question Answering, Knowledge-Augmented Generation
    This study proposes and implements a Retrieval-Interleaved Generation (RIG) architecture that combines Wikipedia knowledge retrieval with a large language model (LLM). It aims to improve the credibility and verifiability of open-domain question-answering content and to effectively suppress hallucinations that arise during language-model generation. In contrast to LLMs that rely purely on their training data, RIG dynamically introduces up-to-date external knowledge during generation: through a multi-stage process of initial generation, semantic retrieval, knowledge extraction, and answer integration, combined with parallel processing and prompt-engineering design, it provides a reproducible and cost-effective knowledge-enhancement mechanism.
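
    The prompt-engineering design mentioned above refers to the CO-STAR framework covered in Chapter 2 (Context, Objective, Style, Tone, Audience, Response format). The Python sketch below shows how such a structured prompt might be assembled for the initial-generation stage; the helper name build_costar_prompt and the field wording are illustrative assumptions, not the prompts actually used in the thesis.

# Hypothetical sketch of a CO-STAR-structured prompt (Context, Objective,
# Style, Tone, Audience, Response format); the wording is illustrative only.
def build_costar_prompt(question: str) -> str:
    return "\n".join([
        "# CONTEXT #",
        "You are answering an open-domain factual question. External evidence "
        "retrieved from Wikipedia may be appended in a later step.",
        "# OBJECTIVE #",
        f"Answer the question: {question}",
        "# STYLE #",
        "Concise and factual.",
        "# TONE #",
        "Neutral and informative.",
        "# AUDIENCE #",
        "General readers seeking verifiable information.",
        "# RESPONSE #",
        "A short answer followed by the key entities that should be verified.",
    ])

print(build_costar_prompt("Who wrote the novel Dune?"))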

    Using the WikiQA dataset as a basis, this study constructed two comparison test sets (with 2015 and 2025 reference answers) and, through manual annotation and semantic evaluation, comprehensively analyzed the performance differences between a pure LLM and RIG on multiple evaluation metrics, including lexical matching, semantic similarity, answer-structure consistency, and source transparency. The experimental results show that, for time-sensitive questions or questions demanding high credibility, the RIG architecture effectively reduces the proportion of erroneous responses and improves answer traceability and semantic correctness; although integrating the external retrieval module increases response latency, overall accuracy and practicality are significantly better than those of the pure LLM.
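
    Two of the lexical-matching measures referred to above, Jaccard similarity and the token-level F1 score (both listed in Section 4.3), can be computed roughly as in the sketch below; the naive whitespace tokenization and lower-casing are assumptions, and the thesis's exact normalization may differ.

# Minimal sketch of two lexical-overlap metrics used to compare answers.
# Tokenization is naive whitespace splitting; the thesis may normalize differently.
def jaccard_similarity(pred: str, ref: str) -> float:
    p, r = set(pred.lower().split()), set(ref.lower().split())
    return len(p & r) / len(p | r) if (p | r) else 0.0

def token_f1(pred: str, ref: str) -> float:
    p, r = pred.lower().split(), ref.lower().split()
    common = sum(min(p.count(t), r.count(t)) for t in set(p) & set(r))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)

print(jaccard_similarity("the capital of France is Paris", "Paris is the capital of France"))
print(token_f1("the capital of France is Paris", "Paris is the capital of France"))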

    In addition, the Wikipedia API, Google Custom Search JSON API, and OpenAI API adopted in this study are all openly accessible tools with high scalability and good integrability, which strengthens the system's usability and implementation feasibility. Through the design of the RIG framework, this study confirms that combining semantic retrieval with external knowledge enhancement can effectively mitigate the knowledge-staleness and black-box problems of language models, offering a feasible technical path for applying generative AI to knowledge-intensive tasks. Future work can extend to multilingual data processing, alignment with domain-specific knowledge, and more advanced semantic retrieval techniques, further broadening its application value and research depth.
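
    As a rough illustration of how these openly accessible services can be combined, the sketch below fetches a page summary through the wikipedia-api package, queries the Google Custom Search JSON API, and calls the OpenAI chat-completions endpoint; the gpt-4o-mini model choice, user-agent string, and environment-variable names are assumptions rather than the study's actual configuration.

# Sketch of the three openly accessible services used in the study; keys,
# model choice, and parameter values here are illustrative assumptions.
import os
import requests
import wikipediaapi          # pip install wikipedia-api
from openai import OpenAI    # pip install openai

def wikipedia_summary(title: str, lang: str = "en") -> str:
    wiki = wikipediaapi.Wikipedia(user_agent="rig-demo/0.1", language=lang)
    page = wiki.page(title)
    return page.summary if page.exists() else ""

def google_custom_search(query: str, num: int = 3) -> list[str]:
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": os.environ["GOOGLE_API_KEY"],
                "cx": os.environ["GOOGLE_CSE_ID"],
                "q": query, "num": num},
        timeout=10,
    )
    resp.raise_for_status()
    return [item["link"] for item in resp.json().get("items", [])]

def llm_answer(prompt: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    chat = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return chat.choices[0].message.content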

    This research proposes and implements a Retrieval-Interleaved Generation (RIG) framework that integrates Wikipedia as the core external knowledge source to enhance the factual accuracy and verifiability of responses generated by large language models (LLMs). In contrast to traditional LLMs that rely solely on static training data, RIG dynamically incorporates real-time information through API-based retrieval mechanisms. The architecture is divided into four key stages: initial response generation, external knowledge retrieval, evidence extraction and correction, and final answer synthesis. Each stage is optimized for efficiency using asynchronous processing and structured prompts.
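
    A minimal sketch of the four-stage flow described above is given below; every helper here is a placeholder standing in for the thesis's actual modules, and the asyncio-based orchestration only indicates how asynchronous processing might overlap the drafting and retrieval stages.

# Sketch of the four RIG stages; each helper is an illustrative placeholder,
# not the thesis's real implementation.
import asyncio

async def initial_generation(question: str) -> str:
    return f"Draft answer to: {question}"               # stage 1: LLM draft

async def retrieve_evidence(question: str) -> list[str]:
    return ["(Wikipedia passage)", "(search snippet)"]  # stage 2: external retrieval

def extract_and_correct(draft: str, passages: list[str]) -> str:
    # stage 3: keep only claims supported by the retrieved passages
    return draft

def synthesize(question: str, corrected: str, passages: list[str]) -> str:
    # stage 4: final answer with source tags appended for traceability
    return f"{corrected}\nSources: {'; '.join(passages)}"

async def rig_answer(question: str) -> str:
    # stages 1 and 2 run concurrently to limit the added latency
    draft, passages = await asyncio.gather(
        initial_generation(question), retrieve_evidence(question)
    )
    return synthesize(question, extract_and_correct(draft, passages), passages)

print(asyncio.run(rig_answer("When was Wikipedia launched?")))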

    To validate the effectiveness of the proposed method, this study conducted a series of experiments using a benchmark QA dataset (WikiQA), including evaluations of lexical precision, semantic similarity, structural alignment, and source traceability. A human-annotated validation set comprising 100 questions was created to assess factual alignment against updated knowledge (2025 Wikipedia and Britannica), revealing that RIG significantly outperforms standard LLMs in generating verifiable answers. The framework demonstrates notable improvements in transparency (up to 53% of answers with citations) and temporal relevance (e.g., a 57% vocabulary change rate over a 10-year period while retaining 89% semantic similarity).
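
    For the semantic-similarity part of the evaluation, the table of contents lists spaCy cosine similarity among the metrics; a minimal sketch using spaCy's document-vector similarity follows, with the en_core_web_md model choice being an assumption rather than the study's documented setup.

# Minimal sketch of vector-based semantic similarity between two answers using
# spaCy; the en_core_web_md model choice is an assumption, not the thesis's setup.
import spacy

# requires: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

def semantic_similarity(pred: str, ref: str) -> float:
    return nlp(pred).similarity(nlp(ref))  # cosine similarity of document vectors

print(semantic_similarity(
    "Wikipedia was launched in January 2001.",
    "The site went online in early 2001.",
))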

    Despite a slight increase in computational latency, RIG’s integration of low-cost, publicly accessible APIs (Google Search, Wikipedia, OpenAI) ensures a scalable and reproducible setup. The findings confirm that the RIG architecture effectively mitigates hallucination, enhances factual consistency, and supports trustworthy QA generation, especially in knowledge-intensive domains. This framework provides a practical solution aligned with the growing demand for transparency and reliability in AI-generated content.

    Abstract
    Acknowledgments
    Contents
    List of Tables
    List of Figures
    Chapter 1  Introduction
        1.1 Research Background and Motivation
        1.2 Research Objectives
        1.3 Thesis Organization
    Chapter 2  Literature Review
        2.1 Large Language Models (LLMs)
        2.2 Hallucination and Verifiability Problems of Language Models
        2.3 Definition and Challenges of Open-Domain Question Answering
        2.4 Retrieval-Augmented Generation (RAG) and Its Variants
        2.5 Retrieval-Interleaved Generation (RIG)
        2.6 Wikipedia as an Open Data Source in Question-Answering Systems
        2.7 CO-STAR Prompting
    Chapter 3  Research Methodology
        3.1 Research Goals
        3.2 Research Design
            3.2.1 Initial Response Generation
            3.2.2 External Knowledge Retrieval
            3.2.3 Answer Extraction and Semantic Convergence
            3.2.4 Final Response Integration
        3.3 Technical Implementation and Optimization Strategies
            3.3.1 Parallel Processing
            3.3.2 CO-STAR Prompt Engineering
            3.3.3 Integration and Optimization of Publicly Accessible APIs
            3.3.4 Use of Response Labels and User-Experience Optimization
        3.4 Summary of the Methodology
    Chapter 4  Empirical Analysis
        4.1 Experimental Goals
        4.2 Experimental Dataset
            4.2.1 Dataset Overview
            4.2.2 Dataset Structure and Content
            4.2.3 Relevance to This Study and Challenges
            4.2.4 Test-Set Preparation
        4.3 Evaluation Metrics
            4.3.1 Exact Match (EM)
            4.3.2 Precision (PRE)
            4.3.3 Recall (RC)
            4.3.4 F1 Score (F1)
            4.3.5 Partial Match (PM)
            4.3.6 METEOR
            4.3.7 ROUGE-L
            4.3.8 BERTScore
            4.3.9 Transparency
            4.3.10 Jaccard Similarity
            4.3.11 spaCy Cosine Similarity
        4.4 Experimental Setup
            4.4.1 Software Environment
            4.4.2 Hardware Environment
        4.5 Experimental Method and Result Analysis
            4.5.1 Description of the Experimental Method
            4.5.2 Experimental Results and Analysis
        4.6 Summary of Experiments
    Chapter 5  Conclusion
        5.1 Research Conclusions
        5.2 Research Contributions
        5.3 Future Directions
    References


    Available on campus: 2030-06-18
    Available off campus: 2030-06-18
    The electronic thesis has not yet been authorized for public release; for the print copy, please consult the library catalog.