| Graduate student: | 林沂蓁 LIN, Yi-Jhen | 
|---|---|
| Thesis title: | 分群演算法與TF-IDF應用於工業知識庫的檢索增強生成 Enhanced RAG with Clustering and TF-IDF for Domain-Specific Industrial Knowledge Retrieval | 
| Advisor: | 蕭宏章 Hsiao, Hung-Chang | 
| Degree: | Master | 
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering | 
| Year of publication: | 2025 | 
| Academic year of graduation: | 113 | 
| Language: | Chinese | 
| Number of pages: | 40 | 
| Keywords (Chinese): | RAG 、檢索增強生成 、分群演算法 、TF-IDF | 
| Keywords (English): | RAG, Retrieval-Augmented Generation, Clustering, TF-IDF | 
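With the development of Large Language Models (LLMs), a growing number of enterprises, motivated by information security and cost considerations, are choosing to build their own in-house language model systems to support day-to-day work and improve efficiency. Many problems in using LLMs remain unresolved, however. For example, outdated training data can cause a model to hallucinate when it answers.
This study investigates a systematic Retrieval-Augmented Generation (RAG) technique that lets an LLM draw on up-to-date information to improve answer accuracy. Most existing work, however, focuses on retrieval and generation over unstructured continuous text. For the tabular data that enterprises hold in large quantities, responding accurately to user queries when semantic information is scarce remains an open challenge.
To this end, we propose a hybrid retrieval architecture that combines a clustering algorithm with TF-IDF (Term Frequency-Inverse Document Frequency) within the RAG pipeline to find the tables the user expects, using clustering and TF-IDF to apply retrieval strategies at different granularities.
In addition, the pipeline exposes several parameters that can be adjusted to match the user's current objective. Simulated annealing helps users tune these parameters effectively when there are many of them, guiding the system toward better retrieval quality and generation accuracy.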
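The coarse-to-fine idea described above can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm is used or how the two stages are combined, so everything here is an assumption: a greedy "leader" clustering stands in for the clustering step, each table is represented only by a short text of its column names, and the table names, the `threshold` value, and the helper functions are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(text, idf):
    """Sparse TF-IDF vector; terms unseen in the corpus are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    total = sum(counts.values())
    return {t: c / total * idf[t] for t, c in counts.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    acc = {}
    for vec in vectors:
        for term, weight in vec.items():
            acc[term] = acc.get(term, 0.0) + weight / len(vectors)
    return acc


def leader_cluster(vectors, threshold):
    """Greedy clustering (an assumed stand-in): each table joins the first
    cluster whose centroid is similar enough, otherwise starts a new one."""
    clusters = []  # each cluster is a list of table names
    for name, vec in vectors.items():
        for members in clusters:
            centroid = mean_vector([vectors[m] for m in members])
            if cosine(vec, centroid) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def retrieve(query, vectors, clusters, idf, top_k=1):
    q = tfidf(query, idf)
    # Coarse step: choose the cluster whose centroid best matches the query.
    best = max(
        clusters,
        key=lambda ms: cosine(q, mean_vector([vectors[m] for m in ms])),
    )
    # Fine step: rank the tables inside that cluster by TF-IDF cosine similarity.
    ranked = sorted(
        ((m, cosine(q, vectors[m])) for m in best),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical tables, each described by its column names only.
TABLES = {
    "pump_maintenance": "pump id pump model maintenance technician",
    "pump_specs": "pump id pump model flow rate pressure",
    "valve_inventory": "valve id valve size material stock",
    "valve_orders": "valve id valve size supplier quantity",
}
IDF = build_idf(list(TABLES.values()))
VECTORS = {name: tfidf(text, IDF) for name, text in TABLES.items()}
CLUSTERS = leader_cluster(VECTORS, threshold=0.3)

print(CLUSTERS)
print(retrieve("pump flow rate", VECTORS, CLUSTERS, IDF))
```

On this toy data the pump tables and the valve tables fall into separate clusters, and the query is matched against the pump cluster only before `pump_specs` is ranked first. Restricting the fine-grained ranking to one cluster is one possible way to combine the two granularities, not necessarily the one used in the thesis.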
The increasing adoption of Large Language Models (LLMs) in enterprise environments has enabled significant gains in productivity and automation. However, LLMs often suffer from hallucination and outdated knowledge because they rely on static pretraining data. To address this, Retrieval-Augmented Generation (RAG) has emerged as a promising architecture that combines LLMs with dynamic retrieval modules. While most RAG applications focus on unstructured text, enterprise knowledge bases frequently contain large volumes of structured tables with limited semantic information.
This research addresses these issues by proposing a systematic Retrieval-Augmented Generation technique that improves the accuracy of LLMs using up-to-date information. While most existing RAG work focuses on unstructured continuous text, there is a gap in handling enterprise tabular data, especially when semantic information is insufficient for accurate query responses.
To overcome this, we propose a novel hybrid retrieval architecture that combines clustering algorithms and TF-IDF within the RAG pipeline to identify relevant tables.
The approach uses clustering algorithms and TF-IDF to apply retrieval strategies at different granularities. Furthermore, the system exposes adjustable parameters to meet user-specific objectives, and Simulated Annealing assists with parameter tuning in complex, multi-parameter scenarios, guiding the system toward better retrieval quality and generation accuracy.
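The parameter-tuning step can be sketched in the same spirit. The abstract names Simulated Annealing but gives neither the parameters nor the objective, so the search grid (`similarity_threshold`, `top_k`), the `toy_quality` function, and the cooling schedule below are all hypothetical placeholders. In a real setting the objective would presumably score retrieval quality on a set of labeled queries.

```python
import math
import random

# Hypothetical search space; parameter names and values are illustrative only.
GRID = {
    "similarity_threshold": [0.1, 0.2, 0.3, 0.4, 0.5],
    "top_k": [1, 2, 3, 5, 8],
}


def toy_quality(params):
    """Stand-in objective (higher is better), peaking at threshold 0.3, top_k 3."""
    return (
        -abs(params["similarity_threshold"] - 0.3)
        - 0.05 * abs(params["top_k"] - 3)
    )


def neighbor(params, rng):
    """Move one randomly chosen parameter to an adjacent grid value."""
    name = rng.choice(sorted(GRID))
    values = GRID[name]
    i = values.index(params[name])
    j = min(len(values) - 1, max(0, i + rng.choice([-1, 1])))
    return {**params, name: values[j]}


def anneal(score, start, steps=500, t0=1.0, cooling=0.99, seed=0):
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    temperature = t0
    for _ in range(steps):
        candidate = neighbor(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Improvements are always accepted; worse moves are accepted with
        # probability exp(delta / T), which shrinks as the temperature drops.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
        temperature *= cooling  # geometric cooling schedule
    return best, best_score


START = {"similarity_threshold": 0.1, "top_k": 8}
BEST, BEST_SCORE = anneal(toy_quality, START)
print(BEST, BEST_SCORE)
```

Because the best configuration seen so far is tracked separately from the current one, the returned score is never worse than the starting point, even though the search occasionally accepts worse moves to escape local optima.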