
Graduate Student: 王嘉臻 (Wang, Jia-Jen)
Thesis Title: 反思觸發:透過注入引導向量實現問答任務中的內部自我修正能力 (Reflection Trigger: Latent Self-Correction for Question Answering by Steering Vector Injection)
Advisor: 莊坤達 (Chuang, Kun-Ta)
Co-Advisor: 高宏宇 (Kao, Hung-Yu)
Degree: Master
Department: Institute of Medical Informatics, College of Electrical Engineering and Computer Science
Year of Publication: 2025
Graduating Academic Year: 113 (2024-2025)
Language: English
Number of Pages: 54
Chinese Keywords: 大型語言模型、反思推理、向量操控、隱藏層注入、醫學推理 (Large Language Models, Reflective Reasoning, Steering Vector, Hidden-Layer Injection, Medical Reasoning)
English Keywords: Large Language Models, Reflective Reasoning, Steering Vector, Activation Injection, Biomedical Reasoning, Commonsense Reasoning
  • The core question of this study is how to guide large language models (LLMs) toward stable reflective reasoning. Although many recent methods attempt to induce reflection through prompt engineering or multi-turn dialogue, such approaches are often accompanied by over-reflection, which leads to unstable answers; they also depend on manually designed prompts and add the cost of generating lengthy content.
    To address these challenges, this study proposes the Reflection Trigger, a model for guiding LLMs toward reflective reasoning. Unlike conventional approaches that rely on prompt engineering or parameter fine-tuning, we learn a set of latent semantic steering vectors with a BERT encoder and dynamically inject them into intermediate layers of the LLM at inference time to modulate its reasoning tendency, without altering the model's own parameters. Experiments cover two reasoning domains, medical reasoning and commonsense reasoning, using datasets such as MedQA, MMLU-Med, and CommonsenseQA. The results show that, without modifying model parameters, the Reflection Trigger achieves higher overall accuracy than the vanilla model and prompt-based methods, and even surpasses LoRA fine-tuning on most tasks, while significantly reducing over-reflection and verbose generation. Taken together, the experiments and analyses indicate that reflective reasoning can be viewed as a learnable and controllable semantic steering behavior: even without prompt design or model modification, the Reflection Trigger can reliably guide an LLM toward deeper and more controllable thinking.

    Our study focuses on a core question: how to guide large language models (LLMs) to perform stable reflective reasoning. Although recent methods attempt to elicit reflection through prompt engineering or multi-turn dialogue, these approaches often suffer from over-reflection, which leads to unstable answers; they also rely heavily on manually designed prompts and incur the cost of generating lengthy outputs.
    To address these challenges, we propose the Reflection Trigger, a vector-based mechanism for guiding reflective reasoning in LLMs. In contrast to methods that rely on prompt engineering or parameter fine-tuning, our approach uses steering vectors: a set of latent semantic vectors, learned by a BERT-based encoder, is dynamically injected into intermediate layers of the LLM during inference to modulate its reasoning tendency without modifying the model's parameters. The method is evaluated on two reasoning domains, biomedical reasoning and commonsense reasoning, using datasets such as MedQA, MMLU-Med, and CommonsenseQA. The results show that the Reflection Trigger improves overall accuracy over the vanilla model and prompt-based reflection methods, and in some cases outperforms LoRA fine-tuning, while substantially reducing over-reflection and output length. In summary, this study demonstrates that reflective reasoning can be treated as a learnable and controllable semantic behavior: without prompt engineering or model modification, the Reflection Trigger provides a stable and controllable mechanism for steering LLMs toward more reflective and consistent reasoning.
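    To make the injection step concrete, the following is a minimal sketch of the general steering-vector technique, not the thesis implementation: it adds a vector to the hidden states of one intermediate decoder layer of a Hugging Face causal LM via a PyTorch forward hook at inference time. The backbone name, the layer index INJECT_LAYER, the strength ALPHA, and the randomly initialized vector (a stand-in for the BERT-learned reflection vector) are all illustrative assumptions.

```python
# Minimal sketch: steering-vector injection into an intermediate decoder layer
# at inference time, with no change to model weights. All names below are
# illustrative assumptions, not the thesis's actual configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed backbone
INJECT_LAYER = 15                             # hypothetical intermediate layer
ALPHA = 4.0                                   # hypothetical injection strength

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Placeholder for the learned reflection vector of shape (hidden_size,);
# in the thesis this would come from a BERT-based encoder.
steering_vector = torch.randn(model.config.hidden_size)

def inject_hook(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states
    # of shape (batch, seq_len, hidden_size); add the steering vector to every
    # position and pass the rest of the tuple through unchanged.
    hidden = output[0]
    vec = steering_vector.to(device=hidden.device, dtype=hidden.dtype)
    return (hidden + ALPHA * vec,) + tuple(output[1:])

handle = model.model.layers[INJECT_LAYER].register_forward_hook(inject_hook)

prompt = "Question: ...\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore the unmodified model
```

    Because the hook only perturbs activations during the forward pass and is removed afterwards, the base model's parameters remain untouched, which matches the training-free inference setting described in the abstract.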

    Table of Contents:
    Abstract (Chinese) i / Abstract ii / Acknowledgements iii / Table of Contents iv / List of Tables vi / List of Figures vii
    Chapter 1. Introduction 1
      1.1. Background 1
      1.2. Motivation 2
      1.3. Our Work 4
    Chapter 2. Related Work 7
      2.1. Self-Reflection in Language Models 7
      2.2. Steering via Latent Representations 8
    Chapter 3. Methodology 11
      3.1. Overall Framework 11
      3.2. Reflection Trigger 12
        3.2.1. Training Data Construction 12
        3.2.2. Model Training 16
      3.3. LLM Reasoning & Activation Injection 17
    Chapter 4. Experiments 18
      4.1. Dataset 18
        4.1.1. MedQA 20
        4.1.2. MedMCQA 20
        4.1.3. MMLU-Med 20
        4.1.4. ARC Challenge 21
        4.1.5. CommonsenseQA 21
      4.2. Baseline 21
      4.3. Evaluation Metric 22
        4.3.1. Accuracy 22
        4.3.2. Reflection Rate 22
        4.3.3. Over-reflection Rate 23
      4.4. Main Results 23
    Chapter 5. Analysis 26
      5.1. Comparison with Prompt-based Methods 26
      5.2. Parameter Sensitivity 28
      5.3. Analysis of Training Data Efficiency 29
      5.4. Impact of Reflection Intensity on Task Difficulty 30
      5.5. Visualization of Reflective Representations 31
      5.6. Case Study 32
        5.6.1. Case Study 1: Controlled Reflection Prevents Overthinking 32
        5.6.2. Case Study 2: Successful Reflection Case 32
        5.6.3. Case Study 3: Over-reflection Case 32
        5.6.4. Case Study 4: Effect of Injection Layer on Reflective Reasoning Behavior 33
        5.6.5. Case Study 5: In-Domain vs. Cross-Domain Reflection Trigger 33
    Chapter 6. Conclusion 42
    References 43


    Full-text availability on campus: 2026-08-22
    Full-text availability off campus: 2026-08-22
    The electronic thesis has not yet been authorized for public release; please consult the library catalog for the print copy.