
Student: Tsai, Yuan-Qi (蔡沅錡)
Thesis Title: Towards Harmless and Human Value-Aligned Responses: A Logit Optimization Framework Integrating Rule-of-Thumb Agreement and Harmful Behavior Unlearning
Advisor: Wu, Chung-Hsien (吳宗憲)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Graduate Program of Artificial Intelligence
Year of Publication: 2025
Graduation Academic Year: 113
Language: English
Number of Pages: 76
Keywords: Moral Dialogue System, Safe Response, Human Value Alignment, LLM Logit Optimization, Harmful Behavior Unlearning
    As large language models (LLMs) become widespread across domains, the ethical challenges they raise have grown increasingly prominent. When faced with potentially harmful or inappropriate user inputs, existing models often struggle to generate responses that are both safe and constructive, and may even inadvertently condone or reinforce unethical statements. Even when Rules of Thumb (RoTs) are introduced to guide models toward human values, they still tend to produce overly passive or evasive replies (e.g., "I cannot provide guidance"). This "safe-but-useless" phenomenon highlights the shortcomings of existing alignment methods. Beyond ensuring safety, aligning human values more deeply with the model's generation mechanism is the key challenge for achieving meaningful human-machine interaction.

    To address this challenge, this thesis proposes a two-stage, decoding-time logit optimization framework that requires no additional fine-tuning and is designed to steer large language models toward responses that better conform to human moral norms. In the first stage, we introduce Unlearning with Logit Filtering (ULF). This mechanism uses an auxiliary model specialized in generating harmful content (ToxicLM) to identify and suppress the logits of high-confidence toxic tokens with surgical precision, thereby avoiding the collateral damage to benign tokens that conventional logit subtraction can cause. In the second stage, we propose RoT-Agreement Guided Sampling (RoTagree-GS), which explicitly models the semantic agreement between the generated response and the target RoT as a differentiable value function and uses its gradient to optimize directly in logit space, achieving value alignment that goes deeper than surface-level imitation.
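
    As a rough illustration of the first stage, the following is a minimal PyTorch sketch of the filtering idea. It assumes next-token logits from the base model and from a toxic assistant model are already available; the function name, confidence threshold, and penalty value are illustrative assumptions, not the hyperparameters reported in the thesis.

```python
import torch

def unlearn_with_logit_filtering(base_logits: torch.Tensor,
                                 toxic_logits: torch.Tensor,
                                 conf_threshold: float = 0.1,
                                 penalty: float = 5.0) -> torch.Tensor:
    """Suppress only the tokens to which the toxic assistant model (a
    ToxicLM-style model) assigns high probability, leaving the rest of the
    base distribution untouched. Threshold and penalty are illustrative.
    """
    toxic_probs = torch.softmax(toxic_logits, dim=-1)
    # Filtering step: mark only high-confidence toxic tokens.
    toxic_mask = toxic_probs > conf_threshold
    purified = base_logits.clone()
    # Penalize the filtered tokens instead of subtracting the whole toxic
    # distribution, which would also distort the logits of benign tokens.
    purified[toxic_mask] -= penalty
    return purified
```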

    We comprehensively evaluated the proposed framework on the ProsocialDialog dataset. Experimental results show that, compared with a strong In-Context Learning (ICL) baseline, our model improves response diversity (Distinct-2 +0.043) and reduces toxicity (Toxicity -0.45%) while significantly strengthening alignment with human values (Generated RoT Agreement +0.038). Human evaluation further confirms that responses generated by our system are significantly superior to the baseline in prosociality, respect, and overall quality. This study demonstrates that combining precise suppression with deep value guidance can effectively enhance the moral intelligence and interaction quality of dialogue systems at the decoding stage without sacrificing the model's general capabilities.

    As conversational systems powered by Large Language Models (LLMs) become increasingly prevalent, their associated ethical challenges have grown more pronounced. When responding to problematic user inputs, existing models, even when guided by human values through mechanisms like Rules of Thumb (RoTs), often struggle to generate responses that are both safe and constructive. This frequently leads to overly passive or evasive replies, a phenomenon known as the "safe-but-useless" problem, highlighting a critical need for more sophisticated alignment techniques that can deeply instill human values into the generative process.

    To address this challenge, this thesis proposes a novel, two-stage, tuning-free, decoding-time logit optimization framework. In the first stage, Unlearning with Logit Filtering (ULF), we introduce a surgical suppression mechanism. Unlike methods that blindly subtract logit distributions, ULF leverages a specialized assistant model (ToxicLM) to precisely identify and apply a suppressive penalty only to high-confidence toxic tokens, thereby preserving the integrity of the original language distribution. In the second stage, RoT-Agreement Guided Sampling (RoTagree-GS), we achieve deep value alignment. Instead of relying on static preference vectors, RoTagree-GS dynamically models the semantic agreement between a candidate response and a target RoT as a differentiable value function. We then use the gradient of this function to directly steer the logit distribution, providing a more nuanced and context-aware guidance signal.
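
    As a complement, here is a minimal PyTorch sketch of the gradient-guided step in the second stage, written under stated assumptions: `value_model` is a hypothetical differentiable scorer returning a scalar agreement score, and the soft-embedding relaxation and step size are illustrative choices rather than the exact formulation in the thesis.

```python
import torch

def rot_agreement_guided_logits(logits: torch.Tensor,
                                token_embeddings: torch.Tensor,
                                rot_embedding: torch.Tensor,
                                value_model: torch.nn.Module,
                                step_size: float = 1.0) -> torch.Tensor:
    """One guidance step: nudge next-token logits toward higher RoT
    agreement, as scored by a differentiable value model."""
    logits = logits.detach().requires_grad_(True)
    # Relax the discrete token choice: the expected next-token embedding
    # under the softmax distribution is differentiable w.r.t. the logits.
    probs = torch.softmax(logits, dim=-1)
    expected_embedding = probs @ token_embeddings          # shape: (dim,)
    # Scalar agreement between the soft continuation and the target RoT.
    agreement = value_model(expected_embedding, rot_embedding)
    agreement.backward()
    # Gradient ascent on the agreement value directly in logit space.
    return (logits + step_size * logits.grad).detach()
```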

    We conducted a comprehensive evaluation of our framework on the ProsocialDialog dataset. Experimental results demonstrate that, compared to a strong In-context Learning (ICL) baseline, our model significantly enhances human value alignment (Generated RoT Agreement +0.038) while concurrently improving response diversity (Distinct-2 +0.043) and reducing toxicity (Toxicity -0.45%). Furthermore, human subjective evaluations confirm that responses generated by our system are markedly superior in terms of prosociality, respect, and overall quality. This research validates that the synergy of precise, filtered suppression and deep, gradient-based value guidance provides an effective and lightweight path to enhance the moral intelligence and interactive quality of dialogue systems at the decoding stage.
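
    For reference, Distinct-2 measures response diversity as the ratio of unique bigrams to total bigrams over the generated responses. The short sketch below uses simple whitespace tokenization and hypothetical example responses; the thesis's exact tokenization may differ.

```python
from typing import List

def distinct_2(responses: List[str]) -> float:
    """Ratio of unique bigrams to total bigrams across a set of responses."""
    bigrams = []
    for response in responses:
        tokens = response.split()  # whitespace tokenization (assumption)
        bigrams.extend(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / max(len(bigrams), 1)

# Hypothetical usage with two toy responses:
print(distinct_2(["I hear you, but that could hurt someone.",
                  "I hear you, but please consider their feelings."]))
```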

    Abstract (Chinese)
    Abstract (English)
    Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
      1.1 Background
      1.2 Motivation
      1.3 Literature Review
        1.3.1 Moral Dialogue Systems
        1.3.2 Logit-Level Unlearning of Harmful Behaviors
        1.3.3 Logit-Level Value Alignment
      1.4 Problem Statement
      1.5 Brief Description of Research Methods and Contributions
    Chapter 2 Proposed Methods
      2.1 Framework Overview
      2.2 Modeling Response Generation as a Markov Decision Process (MDP)
      2.3 Unlearning Harmful Behavior with Logit Filtering
        2.3.1 Reversing the Unlearning Loss Function to Train a Toxic Model
        2.3.2 Filtering High-Confidence Toxic Tokens
        2.3.3 Obtaining the Purified Unlearned Logits
      2.4 RoT-Agreement Guided Sampling
        2.4.1 Modeling RoT-Agreement as a Value Function
        2.4.2 Logit Optimization with RoT Agreement
        2.4.3 Top-k Approximation for Computational Efficiency
    Chapter 3 Datasets
      3.1 Introduction
      3.2 ProsocialDialog
        3.2.1 Rationale and Design Philosophy
        3.2.2 Human-Machine Collaborative Data Collection and Annotation
        3.2.3 Dataset Structure and Safety Classification
        3.2.4 Application in This Study
        3.2.5 Core Value and Analysis
      3.3 The Moral Integrity Corpus
        3.3.1 Rationale and Design Philosophy
        3.3.2 Annotation Framework and Data Construction
        3.3.3 Application in This Study
        3.3.4 Core Value and Analysis
      3.4 Real Toxicity Prompts
        3.4.1 Rationale and Core Problem
        3.4.2 Data Construction, Scale, and Characteristics
        3.4.3 Application in This Study
        3.4.4 Methodological Limitations
    Chapter 4 Experimental Setup and Results
      4.1 Experimental Setup
        4.1.1 Reference RoT Source
        4.1.2 Training Details of ToxicLM
        4.1.3 Training Details of RoT Agreement Model
        4.1.4 Baseline System
      4.2 Evaluation Metrics
        4.2.1 Fluency Evaluation Metrics
        4.2.2 Alignment Evaluation Metrics
        4.2.3 Safety Evaluation Metrics
        4.2.4 Human Subjective Evaluation Metrics
      4.3 Experiment Results and Discussion
        4.3.1 Main Results on ProsocialDialog
        4.3.2 Ablation Study
        4.3.3 Human Subjective Evaluation Results
        4.3.4 Case Study: Dialogue Samples
    Chapter 5 Conclusion and Future Work
      5.1 Conclusion
      5.2 Future Work
    References

