
Graduate Student: Chen, Jie-Sheng (陳杰陞)
Thesis Title: Dynamic Multi-step Deep Reinforcement Learning Based on Q-value Coefficient of Variation for Improved Sample Efficiency (基於Q值變異係數之動態多步深度強化學習用以提升樣本效率)
Advisor: Sue, Chuan-Ching (蘇銓清)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2025
Academic Year of Graduation: 113
Language: Chinese
Number of Pages: 94
Keywords (Chinese): 深度強化學習、動態多步驟更新、變異係數、樣本效率
Keywords (English): Deep Reinforcement Learning, Dynamic Multi-step Updates, Coefficient of Variation, Sample Efficiency
In recent years, Deep Reinforcement Learning (DRL) has made remarkable progress across many fields, yet low sample efficiency remains a major challenge. This study proposes a novel method for the multi-step Deep Q-Network (DQN) that dynamically adjusts the multi-step return length n based on the coefficient of variation (CV) of the Q-values in order to improve sample efficiency.
    Two dynamic adjustment strategies are designed: (1) adjusting n once every few episodes, and (2) adjusting n on the fly every few timesteps. Both strategies compare the short-term and long-term trends of the Q-value CV to assess learning stability and raise or lower n accordingly (a minimal code sketch of this rule is given after this abstract).
    Experimental results in the standard OpenAI Gym environments (CartPole-v0, Acrobot-v1, and MountainCar-v0) show that the proposed dynamic adjustment strategies effectively improve sample efficiency and outperform both the fixed-n strategy and a clustering-based dynamic n-step method in average performance over the final episodes.
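
    As a rough illustration of the adjustment rule described above, the Python sketch below computes the coefficient of variation of a batch of Q-value estimates and nudges n by comparing a short-term average CV against a long-term one. The window sizes, the bounds n_min and n_max, and the direction of the increase/decrease decision are illustrative assumptions, not the thesis's actual hyperparameters or rule.

```python
import numpy as np

def coefficient_of_variation(q_values):
    """CV of a batch of Q-value estimates: standard deviation over |mean|."""
    q = np.asarray(q_values, dtype=np.float64)
    return q.std() / (abs(q.mean()) + 1e-8)  # small epsilon guards a near-zero mean

def adjust_n(cv_history, n, n_min=1, n_max=8, short_window=5, long_window=20):
    """Nudge the multi-step length n by comparing short- vs long-term mean CV.

    Assumed reading: a recent CV below the long-run CV signals stable learning,
    so n is increased to propagate rewards faster; otherwise n is decreased
    toward one-step updates to limit target variance.
    """
    if len(cv_history) < long_window:
        return n  # not enough history to compare trends yet
    short_cv = float(np.mean(cv_history[-short_window:]))
    long_cv = float(np.mean(cv_history[-long_window:]))
    return min(n + 1, n_max) if short_cv < long_cv else max(n - 1, n_min)
```

    Under this reading, the episode-level strategy (presumably DynDQN_n_E) would call adjust_n once every few episodes, while the timestep-level strategy (presumably DynDQN_n_T) would call it every few timesteps; only the call frequency differs.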

    In recent years, Deep Reinforcement Learning (DRL) has achieved significant progress across various fields. However, the issue of low sample efficiency remains a major challenge. This research proposes a novel method for improving sample efficiency in multi-step Deep Q-Networks (DQN) by dynamically adjusting the multi-step return length n based on the coefficient of variation (CV) of Q-values. Two dynamic adjustment strategies are designed: one that adjusts n every few episodes and another that adjusts it every few timesteps. By comparing the short-term and long-term trends of Q-value CVs to assess learning stability, the method increases or decreases the value of n accordingly. Experimental results in standard OpenAI Gym environments (CartPole-v0, Acrobot-v1, and MountainCar-v0) demonstrate that the proposed dynamic strategies improve sample efficiency and outperform fixed-n and clustering-based dynamic n-step methods in terms of the average performance over the final episodes.
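
    For context, the multi-step return length n selected by either strategy enters learning through the usual n-step DQN target. The sketch below shows that textbook target for a single transition, assuming rewards holds the n collected rewards and bootstrap_q is max_a Q_target(s_{t+n}, a) from a target network; it is not necessarily the exact form implemented in the thesis.

```python
def n_step_target(rewards, bootstrap_q, gamma, done):
    """Textbook n-step TD target used by multi-step DQN:
        G = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}
            + gamma^n * max_a Q_target(s_{t+n}, a)
    The bootstrap term is dropped when the episode terminates within the n steps.
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r          # discounted sum of the n observed rewards
    if not done:
        g += (gamma ** len(rewards)) * bootstrap_q  # bootstrap from the target network
    return g
```

    A larger n propagates reward information faster but mixes more sampled rewards into the target, which is the bias-variance trade-off the CV-based adjustment aims to balance.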

Abstract (Chinese) I
    Summary II
    Acknowledgements VI
    Content VII
    List of Tables IX
    List of Figures X
    1 Introduction 1
    2 Background and Related Work 4
    2.1 Background 4
    2.1.1 Deep Reinforcement Learning 4
    2.2 Related Work 7
    2.2.1 Fixed multi-step TD extension 8
    2.2.2 Dynamic multi-step TD 10
    2.3 Motivation 15
    3 Proposed Method 18
    3.1 Dynamic multi-step DQN 18
    3.2 Algorithm 21
    3.2.1 Dynamic multi-step Algorithm (DynDQN_n_E, DynDQN_n_T) 21
    4 Evaluation 23
    4.1 Environment 23
    4.2 Hyperparameter 25
    4.3 Result for Cartpole-v0 27
    4.3.1 Grid search for DynDQN_n_E 27
    4.3.2 Grid search for DynDQN_n_T 31
    4.3.3 Performance 34
    4.3.4 Analysis of instability 36
    4.4 Result for Acrobot-v1 40
    4.4.1 Grid search for DynDQN_n_E 40
    4.4.2 Grid search for DynDQN_n_T 44
    4.4.3 Performance 47
    4.4.4 Analysis of instability 49
    4.5 Discussion 52
    5 Conclusion 56
    Reference 57
    Appendix 60
    A.1 Hyperparameter 60
    A.2 ESDQN Implementation problem 63
    A.3 Silhouette Score 64
    A.4 Hyperparameter for MountainCar-v0 64
    A.5 Result for MountainCar-v0 68
    A.5.1 Grid search for DynDQN_n_E 68
    A.5.2 Grid search for DynDQN_n_T 72
    A.5.3 Performance 75
    A.5.4 Analysis of instability 77


Full-text availability: on campus from 2030-08-19; off campus from 2030-08-19.
    The electronic thesis has not yet been authorized for public release; please consult the library catalog for the print copy.