| Author: | 陳杰陞 Chen, Jie-Sheng |
|---|---|
| Thesis Title: | 基於Q值變異係數之動態多步深度強化學習用以提升樣本效率 (Dynamic Multi-step Deep Reinforcement Learning Based on Q-value Coefficient of Variation for Improved Sample Efficiency) |
| Advisor: | 蘇銓清 Sue, Chuan-Ching |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2025 |
| Graduation Academic Year: | 113 |
| Language: | Chinese |
| Number of Pages: | 94 |
| Keywords (Chinese): | 深度強化學習、動態多步驟更新、變異係數、樣本效率 |
| Keywords (English): | Deep Reinforcement Learning, Dynamic Multi-step Updates, Coefficient of Variation, Sample Efficiency |
In recent years, Deep Reinforcement Learning (DRL) has achieved significant progress across various fields; however, low sample efficiency remains a major challenge. This thesis proposes a novel method for improving the sample efficiency of multi-step Deep Q-Networks (DQN) by dynamically adjusting the multi-step return length n based on the coefficient of variation (CV) of the Q-values.

Two dynamic adjustment strategies are designed: (1) adjusting n every few episodes, and (2) adjusting n every few timesteps. Both strategies compare the short-term and long-term trends of the Q-value CV to assess learning stability, and increase or decrease n accordingly.

Experimental results in standard OpenAI Gym environments (CartPole-v0, Acrobot-v1, and MountainCar-v0) show that the proposed dynamic strategies effectively improve sample efficiency and outperform both fixed-n baselines and a clustering-based dynamic n-step method in terms of average performance over the final episodes.
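In n-step DQN, the return length n controls the bootstrapped target G_t^{(n)} = r_t + γ r_{t+1} + ... + γ^{n-1} r_{t+n-1} + γ^n max_a Q(s_{t+n}, a; θ⁻), so a larger n propagates reward information faster at the cost of higher variance. The snippet below is a minimal, illustrative sketch of the CV-based adjustment idea described in the abstract, not the thesis implementation: the class name, window sizes, the n range, and the simple ±1 update rule are all assumptions made for illustration.

```python
# Illustrative sketch (assumed details, not the thesis implementation):
# adjust the multi-step return length n by comparing the coefficient of
# variation (CV) of recent Q-values over a short and a long window.
from collections import deque
import numpy as np


class DynamicNStepController:
    def __init__(self, n_init=3, n_min=1, n_max=10,
                 short_window=100, long_window=1000):
        # Window sizes and the [n_min, n_max] range are illustrative choices.
        self.n = n_init
        self.n_min, self.n_max = n_min, n_max
        self.short = deque(maxlen=short_window)  # short-term Q-value history
        self.long = deque(maxlen=long_window)    # long-term Q-value history

    @staticmethod
    def _cv(values):
        # Coefficient of variation = std / |mean|, with guards for short
        # histories and near-zero means.
        if len(values) < 2:
            return 0.0
        arr = np.asarray(values, dtype=np.float64)
        return arr.std() / (abs(arr.mean()) + 1e-8)

    def record_q(self, q_value):
        # Called every timestep with the online network's Q-value for the
        # selected action.
        self.short.append(q_value)
        self.long.append(q_value)

    def update_n(self):
        # Called periodically (every few episodes or every few timesteps,
        # matching the two strategies in the abstract). If the short-term CV
        # is below the long-term CV, learning looks stable and n is increased;
        # otherwise n is decreased to reduce target variance.
        if self._cv(self.short) < self._cv(self.long):
            self.n = min(self.n + 1, self.n_max)
        else:
            self.n = max(self.n - 1, self.n_min)
        return self.n
```

In a training loop, record_q would be fed each step's Q-value, and update_n would be invoked at whichever interval the chosen strategy uses; the returned n then determines how many rewards are accumulated in the n-step target. The exact comparison rule and step size used in the thesis may differ from this sketch.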
On-campus access: available from 2030-08-19.