| Author: | 陳杰陞 Chen, Jie-Sheng |
|---|---|
| Thesis Title: | 基於Q值變異係數之動態多步深度強化學習用以提升樣本效率 (Dynamic Multi-step Deep Reinforcement Learning Based on Q-value Coefficient of Variation for Improved Sample Efficiency) |
| Advisor: | 蘇銓清 Sue, Chuan-Ching |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2025 |
| Graduation Academic Year: | 113 |
| Language: | Chinese |
| Number of Pages: | 94 |
| Keywords (Chinese): | 深度強化學習、動態多步驟更新、變異係數、樣本效率 |
| Keywords (English): | Deep Reinforcement Learning, Dynamic Multi-step Updates, Coefficient of Variation, Sample Efficiency |
In recent years, Deep Reinforcement Learning (DRL) has achieved significant progress across various fields; however, low sample efficiency remains a major challenge. This thesis proposes a novel method for improving the sample efficiency of multi-step Deep Q-Networks (DQN) by dynamically adjusting the multi-step return length n based on the coefficient of variation (CV) of the Q-values.

Two dynamic adjustment strategies are designed: (1) adjusting n every few episodes, and (2) adjusting n every few timesteps. Both strategies compare the short-term and long-term trends of the Q-value CV to assess learning stability, and increase or decrease n accordingly.

Experimental results in standard OpenAI Gym environments (CartPole-v0, Acrobot-v1, and MountainCar-v0) show that the proposed dynamic strategies effectively improve sample efficiency and outperform both fixed-n baselines and a clustering-based dynamic n-step method in terms of average performance over the final episodes.
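In n-step DQN, the return length n controls the bootstrapped target G_t^{(n)} = r_t + γ r_{t+1} + ... + γ^{n-1} r_{t+n-1} + γ^n max_a Q(s_{t+n}, a; θ⁻), so a larger n propagates reward information faster at the cost of higher variance. The snippet below is a minimal, illustrative sketch of the CV-based adjustment idea described in the abstract, not the thesis implementation: the class name, window sizes, the n range, and the simple ±1 update rule are all assumptions made for illustration.

```python
# Illustrative sketch (assumed details, not the thesis implementation):
# adjust the multi-step return length n by comparing the coefficient of
# variation (CV) of recent Q-values over a short and a long window.
from collections import deque
import numpy as np


class DynamicNStepController:
    def __init__(self, n_init=3, n_min=1, n_max=10,
                 short_window=100, long_window=1000):
        # Window sizes and the [n_min, n_max] range are illustrative choices.
        self.n = n_init
        self.n_min, self.n_max = n_min, n_max
        self.short = deque(maxlen=short_window)  # short-term Q-value history
        self.long = deque(maxlen=long_window)    # long-term Q-value history

    @staticmethod
    def _cv(values):
        # Coefficient of variation = std / |mean|, with guards for short
        # histories and near-zero means.
        if len(values) < 2:
            return 0.0
        arr = np.asarray(values, dtype=np.float64)
        return arr.std() / (abs(arr.mean()) + 1e-8)

    def record_q(self, q_value):
        # Called every timestep with the online network's Q-value for the
        # selected action.
        self.short.append(q_value)
        self.long.append(q_value)

    def update_n(self):
        # Called periodically (every few episodes or every few timesteps,
        # matching the two strategies in the abstract). If the short-term CV
        # is below the long-term CV, learning looks stable and n is increased;
        # otherwise n is decreased to reduce target variance.
        if self._cv(self.short) < self._cv(self.long):
            self.n = min(self.n + 1, self.n_max)
        else:
            self.n = max(self.n - 1, self.n_min)
        return self.n
```

In a training loop, record_q would be fed each step's Q-value, and update_n would be invoked at whichever interval the chosen strategy uses; the returned n then determines how many rewards are accumulated in the n-step target. The exact comparison rule and step size used in the thesis may differ from this sketch.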
On-campus access: available from 2030-08-19.