| Graduate Student | 吳東承 Wu, Tung-Cheng |
|---|---|
| Thesis Title | 基於期望值方法的強化學習價值逼近函數之實務計算方式 Moderating Maximal Value - a Practical Expectation-Based Method for Value Function Approximation in Reinforcement Learning |
| Advisor | 賴槿峰 Lai, Chin-Feng |
| Degree | Master |
| Department | College of Engineering - Department of Engineering Science |
| Year of Publication | 2019 |
| Academic Year of Graduation | 107 |
| Language | Chinese |
| Pages | 51 |
| Chinese Keywords | 優勢學習、期望值方法、價值逼近函數 |
| English Keywords | Advantage Learning, Expectation-based Method, Value Function Approximation |
| Access Counts | Views: 138, Downloads: 23 |
This study proposes a new practical computation scheme for the expectation-based method and applies it to value function approximation in reinforcement learning, in order to mitigate the value overestimation that arises when maximization and the ϵ-greedy policy are used together. The proposed value estimation method is then incorporated into the advantage learning algorithm, yielding the expected advantage learning algorithm.
Whereas the traditional softmax policy computes the action probability distribution directly from the action values, this study computes it from the tanh of the action values. Because the tanh function bounds growth, the maximal action value no longer dominates the action probability distribution, so the resulting distribution does not concentrate on any particular action. This ensures that values computed with the expectation-based method do not suffer from severe overestimation.
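A rough formalization of this idea, based only on the abstract (the thesis may use a slightly different form, e.g. with a temperature parameter):

$$\pi(a \mid s) = \frac{\exp\!\big(\tanh Q(s,a)\big)}{\sum_{b} \exp\!\big(\tanh Q(s,b)\big)}, \qquad V(s) \approx \sum_{a} \pi(a \mid s)\, Q(s,a).$$

Since $\tanh$ is bounded in $(-1, 1)$, no single action value can dominate the exponentials, so the distribution, and hence the expectation, stays comparatively flat.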
In the experiments, the performance of the Deep Q Network algorithm, the advantage learning algorithm, and the expected advantage learning algorithm is evaluated with four metrics: total score, total number of steps, average score, and highest score. In all three experimental environments chosen for this study, expected advantage learning achieves the highest score. Observing the total score shows that, under the same total number of training iterations, the expected advantage learning algorithm increases the total score by up to 6% compared with the advantage learning algorithm.
In this study, we propose a practical expectation-based value function approximation method to reduce value overestimation in temporal-difference (TD) learning. Because of the optimizer's curse, values are easily overestimated under either a softmax policy or a greedy policy.
To address this problem, we compute the policy from the tanh of the action values rather than from the action values themselves. The tanh function bounds extreme values and weakens the influence of the maximal action value on the policy. With this tanh softmax policy, our expectation-based method successfully reduces value overestimation.
To examine the benefit of reducing value overestimation, we incorporate our expectation-based method into advantage learning (AL) and propose expected advantage learning (eAL). We use four criteria to evaluate the performance of Deep Q Network (DQN), AL, and eAL in every episode. Our results show that eAL improves performance on the score-related criteria, scoring up to 6% higher than AL on the highest-score criterion.
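A minimal sketch of how the described tanh softmax policy and the expectation-based target could be computed (an illustration based on the abstract only, not the thesis code; the function names, the temperature-free softmax, and the one-step TD form are assumptions):

```python
import numpy as np

def tanh_softmax(q_values):
    """Action distribution built from tanh-squashed action values.

    Because tanh maps every value into (-1, 1), the largest Q-value
    cannot dominate the softmax, so the distribution stays spread out.
    """
    z = np.tanh(np.asarray(q_values, dtype=float))
    z = z - z.max()          # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def expectation_based_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step TD target that replaces the max operator of Q-learning
    with an expectation over the tanh softmax policy."""
    if done:
        return reward
    pi = tanh_softmax(next_q_values)
    expected_value = float(pi @ np.asarray(next_q_values, dtype=float))
    return reward + gamma * expected_value

# Example: the expected value sits below the (possibly overestimated) maximum.
q_next = [1.0, 0.2, -0.5]
print(expectation_based_target(reward=0.0, next_q_values=q_next))
```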