
Graduate Student: Wu, Tung-Cheng (吳東承)
Thesis Title: Moderating Maximal Value - a Practical Expectation-Based Method for Value Function Approximation in Reinforcement Learning (基於期望值方法的強化學習價值逼近函數之實務計算方式)
Advisor: Lai, Chin-Feng (賴槿峰)
Degree: Master
Department: College of Engineering - Department of Engineering Science
Year of Publication: 2019
Graduation Academic Year: 107
Language: Chinese
Number of Pages: 51
Chinese Keywords: 優勢學習 (Advantage Learning), 期望值方法 (Expectation-based Method), 價值逼近函數 (Value Function Approximation)
Foreign Keywords: Advantage Learning, Expectation-based Method, Value Function Approximation
Access Statistics: 138 views, 23 downloads
Abstract (translated from the Chinese): This study proposes a new, practical expectation-based method for computing the value function approximation in reinforcement learning. It alleviates the value overestimation problem that arises when an optimization method is used together with the ϵ-greedy policy. The proposed value estimation method is then applied to the advantage learning algorithm, yielding the expected advantage learning algorithm.

Whereas the traditional softmax policy computes the action probability distribution directly from the action-values, this study computes it from the tanh of the action-values. Because the tanh function limits growth, the maximal action-value cannot dominate the action probability distribution, so the resulting distribution does not concentrate on any particular action. This ensures that values computed with the expectation-based method do not suffer from severe overestimation.

In the experiments, the Deep Q Network algorithm, the advantage learning algorithm, and the expected advantage learning algorithm are evaluated with four criteria: total score, total number of steps, average score, and highest score. Expected advantage learning obtains the highest score in all three experimental environments chosen for this study. Comparing total scores shows that, under the same total amount of training, the expected advantage learning algorithm earns up to 6% more total score than the advantage learning algorithm.

In this study, we propose a practical expectation-based value function approximation method to decrease value overestimation in temporal-difference (TD) learning. Because of the Optimizer's Curse, values are easily overestimated under both the softmax policy and the greedy policy.
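As a rough illustration of the contrast described above, the following Python snippet compares a max-based TD target with an expectation-based one. It is a minimal sketch, not code from the thesis; the variable names (`q_next`, `pi_next`) and the toy numbers are hypothetical.

```python
import numpy as np

def max_based_target(reward, q_next, gamma=0.99):
    """Q-learning / DQN style target: bootstrap on the maximum next-state
    action-value, which tends to propagate overestimated values."""
    return reward + gamma * np.max(q_next)

def expectation_based_target(reward, q_next, pi_next, gamma=0.99):
    """Expectation-based target: bootstrap on the policy-weighted average of
    the next-state action-values, which moderates the maximal value."""
    return reward + gamma * np.dot(pi_next, q_next)

# Toy example in which one action-value looks (noisily) inflated.
q_next = np.array([1.0, 1.2, 5.0])    # hypothetical action-value estimates
pi_next = np.array([0.3, 0.3, 0.4])   # hypothetical action probabilities
print(max_based_target(0.0, q_next))                   # 4.95, bootstraps on 5.0
print(expectation_based_target(0.0, q_next, pi_next))  # about 2.63, weighted average
```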

To address this problem, we compute the policy from the tanh of the action-values rather than from the action-values themselves. The tanh function limits extreme values and reduces the influence of the maximal action-value on the policy. With this tanh softmax policy, our expectation-based method successfully decreases value overestimation.
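The following is a minimal sketch of the tanh softmax policy, under the assumption that the policy is simply a softmax applied to tanh(Q) rather than to Q directly; the function names and the temperature parameter are illustrative, not the thesis's exact formulation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def standard_softmax_policy(q_values, temperature=1.0):
    """Softmax directly over the action-values: a single large (possibly
    overestimated) Q-value can dominate the distribution."""
    return softmax(np.asarray(q_values, dtype=float) / temperature)

def tanh_softmax_policy(q_values, temperature=1.0):
    """Softmax over tanh(Q): tanh squashes every action-value into (-1, 1),
    so no single value can dominate the distribution."""
    return softmax(np.tanh(np.asarray(q_values, dtype=float)) / temperature)

q = [1.0, 1.5, 8.0]                    # the last value looks overestimated
print(standard_softmax_policy(q))      # almost all mass on the last action
print(tanh_softmax_policy(q))          # much flatter distribution
```

With these toy values, the standard softmax assigns roughly 99.8% of the probability to the inflated action, while the tanh variant stays close to uniform, which is the behaviour the abstract attributes to the tanh softmax policy.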

To examine the benefit of decreasing value overestimation, we incorporate our expectation-based method into advantage learning (AL) and propose expected advantage learning (eAL). We use four criteria to evaluate the performance of Deep Q Network (DQN), AL, and eAL in every episode. Our results show that eAL improves performance on the score-related criteria and scores up to 6% higher than AL on the highest-score criterion.
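One plausible way to combine the two ideas is to take the standard advantage-learning target (a max-based TD target minus a penalty proportional to the action gap) and replace each max with an expectation under the tanh softmax policy. The sketch below is an interpretation of the abstract, not the thesis's exact eAL definition; alpha, gamma, and all function names are hypothetical.

```python
import numpy as np

def tanh_softmax(q, temperature=1.0):
    """Tanh softmax policy (see the previous sketch)."""
    z = np.tanh(np.asarray(q, dtype=float)) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def al_target(r, q_s, a, q_next, gamma=0.99, alpha=0.9):
    """Advantage learning target: max-based TD target minus an action-gap
    penalty alpha * (max_a Q(s, a) - Q(s, a))."""
    td = r + gamma * np.max(q_next)
    return td - alpha * (np.max(q_s) - q_s[a])

def eal_target(r, q_s, a, q_next, gamma=0.99, alpha=0.9):
    """Hypothetical expected advantage learning target: each max above is
    replaced by an expectation under the tanh softmax policy."""
    pi_s, pi_next = tanh_softmax(q_s), tanh_softmax(q_next)
    td = r + gamma * np.dot(pi_next, q_next)
    return td - alpha * (np.dot(pi_s, q_s) - q_s[a])

q_s = np.array([0.5, 2.0, 1.0])       # hypothetical current action-values
q_next = np.array([1.0, 1.2, 5.0])    # hypothetical next-state action-values
print(al_target(0.0, q_s, a=1, q_next=q_next))
print(eal_target(0.0, q_s, a=1, q_next=q_next))
```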

Table of Contents:
Abstract (Chinese)
Extended Abstract (English)
Acknowledgements
Table of Contents
List of Tables
List of Figures
List of Abbreviations
List of Symbols
Chapter 1. Introduction
  1.1. Research Motivation
  1.2. Research Objectives
  1.3. Chapter Overview
Chapter 2. Background and Related Work
  2.1. Background
    2.1.1. Neural Network Architectures
    2.1.2. Optimization Algorithms
    2.1.3. Model Combination
  2.2. Introduction to Reinforcement Learning
    2.2.1. Markov Decision Processes
    2.2.2. Value Functions
    2.2.3. Temporal Difference Learning
    2.2.4. Deep Q Network
  2.3. Value Overestimation
    2.3.1. Causes and Consequences of Value Overestimation
    2.3.2. Recent Related Work
Chapter 3. Methodology
  3.1. Expected Advantage Learning
    3.1.1. Advantage Learning
    3.1.2. New Value Definition
    3.1.3. Analysis of the Advantage Update
    3.1.4. Candidate Policies
  3.2. Learning Procedure
    3.2.1. Value Estimation Model
    3.2.2. Decision Process
    3.2.3. Updating Model Parameters
    3.2.4. Learning Procedure of the Expected Advantage Learning Algorithm
Chapter 4. Results and Discussion
  4.1. Experimental Design
    4.1.1. Experimental Environments
    4.1.2. Experimental Procedure
  4.2. Experimental Results
    4.2.1. Policy Selection Experiment
    4.2.2. Performance Comparison Experiment
  4.3. Discussion
Chapter 5. Conclusion and Future Work
  5.1. Conclusions
  5.2. Future Work
References
Appendix A. Pseudocode of Related Algorithms
  A.1. Pseudocode of DQN
  A.2. Pseudocode of AL


Full Text Availability: On campus: open access from 2020-07-23; Off campus: open access from 2020-07-23