
Author: Chen, Jie-Han (陳杰翰)
Thesis Title: Neural Grafting Network: Construct Multi-Task Contextual Policy by Transfer Learning (神經嫁接網絡: 利用遷移學習建構多任務情境策略)
Advisor: Chuang, Kun-Ta (莊坤達)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2018
Graduation Academic Year: 106 (2017–2018)
Language: English
Pages: 36
Keywords: Neural Network, Dropout, Transfer Learning, Multi-Task Reinforcement Learning, StarCraft
Usage: 94 views, 0 downloads
    In recent years, deep reinforcement learning has achieved impressive results in many fields. For reinforcement learning, however, how the reward function is defined is crucial for guiding learning, and designing it becomes even more challenging in multi-task settings. Without sufficient reward guidance, reinforcement learning algorithms can hardly lead an agent to acquire useful knowledge. In this thesis we propose a novel concept for guiding reinforcement learning to acquire multi-task knowledge. We introduce Dropout as a dynamic routing mechanism to control which neurons are activated under different tasks, and we additionally transfer the parameters of the network's policy layer to assist learning. Based on this concept, we implement a special neural network architecture, the Neural Grafting Network. Experimental results show that the Neural Grafting Network can handle complex multi-task scenarios through reinforcement learning, that transfer learning greatly reduces training time and helps the network learn policy knowledge, and that this architecture mitigates the effect of catastrophic forgetting. Moreover, transferring the policy-layer parameters has a significant effect in the Neural Grafting Network, which supports the hypothesis in prior work that the output layers of a neural network are more closely related to the task itself.
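    As a concrete illustration of the Dropout-based dynamic routing described above, the following is a minimal PyTorch sketch, not the thesis implementation: the class name TaskRoutedPolicy, the layer sizes, and the use of one fixed per-task mask are all assumptions. It shows a shared network whose hidden neurons are gated by a task-specific binary mask with inverted-Dropout scaling, while a single policy layer produces the action logits.

```python
# Illustrative sketch (assumptions, not the thesis code): Dropout-style binary
# masks used as dynamic routing, so each task activates its own subset of
# neurons in a shared hidden layer.
import torch
import torch.nn as nn


class TaskRoutedPolicy(nn.Module):
    def __init__(self, obs_dim: int, hidden_dim: int, num_actions: int,
                 num_tasks: int, keep_prob: float = 0.5):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)     # shared early layer
        self.policy = nn.Linear(hidden_dim, num_actions)  # policy (output) layer
        # One binary mask per task, sampled once, so each task consistently
        # routes through its own subset of hidden neurons.
        masks = (torch.rand(num_tasks, hidden_dim) < keep_prob).float() / keep_prob
        self.register_buffer("task_masks", masks)

    def forward(self, obs: torch.Tensor, task_id: int) -> torch.Tensor:
        h = torch.relu(self.encoder(obs))
        h = h * self.task_masks[task_id]   # dynamic routing: gate neurons by task
        return self.policy(h)              # action logits


# Usage: one network body serves several tasks, each through its own mask.
policy = TaskRoutedPolicy(obs_dim=16, hidden_dim=64, num_actions=4, num_tasks=3)
logits = policy(torch.randn(1, 16), task_id=1)
```

    Sampling one mask per task and keeping it fixed is only one simple way to realize "different neurons for different tasks"; the thesis may condition its routing differently.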

    Deep reinforcement learning has recently achieved impressive performance in many fields. However, designing the reward function for the learning agent is critical, especially in multi-task reinforcement learning. Without explicit rewards, it is challenging for the agent to learn useful knowledge in multi-task scenarios. We propose a novel concept that enables the agent to learn knowledge in multi-task decision problems, combining transfer learning with dynamic neural network routing based on Dropout. Building on this dynamic routing, we implement the concept in a special neural network architecture called the Neural Grafting Network. The results show that the Neural Grafting Network can handle the domain adaptation problem in multi-task environments and mitigate catastrophic forgetting when transferring different prior knowledge for a specific task. In addition, transferring the weights of both the early layers and the later (output) layers, while excluding the hidden layers in between, significantly helped the multi-task agent during training, which supports the common assumption in transfer learning that the weights of the later layers are highly related to knowledge about the specific task.
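    The weight-transfer scheme mentioned above, copying the early layers and the later policy layer while leaving the intermediate hidden layers untouched, can be sketched as follows. The three-layer MLP, the layer roles, and the use of PyTorch are assumptions made purely for illustration.

```python
# Illustrative sketch (assumptions, not the thesis code): graft the pretrained
# early and policy-layer weights into a new network and keep the hidden layer
# at its fresh random initialization.
import torch
import torch.nn as nn


def make_net(obs_dim: int = 16, hidden_dim: int = 64, num_actions: int = 4) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(obs_dim, hidden_dim), nn.ReLU(),     # index 0: early layer
        nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),  # index 2: hidden layer (not transferred)
        nn.Linear(hidden_dim, num_actions),            # index 4: policy / output layer
    )


source = make_net()   # stands in for a network trained on a basic task
target = make_net()   # new multi-task network, freshly initialized

with torch.no_grad():
    target[0].load_state_dict(source[0].state_dict())  # graft the early layer
    target[4].load_state_dict(source[4].state_dict())  # graft the policy layer
    # target[2], the hidden layer, deliberately keeps its random initialization.
```

    Grafting in the input-side and output-side weights while leaving the middle free to adapt reflects the stated intuition that the transferred output layer carries the task-related policy knowledge.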

    中文摘要 (Chinese Abstract)
    Abstract
    Acknowledgment
    Contents
    List of Figures
    1 Introduction
    2 Background
    3 Related Work
    4 Method
      4.1 Inspiration
      4.2 Neural Grafting Network
    5 Experimental Settings
      5.1 Learning Environment
      5.2 Base model
      5.3 Learning Algorithm
      5.4 Basic tasks
        5.4.1 CollectMineralShards-SingleMarine
        5.4.2 CollectMineralShards-SingleSCV
        5.4.3 DestroyBuildings-SingleBanShee
      5.5 Complicated tasks
        5.5.1 CollectByFiveMarines
        5.5.2 CollectByFiveMarines-Sparse
        5.5.3 CollectAndDestroy-Sparse
      5.6 Neural Grafting Model
    6 Experimental Results
      6.1 Performance Evaluation
        6.1.1 CollectByFiveMarines
        6.1.2 CollectByFiveMarines-Sparse
        6.1.3 CollectAndDestroy-Sparse
      6.2 Ablation Test
      6.3 Transfer Analysis
    7 Conclusions
    Bibliography


    Full-text access: on campus, available from 2020-08-31; off campus, not available.
    The electronic thesis has not been authorized for public release; please consult the library catalog for the printed copy.