| Graduate Student: | 葉庭瑜 Ye, Ting-Yu |
|---|---|
| Thesis Title: | 基於深度增強式學習之二軸機械手臂視覺追蹤系統研究 Study on Visual Tracking System of a 2-DOF Manipulator Based on Deep Reinforcement Learning |
| Advisor: | 鄭銘揚 Cheng, Ming-Yang |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of Publication: | 2019 |
| Academic Year of Graduation: | 107 |
| Language: | Chinese |
| Number of Pages: | 103 |
| Keywords (Chinese): | 機械手臂、深度增強式學習、視覺追蹤系統 |
| Keywords (English): | Industrial Manipulators, Deep Reinforcement Learning, Visual Tracking System |
As global industry shifts toward factory automation, robot manipulators have become ubiquitous on production lines. To give robot manipulators the ability to learn and to make intelligent decisions in response to different situations, this thesis introduces reinforcement learning into manipulator control. Reinforcement learning learns from a reward signal and is well suited to the sequential decision-making involved in controlling a manipulator; once trained, it can produce optimal decisions for different environmental states. Deep reinforcement learning, a popular branch of reinforcement learning research, integrates deep learning with reinforcement learning and can therefore handle more complex continuous control problems. In light of this, this thesis implements visual tracking systems based on three deep reinforcement learning methods, namely Proximal Policy Optimization with KL-Penalty, Proximal Policy Optimization with CLIP, and Deep Deterministic Policy Gradient, and applies them to a 2-DOF SCARA robot manipulator so that it can make the best decision as the environment changes. The overall performance and the suitability of the three methods for this task are then compared in terms of learning performance and visual tracking experiments on the physical manipulator. The experimental results indicate that Deep Deterministic Policy Gradient performs best among the three.
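For reference, the sketch below reproduces the standard objectives of the three methods named in the abstract, as published by Schulman et al. for PPO and by Silver et al. and Lillicrap et al. for the deterministic policy gradient and DDPG. The notation is the conventional one and is not taken from the thesis itself: $r_t(\theta)$ is the probability ratio between the new and old policies, $\hat{A}_t$ an advantage estimate, $\beta$ the adaptive KL-penalty coefficient, $\epsilon$ the clipping parameter, and $\mu_\theta$, $Q_\phi$ the actor and critic networks. The thesis's own reward design, network architectures, and hyperparameters are described in the full text.

```latex
% Probability ratio between the updated and the old policy (standard PPO notation):
%   r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)

% PPO with KL penalty: surrogate objective minus a KL term weighted by an
% adaptive coefficient \beta.
L^{\text{KL}}(\theta) = \hat{\mathbb{E}}_t\!\left[\, r_t(\theta)\,\hat{A}_t
  - \beta\,\mathrm{KL}\!\left[\pi_{\theta_\text{old}}(\cdot \mid s_t)\,\middle\|\,\pi_\theta(\cdot \mid s_t)\right] \right]

% PPO-CLIP: clip the ratio to [1-\epsilon, 1+\epsilon] and take the pessimistic bound.
L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\, \min\!\left( r_t(\theta)\,\hat{A}_t,\;
  \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t \right) \right]

% DDPG: deterministic policy gradient of the expected return, with actor \mu_\theta
% and critic Q_\phi.
\nabla_\theta J(\theta) = \hat{\mathbb{E}}_t\!\left[\, \nabla_a Q_\phi(s_t, a)\big|_{a=\mu_\theta(s_t)}\;
  \nabla_\theta \mu_\theta(s_t) \right]
```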
On-campus access: open from 2024-06-18.