| Graduate Student: | Le, Minh-Tri (黎明智) |
|---|---|
| Thesis Title: | Robot Arm Grasping Using Learning-Based Pairwise Similarity Matrix for Object Location Estimation and Self-Learning for Object Rotation Estimation |
| Advisor: | Lien, Jenn-Jier (連震杰) |
| Degree: | Doctoral |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Publication Year: | 2022 |
| Graduation Academic Year: | 110 (ROC calendar, 2021-2022) |
| Language: | English |
| Pages: | 127 |
| Keywords: | Robot Arm Grasping, Embedded Systems, Lightweight Model, Self-Supervised Learning |
Deep neural network (DNN) models are increasingly employed to detect the location and rotation of objects in robotic grasping tasks. However, training DNN models requires a large amount of data, and annotating these data is time-consuming. Moreover, although deploying DNN models on embedded systems increases the mobility and versatility of robotic grasping systems, implementing such models is challenging since embedded systems have only limited computing and memory resources. Accordingly, the present study proposes a two-stage robotic grasping model in which a learning-based template-matching algorithm is first employed to estimate the object position, and a deep learning model is then used to estimate the rotation angle.

When template-matching algorithms are applied to rotated objects, the matching accuracy is reduced by the presence of confused matching scores (i.e., high similarity scores between the target in the template and mis-matched objects in the search image). This study therefore uses a density-based clustering algorithm to detect and remove these confused scores, thereby increasing the robustness of the template matching. The matching accuracy is further improved by analyzing the intensity of the likelihood similarity scores within the region of the detected object. The simulation results show that the proposed density- and intensity-based template-matching (DITM) algorithm outperforms existing template-matching approaches on both the OTB-100 benchmark dataset and real-world object data.

For estimating the rotation angle of the detected object, a lightweight robot-arm grasping model is proposed based on the learning-based template matching and the depth image. In the experimental implementation, this lightweight model achieves an accuracy of 92.5% on twenty untrained objects selected from the Cornell grasping dataset, comparable to that of existing state-of-the-art grasping approaches, while using just 1.5 million architecture parameters. Finally, to reduce the time consumed by the data-annotation process, a self-rotation learning (SRL) network is proposed for the rotation-angle estimation task. The SRL uses a Siamese network to automatically rotate the input data and label the corresponding rotation angle for training purposes. A robotic-arm grasping model based on SRL achieves an accuracy of 88.5% when performing practical grasping tasks on an embedded system (NVIDIA Jetson TX2 developer kit).
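The location stage's key idea, i.e., discarding isolated "confused" similarity peaks by density-based clustering and then ranking the surviving regions by score intensity, can be illustrated with a minimal sketch. The snippet below is an illustrative approximation rather than the thesis's DITM implementation: it substitutes OpenCV's normalized cross-correlation for the learning-based pairwise similarity matrix, and the threshold, `eps`, and `min_samples` values are hypothetical.

```python
# Minimal sketch of density- and intensity-based peak filtering for
# template matching. Approximates the DITM idea from the abstract;
# all numeric parameters here are assumptions, not thesis values.
import cv2
import numpy as np
from sklearn.cluster import DBSCAN

def locate_object(search_img, template, score_thresh=0.7):
    # Similarity map via normalized cross-correlation (stand-in for the
    # learning-based pairwise similarity matrix).
    scores = cv2.matchTemplate(search_img, template, cv2.TM_CCOEFF_NORMED)

    # Candidate locations whose similarity exceeds the threshold.
    ys, xs = np.where(scores >= score_thresh)
    if len(xs) == 0:
        return None
    pts = np.stack([xs, ys], axis=1)

    # Density-based clustering: isolated high scores ("confused" matches)
    # fall out as DBSCAN noise (label -1) and are discarded.
    labels = DBSCAN(eps=5, min_samples=4).fit_predict(pts)
    keep = labels >= 0
    if not keep.any():
        return None
    pts, labels = pts[keep], labels[keep]

    # Intensity step: among surviving clusters, keep the one whose region
    # has the highest mean similarity score.
    best_mean, best_pt = -np.inf, None
    for c in np.unique(labels):
        cluster_pts = pts[labels == c]
        cluster_scores = scores[cluster_pts[:, 1], cluster_pts[:, 0]]
        if cluster_scores.mean() > best_mean:
            best_mean = cluster_scores.mean()
            best_pt = tuple(cluster_pts[np.argmax(cluster_scores)])
    return best_pt  # (x, y) of the matched template's top-left corner
```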
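The self-labeling mechanism behind SRL, where rotating the input yourself makes the applied angle a free training label, can likewise be sketched. This is a hedged reconstruction from the abstract's one-sentence description: the angle discretization, the dataset interface, and how the two branches feed the Siamese network are assumptions, not the thesis design.

```python
# Sketch of self-rotation labeling for training a rotation estimator.
# Reconstructed from the abstract; the thesis's actual angle set and
# Siamese pairing scheme may differ.
import random
import torch
from torch.utils.data import Dataset
import torchvision.transforms.functional as TF

class SelfRotationDataset(Dataset):
    """Yields (original, rotated, angle-bin) triples with no manual labels."""

    def __init__(self, images, num_bins=18):
        self.images = images      # list of (C, H, W) float tensors
        self.num_bins = num_bins  # e.g., 18 bins -> 10-degree steps

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx]
        # Draw a random rotation; the drawn bin index is the free label.
        bin_idx = random.randrange(self.num_bins)
        angle = bin_idx * (360.0 / self.num_bins)
        rotated = TF.rotate(img, angle)
        # A Siamese network can take (img, rotated) as its two branches and
        # be trained with cross-entropy to predict bin_idx, so annotation
        # of the rotation angle is fully automatic.
        return img, rotated, bin_idx
```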
On-campus full text: Not publicly available