| Author: | Lin, Shou-Jen (林守仁) |
|---|---|
| Title: | The Study on Using Depth Image and Skeleton in Intra-class Variation Problem for Human Action Recognition |
| Advisor: | Yang, Chu-Sing (楊竹星) |
| Degree: | Doctor |
| Department: | Institute of Computer & Communication Engineering |
| Publication Year: | 2016 |
| Graduation Academic Year: | 104 (ROC calendar) |
| Language: | English |
| Pages: | 106 |
| Keywords: | Depth Image, Histogram of Oriented Gradient, Manifold Learning, Motion History Image, Dynamic Time Warping, Action Recognition |
Various recognition methodologies have been proposed to recognize and understand the wide variety of human actions. Intra-class variation in action recognition, such as different contexts of the same action, self-occlusion, viewpoint changes, low resolution, and cluttered backgrounds, is caused by various sources of uncertainty. Some forms of intra-class variation, however, remain difficult to resolve. For example, different motion durations of the same action, self-occlusion, and changing viewpoints are major challenges in action recognition. Previous action recognition methods either bypass these problems or handle them in a complex manner.
In this dissertation, we concentrate on two such problems: different motion durations of the same action and motion self-occlusion. These arise from the natural variability of human movement and from motions overlapping in the same region during complex actions. We propose three methods based on depth images to address these challenges.
In the first method, the target object is acquired from the depth image and completed using 3-dimensional connected-component labelling. The gradient of the depth object is then computed to build a histogram of oriented gradients (HOG) as the feature for the human action. Finally, continuous motions are matched against the database through their low-dimensional projections obtained with a manifold learning method.
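As an illustration only, not the dissertation's code, the first method's feature-extraction pipeline could be sketched in Python as follows; `scipy` and `scikit-image` are assumed, and the depth-bin count and HOG parameters are arbitrary placeholder values:

```python
import numpy as np
from scipy import ndimage
from skimage.feature import hog

def extract_depth_hog(depth, n_bins=32):
    """Segment the largest 3D-connected object in a depth frame and
    describe it with a HOG feature vector (sketch of the first method)."""
    valid = depth > 0                       # assume 0 marks a missing depth reading
    if not valid.any():
        return None

    # Quantize depth so that (row, col, depth-bin) voxels form a 3D volume.
    edges = np.linspace(depth[valid].min(), depth[valid].max(), n_bins)
    bins = np.digitize(depth, edges)        # 1..n_bins for valid pixels
    volume = np.zeros(depth.shape + (n_bins + 1,), dtype=bool)
    rows, cols = np.nonzero(valid)
    volume[rows, cols, bins[rows, cols]] = True

    # 3D connected-component labelling with 26-connectivity, keeping the
    # largest component as the (completed) target object.
    labels, n = ndimage.label(volume, structure=np.ones((3, 3, 3)))
    sizes = ndimage.sum(volume, labels, index=range(1, n + 1))
    mask = (labels == np.argmax(sizes) + 1).any(axis=2)

    # HOG over the masked depth image as the action feature.
    return hog(np.where(mask, depth, 0).astype(float),
               orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)
```

The matching stage would then embed these per-frame HOG vectors with a manifold learning method (for example `sklearn.manifold.LocallyLinearEmbedding`) and compare the resulting low-dimensional trajectories against the database.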
In the second method, we use an appearance-based template-matching paradigm, which is simpler and faster for action analysis. The target data, containing both depth and 2D information, are projected onto three orthogonal planes. Depth motion along the optical axis can be clearly described on these orthogonal planes and serves as the action feature. Based on changes in motion energy and the angular variation of motion orientation, a temporal segmentation method automatically divides a complex action into several simple movements. The 3D data are further used to acquire the motion history trajectories of the three viewpoints, so that the motion of a target is described through motion history images (MHIs) from three viewpoints. Weightings corresponding to the gradients of the MHIs are then applied to determine the viewpoint that best describes the motion of the target. For feature extraction, multi-resolution motion history histograms effectively reduce the computational load while achieving a high recognition rate. Experimental results show that the proposed method can effectively solve the self-occlusion problem.
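The projection and MHI steps can be sketched as follows; this is a minimal reading of the method, where the depth range `d_max`, bin count, and decay parameter `tau` are assumed values and the update follows the standard Bobick-Davis MHI recurrence rather than the dissertation's exact implementation:

```python
import numpy as np

def orthogonal_silhouettes(depth, n_bins=64, d_max=4000.0):
    """Project one depth frame (values assumed in millimetres) onto the
    front (rows x cols), side (rows x depth-bins), and top
    (depth-bins x cols) planes as binary silhouettes."""
    valid = depth > 0
    dbin = np.minimum((depth / d_max * n_bins).astype(int), n_bins - 1)
    h, w = depth.shape
    front = valid
    side = np.zeros((h, n_bins), dtype=bool)
    top = np.zeros((n_bins, w), dtype=bool)
    rows, cols = np.nonzero(valid)
    side[rows, dbin[rows, cols]] = True
    top[dbin[rows, cols], cols] = True
    return front, side, top

def update_mhi(mhi, prev_sil, sil, tau=30):
    """Bobick-Davis MHI recurrence: refresh moving pixels to tau, decay
    the rest by one; applied independently to each of the three views."""
    motion = sil ^ prev_sil                 # silhouette difference as motion mask
    return np.where(motion, tau, np.maximum(mhi - 1, 0))
```

Running `update_mhi` per view over the sequence yields the three MHIs; per-frame motion energy (e.g. `motion.sum()`) and the angle of the dominant motion direction would then drive the temporal segmentation, and the gradients of the finished MHIs supply the per-view weightings described above.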
In the third method, we use the depth image and the joints captured from the human skeleton. The 3D coordinate of each joint is converted into a joint orientation, and the feature vector is built from the time series of joint orientations, which is invariant to body size, instead of the raw 3D coordinates of the skeleton joints. Dynamic Time Warping (DTW) is then applied to match the resulting feature vectors. However, DTW is computationally expensive; to improve its efficiency, we introduce the accumulated angular variation of the skeleton joints to pre-classify actions into several groups and to refine the DTW weighting. Experiments show that the proposed method improves both computation speed and recognition rate.
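A minimal sketch of this matching step, assuming a Kinect-style joint hierarchy (the `PARENT` table and the grouping tolerance `tol` below are hypothetical placeholders) and a plain dynamic-programming DTW in place of the dissertation's weighted variant:

```python
import numpy as np

# Hypothetical parent index per joint for a 20-joint Kinect-style skeleton;
# replace with the hierarchy of the actual capture device.
PARENT = [0, 0, 1, 2, 2, 4, 5, 6, 2, 8, 9, 10, 0, 12, 13, 14, 0, 16, 17, 18]

def joint_orientations(skel):
    """Turn joint positions (T, J, 3) into unit parent-to-joint direction
    vectors, flattened per frame; directions are invariant to body size."""
    vec = skel - skel[:, PARENT, :]
    norm = np.linalg.norm(vec, axis=2, keepdims=True)
    vec = np.divide(vec, norm, out=np.zeros_like(vec), where=norm > 0)
    return vec.reshape(len(skel), -1)

def accumulated_angular_variation(feat):
    """Total frame-to-frame angular change of the orientation feature,
    used to pre-classify actions into groups before running DTW."""
    a, b = feat[1:], feat[:-1]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-9)
    return float(np.sum(np.arccos(np.clip(cos, -1.0, 1.0))))

def dtw(a, b):
    """Plain O(T1*T2) dynamic time warping with Euclidean local cost."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1]

def recognize(query, gallery, tol=2.0):
    """gallery: list of (label, orientation_sequence, aav) templates.
    DTW is only run against templates whose AAV is within tol of the query,
    which is the pre-classification idea; fall back to the full gallery."""
    aav = accumulated_angular_variation(query)
    pool = [g for g in gallery if abs(g[2] - aav) < tol] or gallery
    return min(pool, key=lambda g: dtw(query, g[1]))[0]
```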
Full text available on campus from 2021-07-01.