| Graduate Student: | 仰凱駿 Yang, Kai-Jun |
|---|---|
| Thesis Title: | 基於運動學姿勢特徵學習之人類動作品質評估技術 Kinematics-Based Pose Representation Learning for Human Action Quality Assessment |
| Advisor: | 莊坤達 Chuang, Kun-Ta |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering |
| Year of Publication: | 2022 |
| Academic Year of Graduation: | 110 |
| Language: | English |
| Number of Pages: | 36 |
| Keywords (Chinese): | 圖形表示法、人類動作品質評估、影片對齊、表徵學習、圖嵌入模型 |
| Keywords (English): | Graph representation, action quality assessment, video alignment, representation learning, graph embedding |
Human action quality assessment now has many applications and a growing demand. From skeletal motion, we can obtain kinematic features and the motion representations reflected in them. In addition, human action quality assessment can reduce the manpower and material resources spent on action evaluation and lower the subjectivity of the evaluation. Recently, due to the impact of COVID-19, outdoor exercise has carried a risk of infection, so the proportion of home exercise has increased significantly. Against this background, because home exercise is not yet well developed, online action quality assessment is lacking in many respects, which may cause people to overlook their action quality while exercising at home. Therefore, we propose a kinematics-based pose representation learning scenario to overcome the shortage of human assessors. This scenario also presents several challenges: (i) each pair of videos differs in length, (ii) the background noise differs between videos, and (iii) different people have different exercise habits. Accordingly, in our study we propose a method based on 3D skeleton data of human motion. The goal is to find a more discriminative pose feature representation, and we extend our kinematic pose graph on ST-GCN [1]. We assume that the graph representation is a crucial component of representation learning, and we conduct ablation experiments to demonstrate this. Our studies on real data show that our KST-GCN outperforms the compared models, which means that our method can successfully learn representations of human poses.
Action quality assessment is an essential requirement in many applications. A human action's characteristic representation in terms of skeleton and kinematics can be viewed as his/her motion parameters. The task of human action quality assessment can reduce the human and material resources spent on action evaluation and reduce its subjectivity. Due to the impact of COVID-19, outdoor exercise increases the risk of exposure to the virus, leading to a significantly increasing ratio of remote exercise. The lack of online action quality assessment may cause people not to notice their action quality during remote exercise. Therefore, we propose a kinematics-based pose representation learning framework to overcome the shortage of human assessors. There are also several challenges in this problem, namely (i) different lengths between pairs of videos, (ii) different background noise between pairs of videos, and (iii) different exercise habits between subjects. Therefore, in this thesis, we propose a kinematic graph representation based on 3D skeleton data of human motion. To find a more discriminative pose representation, we extend the kinematic pose graph on ST-GCN [1]. We assume that the graph representation is an important component of representation learning, and we conduct ablation experiments to verify this. Empirically, our experimental studies on real data show that our KST-GCN outperforms all baselines, which means that our method can learn discriminative representations successfully.
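As a point of reference, below is a minimal sketch of the spatial-temporal graph convolution building block that ST-GCN [1] is built on and that the proposed KST-GCN extends with a kinematic pose graph. The joint count, edge list, layer widths, and the single-partition adjacency are illustrative assumptions made for this example only; the thesis's actual kinematic graph construction and network configuration are not reproduced here. The sketch is written in PyTorch.

```python
# Sketch of one spatial-temporal graph convolution block over a human
# skeleton, in the spirit of ST-GCN [1]. The skeleton layout, edges, and
# layer sizes below are assumptions for illustration, not the thesis's
# exact KST-GCN configuration.
import torch
import torch.nn as nn

NUM_JOINTS = 17  # assumed 17-joint skeleton; the thesis uses 3D joint data

# Assumed skeletal edges (joint index pairs); a kinematic pose graph would
# extend or re-weight this adjacency with additional kinematic relations.
EDGES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6),
         (0, 7), (7, 8), (8, 9), (9, 10),
         (8, 11), (11, 12), (12, 13),
         (8, 14), (14, 15), (15, 16)]

def normalized_adjacency(num_joints, edges):
    """Build a symmetrically normalized adjacency matrix with self-loops."""
    A = torch.eye(num_joints)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d_inv_sqrt = A.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)

class STGraphConvBlock(nn.Module):
    """One spatial graph convolution followed by a temporal convolution."""
    def __init__(self, in_channels, out_channels, A, t_kernel=9):
        super().__init__()
        self.register_buffer("A", A)                    # (V, V) joint graph
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(t_kernel, 1),
                                  padding=(t_kernel // 2, 0))
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch, channels, frames, joints)
        x = self.spatial(x)                               # per-joint transform
        x = torch.einsum("nctv,vw->nctw", x, self.A)      # aggregate neighbors
        x = self.temporal(x)                              # mix along time
        return self.relu(x)

if __name__ == "__main__":
    A = normalized_adjacency(NUM_JOINTS, EDGES)
    block = STGraphConvBlock(in_channels=3, out_channels=64, A=A)
    clip = torch.randn(2, 3, 100, NUM_JOINTS)   # 3D coordinates per joint
    print(block(clip).shape)                    # torch.Size([2, 64, 100, 17])
```

A KST-GCN-style extension would typically differ in how `EDGES` and the adjacency weights are defined, encoding kinematic relations between joints rather than only the plain skeletal connectivity shown here.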
[1] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in Thirty-second AAAI conference on artificial intelligence, 2018.
[2] A. Langer, L. Gassner, A. Flotz, S. Hasenauer, J. Gruber, L. Wizany, R. Pokan, W. Maetzler, and H. Zach, “How covid-19 will boost remote exercise-based treatment in parkinson’s disease: a narrative review,” npj Parkinson’s Disease, vol. 7, no. 1, pp. 1–9, 2021.
[3] R. F. Escamilla, G. S. Fleisig, N. Zheng, J. E. Lander, S. W. Barrentine, J. R. Andrews, B. W. Bergemann, and C. T. Moorman, “Effects of technique variations on knee biomechanics during the squat and leg press,” Medicine and science in sports and exercise, vol. 33, no. 9, pp. 1552–1566, 2001.
[4] J. Liu, M. Shi, Q. Chen, H. Fu, and C.-L. Tai, “Normalized human pose features for human action video alignment,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11521–11531.
[5] H. H. Pham, H. Salmane, L. Khoudour, A. Crouzil, P. Zegers, and S. A. Velastin, “Spatio-temporal image representation of 3d skeletal movements for view-invariant action recognition with deep convolutional neural networks,” Sensors, vol. 19, no. 8, p. 1932, 2019.
[6] K. Cao, J. Ji, Z. Cao, C.-Y. Chang, and J. C. Niebles, “Few-shot video classification via temporal alignment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10618–10627.
[7] B. Fernando, S. Shirazi, and S. Gould, “Unsupervised human action detection by action matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 1–9.
[8] A. Elkholy, M. E. Hussein, W. Gomaa, D. Damen, and E. Saba, “Efficient and robust skeleton-based quality assessment and abnormality detection in human action performance,” IEEE journal of biomedical and health informatics, vol. 24, no. 1, pp. 280–291, 2019.
[9] A. Zia, Y. Sharma, V. Bettadapura, E. L. Sarin, M. A. Clements, and I. Essa, “Automated assessment of surgical skills using frequency analysis,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 430–438.
[10] A. Elkholy, M. E. Hussein, W. Gomaa, D. Damen, and E. Saba, “Efficient and robust skeleton-based quality assessment and abnormality detection in human action performance,” IEEE journal of biomedical and health informatics, vol. 24, no. 1, pp. 280–291, 2019.
[11] P. Parmar and B. T. Morris, “Measuring the quality of exercises,” in 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2016, pp. 2241–2244.
[12] P. Parmar and B. Tran Morris, “Learning to score olympic events,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2017, pp. 20–28.
[13] Z. Li, Y. Huang, M. Cai, and Y. Sato, “Manipulation-skill assessment from videos with spatial attention network,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
[14] Y. Li, X. Chai, and X. Chen, “End-to-end learning for action quality assessment,” in Pacific Rim Conference on Multimedia. Springer, 2018, pp. 125–134.
[15] X. Xiang, Y. Tian, A. Reiter, G. D. Hager, and T. D. Tran, “S3d: Stacking segmental p3d for action quality assessment,” in 2018 25th IEEE International conference on image processing (ICIP). IEEE, 2018, pp. 928–932.
[16] B. Fernando, S. Shirazi, and S. Gould, “Unsupervised human action detection by action matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 1–9.
[17] M. Kocabas, N. Athanasiou, and M. J. Black, “Vibe: Video inference for human body pose and shape estimation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 5253–5263.
[18] P. Parmar and B. T. Morris, “What and how well you performed? a multitask learning approach to action quality assessment,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 304–313, 2019.
[19] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “Ntu rgb+d: A large scale dataset for 3d human activity analysis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1010–1019.
[20] Y. Ji, F. Xu, Y. Yang, F. Shen, H. T. Shen, and W.-S. Zheng, “A large-scale rgb-d database for arbitrary-view human action recognition,” in Proceedings of the 26th ACM international Conference on Multimedia, 2018, pp. 1510–1518.
[21] Z. Zhang, “Microsoft kinect sensor and its effect,” IEEE Multim., vol. 19, pp. 4–10, 2012.
[22] G. Rogez, P. Weinzaepfel, and C. Schmid, “Lcr-net++: Multi-person 2d and 3d pose detection in natural images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, pp. 1146–1161, 2020.
[23] S. Lu, H.-J. Ye, and D.-C. Zhan, “Few-shot action recognition with compromised metric via optimal transport,” ArXiv, vol. abs/2104.03737, 2021.
[24] Y.-S. Lee, C.-S. Ho, Y. Shih, S.-Y. Chang, F. J. Róbert, and T.-Y. Shiang, “Assessment of walking, running, and jumping movement features by using the inertial measurement unit,” Gait & posture, vol. 41, no. 4, pp. 877–881, 2015.
[25] S. Patel, H. Park, P. Bonato, L. Chan, and M. Rodgers, “A review of wearable sensors and systems with application in rehabilitation,” Journal of neuroengineering and rehabilitation, vol. 9, no. 1, pp. 1–17, 2012.
[26] A. Ejupi, M. Brodie, S. R. Lord, J. Annegarn, S. J. Redmond, and K. Delbaere, “Wavelet-based sit-to-stand detection and assessment of fall risk in older people using a wearable pendant device,” IEEE Transactions on Biomedical Engineering, vol. 64, no. 7, pp. 1602–1607, 2016.
[27] P. Pierleoni, A. Belli, L. Palma, M. Pellegrini, L. Pernini, and S. Valenti, “A high reliability wearable device for elderly fall detection,” IEEE Sensors Journal, vol. 15, no. 8, pp. 4544–4553, 2015.
[28] L. Tong, Q. Song, Y. Ge, and M. Liu, “Hmm-based human fall detection and prediction method using tri-axial accelerometer,” IEEE Sensors Journal, vol. 13, pp. 1849–1856, 2013.
[29] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “Ntu rgb+d: A large scale dataset for 3d human activity analysis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1010–1019.
[30] J. Park, S. Cho, D. Kim, O. Bailo, H. Park, S. Hong, and J. Park, “A body part embedding model with datasets for measuring 2d human motion similarity,” IEEE Access, vol. 9, pp. 36547–36558, 2021.
[31] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks for human action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 221–231, 2013.
[32] Z. Liu, H. Zhang, Z. Chen, Z. Wang, and W. Ouyang, “Disentangling and unifying graph convolutions for skeleton-based action recognition,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 140–149, 2020.
[33] C. S. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. Hager, “Temporal convolutional networks for action segmentation and detection,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1003–1012, 2017.
[34] P. Ghosh, Y. Yao, L. S. Davis, and A. Divakaran, “Stacked spatio-temporal graph convolutional networks for action segmentation,” 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 565–574, 2020.
[35] A. Aristidou, D. Cohen-Or, J. K. Hodgins, Y. Chrysanthou, and A. Shamir, “Deep motifs and motion signatures,” ACM Transactions on Graphics (TOG), vol. 37, no. 6, pp. 1–13, 2018.
[36] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph cnn for learning on point clouds,” ACM Transactions on Graphics (TOG), vol. 38, pp. 1–12, 2019.
[37] K. Jun, D.-W. Lee, K. Lee, S. Lee, and M. S. Kim, “Feature extraction using an rnn autoencoder for skeleton-based abnormal gait recognition,” IEEE Access, vol. 8, pp. 19196–19207, 2020.
[38] L. Zhao, X. Peng, Y. Tian, M. Kapadia, and D. N. Metaxas, “Semantic graph convolutional networks for 3d human pose regression,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3425–3435.
[39] Ö. Sümer, T. Dencker, and B. Ommer, “Self-supervised learning of pose embeddings from spatiotemporal relations in videos,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4298–4307.
[40] G. Mori, C. Pantofaru, N. Kothari, T. Leung, G. Toderici, A. Toshev, and W. Yang, “Pose embeddings: A deep architecture for learning to match human poses,” arXiv preprint arXiv:1507.00302, 2015.
[41] J. J. Sun, J. Zhao, L.-C. Chen, F. Schroff, H. Adam, and T. Liu, “View-invariant probabilistic embedding for human pose,” in European Conference on Computer Vision. Springer, 2020, pp. 53–70.
[42] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
[43] E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in International workshop on similarity-based pattern recognition. Springer, 2015, pp. 84–92.
[44] J. Ni, J. Liu, C. Zhang, D. Ye, and Z. Ma, “Fine-grained patient similarity measuring using deep metric learning,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017, pp. 1189–1198.
[45] K. Sohn, “Improved deep metric learning with multi-class n-pair loss objective,” Advances in neural information processing systems, vol. 29, 2016.
[46] C. Chang, D.-A. Huang, Y. Sui, L. Fei-Fei, and J. C. Niebles, “D3tw: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3541–3550, 2019.
[47] A. Ismail, S. Abdlerazek, and I. M. El-Henawy, “Development of smart healthcare system based on speech recognition using support vector machine and dynamic time warping,” Sustainability, 2020.
[48] H. Mohammadzade, S. Hosseini, M. R. Rezaei-Dastjerdehei, and M. Tabejamaat, “Dynamic time warping-based features with class-specific joint importance maps for action recognition using kinect depth sensor,” IEEE Sensors Journal, vol. 21, pp. 9300–9313, 2021.
[49] M. Moor, M. Horn, B. A. Rieck, D. Roqueiro, and K. M. Borgwardt, “Early recognition of sepsis with gaussian process temporal convolutional networks and dynamic time warping,” in MLHC, 2019.
[50] M. Kaya and H. Ş. Bilge, “Deep metric learning: A survey,” Symmetry, vol. 11, no. 9, p. 1066, 2019.
[51] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
[52] Y. Ji, F. Xu, Y. Yang, N. Xie, H. T. Shen, and T. Harada, “Attention transfer (ant) network for view-invariant action recognition,” in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 574–582.
[53] Y. Ji, Y. Yang, F. Shen, H. T. Shen, and W.-S. Zheng, “Arbitrary-view human action recognition: A varying-view rgb-d action dataset,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 1, pp. 289–300, 2020.
[54] L. Gao, Y. Ji, G. A. Kumie, X. Xu, X. Zhu, and H. T. Shen, “View-invariant human action recognition via view transformation network,” IEEE Transactions on Multimedia, 2021.
[55] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 652–660.