| Graduate Student: | Chen, Bo-Lin (陳柏霖) |
|---|---|
| Thesis Title: | A Method for Identifying Persons in Video Based on Coordinate Differences of Human Skeleton Feature Points (基於人體骨架特徵點座標差之影片人物身分識別方法) |
| Advisor: | Wang, Tzone-I (王宗一) |
| Degree: | Master |
| Department: | Department of Engineering Science, College of Engineering |
| Year of Publication: | 2021 |
| Academic Year of Graduation: | 109 |
| Language: | Chinese |
| Number of Pages: | 43 |
| Keywords (Chinese): | 姿態辨識、身分識別、深度學習、特徵點座標差、類神經網路 |
| Keywords (English): | Action recognition, Identity recognition, Deep learning, Coordinate differences of feature points, Neural network |
Identifying which specific persons appear in a video is a major challenge in computer vision. Such technology can be applied to security systems, tracing missing persons, and even tracking suspects. If a deep learning network can be developed to replace identification work that traditionally requires extensive manpower, the task can be completed far more quickly. However, training a deep learning network generally requires a large amount of image data of the target person, and this constraint rules out many applications: when searching for a missing person, for example, often only a short clip of the target is available at the outset. How to train a deep learning network from such limited data is therefore a central challenge of this research. This study aims to determine whether a target person appears in single-view camera footage and, if so, to label that person. After a short clip of the target is obtained, the method uses the person's skeleton coordinates in the video as the basis for identification, taking the differences of the skeleton coordinates between consecutive frames as feature vectors. After manual labeling and preprocessing, these vectors serve as training data for a deep learning network; once the trained weights are obtained, large numbers of videos can be fed into the model to label the target person, reducing the time and manpower needed for manual review.
This study uses NBA highlight videos from YouTube as training data. The videos are first cut into multiple independent short clips; the detection network YOLOv3-SPP then locates the bounding boxes of all persons in each clip, and the boxes are passed to a Human Pose Estimation deep learning network to extract the joint keypoint coordinates of everyone in the scene. The pose tracking network PoseFlow then outputs every person in each frame and assigns each a person ID. Finally, the person IDs, bounding boxes, and skeleton joint keypoints are combined to construct the training data. A CNN identification model is trained on the joint-keypoint displacement vectors produced as persons move between frames, so that it can label the designated target person. With different training parameters, the model reaches up to 98% accuracy, with a validation-set accuracy of 94%. In addition, bounding-box misassignments that occur when persons cross or overlap in a video can be corrected: after detection, the model re-labels the same target person, demonstrating its feasibility in practical applications.
How to identify a specific person in a video is a major challenge in computer vision applications. The technology can be used in security systems, missing-person tracing, and even suspect tracking. Traditional identification work usually takes a lot of manpower; if a deep learning network is in place, it can do the job far more quickly. However, training a deep learning network to identify a targeted person normally requires collecting a large amount of image data of that person, a limitation that makes many applications infeasible. For example, when searching for a missing person, only a short video of the person may be available at first. How to use such limited data to train a deep learning network for target identification is therefore a major challenge of this research. The purpose of this research is to train a model that identifies a targeted person in video taken from a single-view camera. The process first obtains a short video of the targeted person, then extracts the person's human skeleton coordinates in the video frames, and finally takes the differences of the skeleton key points' coordinates between consecutive frames as the feature vectors. These feature vectors are manually labeled to form the training data of a deep learning network. After the model is trained, a large number of videos can be fed into it to identify the target person, which reduces the time and manpower of manually searching for or tracking the targeted person.
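The coordinate-difference features described above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation; the array shapes (T frames, J joints with (x, y) coordinates, as produced by a 2-D pose estimator) are assumptions for the example.

```python
import numpy as np

def coordinate_difference_features(keypoints):
    """Build feature vectors from frame-to-frame keypoint displacements.

    keypoints: array of shape (T, J, 2) -- J skeleton joints (x, y) over T frames.
    Returns an array of shape (T - 1, J * 2): the displacement of every joint
    between consecutive frames, flattened into one vector per frame pair.
    """
    diffs = keypoints[1:] - keypoints[:-1]    # (T-1, J, 2) per-joint movement
    return diffs.reshape(diffs.shape[0], -1)  # (T-1, J*2) feature vectors

# Toy example: 3 frames, 2 joints.
kp = np.array([
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.5, 0.0], [1.0, 2.0]],
    [[1.0, 0.5], [1.5, 2.0]],
])
feats = coordinate_difference_features(kp)
print(feats.shape)  # (2, 4)
```

Because the features are displacements rather than absolute positions, they capture how a person moves rather than where they stand in the frame, which is what allows the same model to be applied across different videos.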
This study collects highlight videos of NBA players from YouTube and uses them as training material. The videos are first cut into multiple short clips; the YOLOv3-SPP network then detects the bounding boxes of the persons appearing in them. The bounding boxes are passed to a Human Pose Estimation network to locate the key-point coordinates of all the persons in the videos, and the pose tracking network PoseFlow assigns ID numbers to every person in each frame. These person IDs, bounding boxes, and skeleton key points, together with the feature vectors, form the training data for an identity-recognition CNN model. With different training parameters, the identification accuracy of the CNN model reaches up to 98% on the testing set and up to 94% on the validation set. In addition, when the targeted person in a video is occluded by other persons, causing a wrong ID assignment, the person can usually be re-identified by the model in the subsequent frames. This confirms the feasibility of the model in practical applications.
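The identity-recognition model above applies a CNN to windows of displacement feature vectors. A minimal NumPy forward-pass sketch of that idea is given below; the window length, joint count (17 COCO-style joints), channel sizes, and kernel size are all illustrative assumptions and not the thesis's actual architecture or trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a window of 15 frame pairs, 17 joints -> 34 displacement
# features per pair; two classes (0: other person, 1: target person).
T_PAIRS, FEAT, N_CLASS = 15, 34, 2

def conv1d_forward(x, w, b):
    """Valid 1-D convolution over time with ReLU.

    x: (T, F) displacement sequence, w: (K, F, C) kernels, b: (C,) biases.
    Returns (T - K + 1, C) activations.
    """
    K = w.shape[0]
    out = np.stack([
        np.tensordot(x[t:t + K], w, axes=([0, 1], [0, 1])) + b
        for t in range(x.shape[0] - K + 1)
    ])
    return np.maximum(out, 0.0)

def classify(x, w, b, wo, bo):
    h = conv1d_forward(x, w, b)   # temporal conv over displacement features
    pooled = h.mean(axis=0)       # global average pooling over time
    logits = pooled @ wo + bo     # linear classifier head
    e = np.exp(logits - logits.max())
    return e / e.sum()            # softmax class probabilities

# Random weights stand in for trained ones in this sketch.
x = rng.normal(size=(T_PAIRS, FEAT))      # one window of displacement features
w = rng.normal(size=(3, FEAT, 8)) * 0.1   # kernel size 3, 8 channels
b = np.zeros(8)
wo = rng.normal(size=(8, N_CLASS)) * 0.1
bo = np.zeros(N_CLASS)
probs = classify(x, w, b, wo, bo)
print(probs.shape)  # (2,) -- probabilities over the two classes
```

In use, one such window would be built per tracked person ID, and the class-1 probability thresholded to decide whether that track is the targeted person.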