
Student: Chen, Bo-Lin (陳柏霖)
Thesis Title: Video Human Recognition Method Based on Coordinate Differences of Skeleton Feature Points (基於人體骨架特徵點座標差之影片人物身分識別方法)
Advisor: Wang, Tzone-I (王宗一)
Degree: Master
Department: College of Engineering - Department of Engineering Science
Year of Publication: 2021
Academic Year of Graduation: 109
Language: Chinese
Number of Pages: 43
Chinese Keywords (translated): pose recognition, identity recognition, deep learning, coordinate differences of feature points, neural network
English Keywords: Action recognition, Identity recognition, Deep learning, Coordinate differences of feature point, Neural network
Abstract (translated from the Chinese):
    Identifying which specific persons appear in a video is a major challenge in computer vision. The technology can be applied to security systems, missing-person searches, and even suspect tracking. If a deep learning network could take over this traditionally labor-intensive identification work, the task could be completed far more quickly. However, training a deep learning network normally requires collecting a large amount of image data of the target person, a constraint that rules out many applications: in a missing-person search, for example, often only a short clip of the target person is available at the outset. How to train a deep learning network with such limited data is therefore a major challenge of this study. This study aims to determine whether a target person appears in footage from a single-view camera and, if so, to label that person. The method is as follows: after obtaining a short clip of the target person, the person's skeleton coordinates in the video serve as the basis for identification, and the frame-to-frame differences of those skeleton coordinates serve as the feature vectors. After manual labeling in a preprocessing step, these vectors become the training data for a deep learning model. Once the model's weights have been trained, large numbers of videos can be fed to the model to label the target person, reducing the time and manpower required for manual review.
    This study uses NBA players' highlight videos from the YouTube platform as training data. The videos are first cut into multiple independent short clips; the YOLOv3-spp detection network then locates the bounding boxes of all persons in each clip, and these are passed to a Human Pose Estimation deep learning network to extract the joint keypoint coordinates of every person in the scene. The PoseFlow pose-tracking network then outputs every person in each frame with an assigned person ID. Finally, the person IDs, bounding boxes, and skeleton joint keypoints are combined to construct the training data. A CNN identity-recognition model is trained on the keypoint displacement vectors produced as persons move between frames, so that the designated target person can be labeled. With different training parameters, the model reaches up to 98% accuracy, with a validation-set accuracy of 94%. In addition, when persons in a video cross or overlap, the resulting misassigned bounding boxes can be corrected by the model, which re-labels them as the same target person, demonstrating the model's feasibility in practical applications.
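The frame-to-frame coordinate-difference features described above can be sketched as follows. This is a minimal illustration, not the thesis's own code: the function name, the array layout, and the two-joint toy data are all assumptions.

```python
import numpy as np

def coordinate_difference_features(keypoints):
    """Build displacement feature vectors from a sequence of skeleton
    keypoints for one tracked person.

    keypoints: array of shape (num_frames, num_joints, 2) holding the
    (x, y) coordinates of each joint in each frame.
    Returns an array of shape (num_frames - 1, num_joints * 2): the
    per-joint (dx, dy) displacements between consecutive frames,
    flattened into one feature vector per frame pair.
    """
    diffs = np.diff(keypoints, axis=0)        # (F-1, J, 2) frame-to-frame motion
    return diffs.reshape(diffs.shape[0], -1)  # flatten joints into one vector

# Toy example: 3 frames, 2 joints, each joint moving one pixel right per frame
kp = np.array([[[0.0, 0.0], [1.0, 1.0]],
               [[1.0, 0.0], [2.0, 1.0]],
               [[2.0, 0.0], [3.0, 1.0]]])
feats = coordinate_difference_features(kp)
print(feats.shape)  # (2, 4)
```

Because only displacements are kept, the features are invariant to where the person stands in the frame, which is presumably why the thesis uses coordinate differences rather than raw coordinates.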

How to identify a person's identity in a video is a major challenge in computer vision applications. The technology can be used in security systems, missing-person tracing, and even suspect tracking. Traditional identification work usually takes a lot of manpower; a deep learning network, if in place, should be able to do the job much more quickly. However, training a deep learning network to identify a target person normally requires collecting a large amount of image data of that person. Such a limitation makes many applications infeasible: when searching for a missing person, for example, only a short video of the person may be available in the first place. How to use such limited data to train a deep learning network for target identification is therefore a major challenge of this research. The purpose of this research is to train a model that identifies a target person in video taken from a single-view camera. The process is first to obtain a short video of the target person, then to extract the person's skeleton coordinates in the video frames, and lastly to take the differences of the skeleton keypoints' coordinates between consecutive frames as the feature vectors. These feature vectors are manually labeled as the training data of a deep learning network. After the model is trained, a huge number of videos can be fed to it to identify the target person, reducing the time and manpower of manually searching for or tracking the person.
    The study collects highlight videos of NBA players on YouTube and uses them as training material. The videos are first cut into multiple short clips, after which the YOLOv3-spp network detects the bounding boxes of the persons appearing in them. The bounding boxes are then passed to a Human Pose Estimation network to locate the keypoint coordinates of all persons in the videos, and the pose-tracking network PoseFlow assigns ID numbers to all persons in each frame. These IDs, bounding boxes, and skeleton keypoints, together with the feature vectors, form the training data for an identity-recognition CNN model. With different training parameters, the identification rate of the CNN model reaches up to 98% accuracy on the test set and up to 94% on the validation set. In addition, when the target person in a video is occluded by other persons, resulting in a wrong ID assignment, the person can usually be re-identified by the model in subsequent frames. This confirms the feasibility of the model in practical applications.
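The data-construction step described above, grouping tracked keypoints by person ID and cutting each track's displacement sequence into fixed-length training windows, might look like the sketch below. All names, the tuple format, and the window length are hypothetical; the thesis does not publish its code.

```python
import numpy as np
from collections import defaultdict

def build_training_samples(detections, target_ids, window=8):
    """Turn per-frame pose-tracking output into labeled CNN training data.

    detections: list of (frame_idx, track_id, keypoints) tuples, where
    keypoints is a (num_joints, 2) array from the pose estimator and
    track_id comes from a tracker such as PoseFlow.
    target_ids: set of track IDs manually labeled as the target person.
    Returns (X, y): X of shape (N, window, num_joints * 2) holding windows
    of frame-to-frame displacement vectors, y of 0/1 labels.
    """
    # Group keypoints per tracked person, in frame order.
    tracks = defaultdict(list)
    for frame_idx, track_id, kp in sorted(detections, key=lambda d: d[0]):
        tracks[track_id].append(kp)

    X, y = [], []
    for track_id, kps in tracks.items():
        kps = np.asarray(kps)                                    # (F, J, 2)
        diffs = np.diff(kps, axis=0).reshape(len(kps) - 1, -1)   # (F-1, J*2)
        # Slide a fixed-length window over the displacement sequence.
        for start in range(len(diffs) - window + 1):
            X.append(diffs[start:start + window])
            y.append(1 if track_id in target_ids else 0)
    return np.asarray(X), np.asarray(y)
```

Windows from tracks whose ID was manually marked as the target person become positive samples; all other tracks supply negatives, which matches the manual-labeling preprocessing the abstract describes.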

Table of Contents:
Abstract
Extended Abstract
Acknowledgements
Contents
List of Tables
List of Figures
Chapter 1 Introduction
  1.1 Research Background and Motivation
  1.2 Research Purpose
  1.3 Research Method
  1.4 Research Contributions
Chapter 2 Literature Review
  2.1 Human Skeleton Estimation Methods
    2.1.1 Bottom-up Methods
    2.1.2 Top-down Methods
  2.2 Action Recognition Methods
  2.3 Gait Description Methods
Chapter 3 System Design and Architecture
  3.1 Experiment Flowchart
  3.2 Self-built Dataset and Processing
  3.3 Keypoint Generation and Person Tracking
    3.3.1 Bounding-box Network (YOLOv3-spp)
    3.3.2 Boundary Refinement Network (Symmetric STN)
    3.3.3 Pose Estimation Network (SPPE branch)
    3.3.4 Pose Tracking Network (PoseFlow)
  3.4 Building Coordinate-Difference Vectors from Keypoint Coordinates
  3.5 Network Structure and Feature Extraction
    3.5.1 Convolutional Neural Network
    3.5.2 Loss Function
  3.6 Data Splitting
Chapter 4 Experiment Design and Results
  4.1 Data and Experiment Environment
    4.1.1 Dataset
    4.1.2 Data Splitting, Model Settings, and Experiment Environment
  4.2 Evaluation Tools
  4.3 Experiment Results and Analysis
  4.4 Discussion
Chapter 5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Future Work
References


Availability: On campus: immediately available / Off campus: immediately available