
Graduate student: Liou, Jhih-Siang (劉至翔)
Thesis title: Monocular 3D Human Pose Estimation Networks for Driving VR Humanoid Avatar (驅動虛擬人體模型之單視角三維人體姿態估測網路)
Advisor: Yang, Jar-Ferr (楊家輝)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Year of publication: 2020
Academic year of graduation: 108
Language: English
Number of pages: 56
Keywords: Convolutional neural networks, 2D pose estimation, 2D to 3D pose transformation, 3D humanoid avatar
  • With the development of artificial intelligence, a growing number of complex problems can be solved with deep convolutional neural networks (CNNs). Among these problems, pose estimation is a popular topic for linking people to the virtual world. The field divides into 2D pose estimation and 3D pose estimation. 2D pose estimation can be used for applications such as gesture recognition, action recognition, and facial expression recognition. 3D skeleton information can be used to analyze the motion of a target person, or can be linked to a 3D humanoid model to connect the real and virtual worlds. With virtual reality developing rapidly, 3D pose estimation has the potential to replace expensive motion capture systems. In this thesis, we propose a complete CNN approach that cascades a 2D pose estimation network with a 2D-to-3D pose transformation network and uses the resulting 3D skeletons to drive a 3D humanoid model. First, we feed a monocular frame into the 2D pose network to generate one 2D skeleton. Then we feed nine consecutive 2D skeletons into the 2D-to-3D pose transformation network to generate one 3D skeleton. Finally, we feed the 3D skeletons generated for every frame of the video into the driving application so that the avatar imitates the motion of the person in the video. The experimental results show that the 2D pose estimation runs at an acceptable speed, and that the 2D-to-3D pose transformation network uses temporal information to smooth the predictions of the 2D pose network.
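The step that groups nine consecutive 2D skeletons into one input for the 2D-to-3D network can be pictured with a small sketch. This is only an illustration under stated assumptions, not the thesis implementation: the function name `sliding_windows`, the centered window, and the padding by repeating the first and last skeleton are all hypothetical choices, since the abstract specifies only the window length of nine.

```python
from typing import List, Sequence, Tuple

Joint2D = Tuple[float, float]   # (x, y) position of one joint
Skeleton2D = List[Joint2D]      # one 2D skeleton per video frame


def sliding_windows(frames: Sequence[Skeleton2D], size: int = 9) -> List[List[Skeleton2D]]:
    """Return one window of `size` consecutive skeletons per input frame.

    Assumption (not stated in the abstract): the window is centered on the
    current frame, and the sequence is padded at both ends by repeating the
    first and last skeleton.
    """
    if size < 1 or size % 2 == 0:
        raise ValueError("size must be a positive odd number")
    if not frames:
        return []
    half = size // 2
    padded = [frames[0]] * half + list(frames) + [frames[-1]] * half
    return [padded[i:i + size] for i in range(len(frames))]


# Toy input: 12 frames, each skeleton reduced to a single joint at (t, t).
video = [[(float(t), float(t))] for t in range(12)]
windows = sliding_windows(video)
```

Each entry of `windows` would then be passed to the 2D-to-3D network to produce the 3D skeleton for the corresponding frame. A causal window that looks only at past frames would be an equally plausible reading of the abstract.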

    Abstract (Chinese)
    Abstract
    Acknowledgements
    Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
      1.1 Research Background & Motivation
      1.2 Literature Review
        1.2.1 2D Human Pose Estimation
        1.2.2 3D Human Pose Estimation
      1.3 Thesis Organization
    Chapter 2 Related Work
      2.1 Convolutional Neural Network
        2.1.1 Convolutional Layers
        2.1.2 Pooling Layers
        2.1.3 Activation Function
        2.1.4 Fully Connected Layers (Dense Layer)
      2.2 Very Deep Convolutional Networks for Large-Scale Image Recognition (VGG)
      2.3 Heterogeneous Kernel-Based Convolutions (HetConv)
      2.4 Temporal Convolution Network (TCN)
    Chapter 3 The Proposed 2D and 3D Pose Estimation Networks
      3.1 2D Pose Estimation Network
        3.1.1 Data Preparation for 2D Pose Estimator
        3.1.2 Proposed 2D Pose Network Structure
      3.2 3D Pose Prediction Network
        3.2.1 Data Preparation for 3D Pose Predictor
        3.2.2 Proposed 3D Pose Network Structure
      3.3 Loss Functions
      3.4 3D Humanoid Model Matching Application
    Chapter 4 Experimental Results
      4.1 Environmental Settings and Datasets
      4.2 Results of the Proposed 2D Pose Estimation Network
      4.3 Results of the Proposed 2D to 3D Pose Prediction Network
      4.4 Results of the 3D Humanoid Model Matching Application
    Chapter 5 Conclusions
    Chapter 6 Future Work
    References


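    On campus: available from 2025-07-20
    Off campus: not available
    The electronic thesis has not been authorized for public release; for the print copy, please check the library catalog.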