
Graduate Student: Ding, Jun-Zhe
Thesis Title: 3D Skeleton Reconstruction Using Deep Neural Network and Animated Dataset
Advisor: Su, Wen-Yu
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of Publication: 2021
Academic Year: 109
Language: English
Pages: 38
Keywords: Machine learning, Animated character action dataset generation, 3D skeleton reconstruction
Hits: 151; Downloads: 31

    3D character skeleton reconstruction is a common problem in computer vision, belonging to the category of pose estimation. Obtaining information about the human skeleton from pictures or images and reconstructing the 2D posture into a 3D human skeleton greatly helps computers understand human actions.
    Traditionally, most public datasets used for training deep networks to reconstruct 3D human skeletons from 2D images are captured with cameras, infrared sensors, and motion capture systems, and then annotated manually. This method requires attaching many sensors to the actors and consumes considerable time and labor. The limited categories of such motion data leave the datasets lacking diversity, which in turn can limit training. In this thesis, we use 3D animated characters as the training dataset for 3D human skeleton reconstruction.
    We extended the collected motion sequences to the skeletons of many different animated characters and embedded them in Unity3D. By manipulating the settings in Unity3D, 3D global coordinates can be obtained from the 3D animations; the corresponding 3D skeletons in these global coordinates form the desired dataset. Next, we apply data augmentation to generate many specific types of actions from the original dataset. Finally, we train a generative network on the animated dataset together with the Human3.6m dataset and use it to generate more actions of the same types.
    In addition, we compare the Human3.6m dataset with the dataset generated by the above scheme when training a 3D reconstruction model. Unlike related research that uses images as input to deep neural networks, we use estimated 2D skeletons as input, and the neural network converts 2D skeleton coordinates into 3D skeleton coordinates. Simulation results show that the dataset generated with the proposed augmentation method is also effective for training the neural networks.
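    The augmentation step described above can be sketched as follows; a minimal example, assuming each pose is stored as a (J, 3) array of global joint coordinates with y as the vertical axis (the joint count and axis convention are illustrative assumptions, not taken from the thesis):

    ```python
    import numpy as np

    def rotate_about_y(pose, angle_rad):
        """Rigidly rotate a (J, 3) skeleton about the vertical (y) axis.

        A rotation is an isometry, so bone lengths are preserved; one
        captured motion can then stand in for many viewing directions.
        """
        c, s = np.cos(angle_rad), np.sin(angle_rad)
        rot = np.array([[ c,  0.0, s ],
                        [0.0, 1.0, 0.0],
                        [-s,  0.0, c ]])
        return pose @ rot.T

    def mirror_x(pose):
        """Reflect the skeleton across the y-z plane (left/right swap).

        A full implementation would also permute left/right joint
        indices; that permutation depends on the skeleton layout.
        """
        out = pose.copy()
        out[:, 0] *= -1.0
        return out

    def augment(pose, n_rotations=8, rng=None):
        """Produce n_rotations rotated copies plus their mirrored versions."""
        if rng is None:
            rng = np.random.default_rng(0)
        angles = rng.uniform(0.0, 2.0 * np.pi, size=n_rotations)
        rotated = [rotate_about_y(pose, a) for a in angles]
        return rotated + [mirror_x(p) for p in rotated]
    ```

    Because rotations and reflections are isometries, every augmented pose keeps the bone lengths of the original, so the synthetic samples stay anatomically consistent.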
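    A minimal sketch of such a 2D-to-3D lifting step, in the spirit of the fully connected baseline of Martinez et al. [5]; the joint count, layer sizes, random weights, and the single NumPy forward pass are illustrative assumptions, not the thesis implementation:

    ```python
    import numpy as np

    J = 17         # number of joints (illustrative)
    HIDDEN = 1024  # hidden width used by the Martinez et al. baseline

    rng = np.random.default_rng(0)

    def init(n_in, n_out):
        # He-style initialisation suitable for ReLU layers
        return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

    W_in  = init(2 * J, HIDDEN)   # flattened 2D joints -> hidden
    W_mid = init(HIDDEN, HIDDEN)  # one residual layer (the baseline stacks more)
    W_out = init(HIDDEN, 3 * J)   # hidden -> flattened 3D joints

    def lift_2d_to_3d(pose_2d):
        """Map a flattened (2*J,) 2D pose to a (J, 3) 3D pose estimate.

        One residual block with ReLU activations; a trained network
        would also use batch normalisation, dropout, and learned weights.
        """
        h = np.maximum(pose_2d @ W_in, 0.0)  # input projection + ReLU
        h = h + np.maximum(h @ W_mid, 0.0)   # residual connection
        return (h @ W_out).reshape(J, 3)     # regress 3D coordinates

    pose_2d = rng.normal(size=2 * J)  # stand-in for an estimated 2D skeleton
    pose_3d = lift_2d_to_3d(pose_2d)
    ```

    Taking 2D joint coordinates rather than raw images as input keeps the network small and decouples the lifting step from whichever 2D pose estimator produced the skeleton.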

    Chinese Abstract
    Abstract
    Acknowledgements
    Contents
    List of Tables
    List of Figures
    1 Introduction
      1.1 Motivation
      1.2 Background
    2 Related Works
      2.1 Human Posture Skeleton Datasets
      2.2 2D Pose Estimation
      2.3 Deep-net-based 2D-to-3D Joints
      2.4 Variational Autoencoder
    3 Method
      3.1 Generative Neural Network for Animated Dataset
        3.1.1 Overview
        3.1.2 LSTM-VAE
        3.1.3 Loss Function Design
        3.1.4 Network Architecture
      3.2 Animated Dataset Collection
        3.2.1 Collection of the Specific Pose Motion from Mixamo®
        3.2.2 3D Animated Character in Unity3D®
        3.2.3 Data Preprocessing
        3.2.4 Data Augmentation
      3.3 Baseline Deep Learning Neural Network for 3D Human Pose Estimation
        3.3.1 Overview
        3.3.2 Network Architecture
    4 Experimental Results
      4.1 Dataset and Protocols
        4.1.1 Human3.6m Dataset
        4.1.2 Animated Dataset
        4.1.3 Protocols
      4.2 3D Human Pose Estimation Performance Comparison
      4.3 Comparison of the Performance on Generated Data
        4.3.1 Experiment
        4.3.2 Discussion
      4.4 A Comparison of the Human3.6m Dataset and the Animated Dataset in Our Work
    5 Conclusions and Future Work
      5.1 Conclusion
      5.2 Future Work
    References

    [1] Hanbyul Joo, Tomas Simon, and Yaser Sheikh. Total capture: A 3d deformation model for tracking faces, hands, and bodies. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8320–8329, 2018.
    [2] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence, 36(7):1325–1339, 2013.
    [3] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Openpose: realtime multi-person 2d pose estimation using part affinity fields. IEEE transactions on pattern analysis and machine intelligence, 43(1):172–186, 2019.
    [4] Akin Caliskan, Armin Mustafa, and Adrian Hilton. Temporal consistency loss for high resolution textured and clothed 3d human reconstruction from monocular video. arXiv preprint arXiv:2104.09259, 2021.
    [5] Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2640–2649, 2017.
    [6] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
    [7] Thi-Lan Le, Minh-Quoc Nguyen, et al. Human posture recognition using human skeleton provided by kinect. In 2013 international conference on computing, management and telecommunications (ComManTel), pages 340–345. IEEE, 2013.
    [8] Bernard Boulay, François Bremond, and Monique Thonnat. Human posture recognition in video sequence. In IEEE International Workshop on VS-PETS, Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2003.
    [9] Zequn Zhang, Yuanning Liu, Ao Li, and Minghui Wang. A novel method for user-defined human posture recognition using kinect. In 2014 7th International Congress on Image and Signal Processing, pages 736–740. IEEE, 2014.
    [10] Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. RMPE: Regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2334–2343, 2017.
    [11] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7297–7306, 2018.
    [12] Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter V Gehler, and Bernt Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4929–4937, 2016.
    [13] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
    [14] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
    [15] Ankur Agarwal and Bill Triggs. 3d human pose from silhouettes by relevance vector regression. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004., volume 2, pages II–II. IEEE, 2004.
    [16] Greg Mori and Jitendra Malik. Recovering 3d human body configurations using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(7):1052–1062, 2006.
    [17] Gregory Shakhnarovich, Paul Viola, and Trevor Darrell. Fast pose estimation with parameter-sensitive hashing. In Computer Vision, IEEE International Conference on, volume 3, pages 750–750. IEEE Computer Society, 2003.
    [18] Zerong Zheng, Tao Yu, Yixuan Wei, Qionghai Dai, and Yebin Liu. Deephuman: 3d human reconstruction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7739–7749, 2019.
    [19] Dushyant Mehta, Oleksandr Sotnychenko, Franziska Mueller, Weipeng Xu, Mohamed Elgharib, Pascal Fua, Hans-Peter Seidel, Helge Rhodin, Gerard Pons-Moll, and Christian Theobalt. Xnect: Real-time multi-person 3d motion capture with a single rgb camera. ACM Transactions on Graphics (TOG), 39(4):82–1, 2020.
    [20] Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. Pifuhd: Multilevel pixel-aligned implicit function for high-resolution 3d human digitization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 84–93, 2020.
    [21] Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7753–7762, 2019.
    [22] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
    [23] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
    [24] Xinchen Yan, Akash Rastogi, Ruben Villegas, Kalyan Sunkavalli, Eli Shechtman, Sunil Hadap, Ersin Yumer, and Honglak Lee. MT-VAE: Learning motion transformations to generate multimodal human dynamics. In Proceedings of the European Conference on Computer Vision (ECCV), pages 265–281, 2018.
    [25] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
    [26] Bugra Tekin, Artem Rozantsev, Vincent Lepetit, and Pascal Fua. Direct prediction of 3d body poses from motion compensated sequences. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 991–1000, 2016.
    [27] Feng Zhou and Yuanqing Lin. Fine-grained image classification by exploring bipartite-graph labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
    [28] Kyoungoh Lee, Inwoong Lee, and Sanghoon Lee. Propagating lstm: 3d pose estimation based on joint interdependency. In Proceedings of the European Conference on Computer Vision (ECCV), pages 119–135, 2018.
    [29] Bochen Li, Xinzhao Liu, Karthik Dinesh, Zhiyao Duan, and Gaurav Sharma. Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications. IEEE Transactions on Multimedia, 21(2):522–535, February 2019.
    [30] Haibo Qiu, Chunyu Wang, Jingdong Wang, Naiyan Wang, and Wenjun Zeng. Cross view fusion for 3d human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4342–4351, 2019.
    [31] Karim Iskakov, Egor Burkov, Victor Lempitsky, and Yury Malkov. Learnable triangulation of human pose. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

    Full-text access: On campus: available immediately; Off campus: available immediately