
Graduate Student: Lin, Li-Yu (林沴育)
Thesis Title: 3D Human Motion Interpolation and Denoising with BiLSTM VAE and Animated Dataset (使用BiLSTM VAE與動畫資料集對3D人體動作進行插值與去雜訊)
Advisor: Su, Wen-Yu (蘇文鈺)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2022
Graduation Academic Year: 110
Language: English
Number of Pages: 27
Keywords: Machine Learning, Human Motion Interpolation, Human Motion Denoising, Variational Auto-Encoder
    Traditionally, to produce a 3D character animation, an animator first draws keyframes and then connects them into a complete animation through in-betweening, so even a short animation segment costs considerable time and labor. A common way to join two different animations is linear interpolation; although convenient, it makes the transition between the two motions rather monotonous. We aim to use a deep neural network to generate transition clips between two arbitrary motions, so that new animations can be obtained at a lower cost by stitching together many short clips.
    Building 3D human skeleton datasets requires substantial equipment and labor, so the amount and variety of data in such datasets are limited, and the motions a model can learn are not diverse enough. In this thesis, we use Unity3D to obtain the global coordinates of characters in a 3D scene, then augment the data by adjusting angles and applying linear interpolation, and use the result as our training dataset. We use a BiLSTM VAE as the model for learning interpolation clips, and we divide the human skeleton into five parts so that the generated in-between motions allow more possible combinations.
    Finally, we validate the effectiveness of our method on public human motion datasets. The experiments show that on the Human3.6M dataset our model can interpolate given motions and produce diverse transition clips that differ from linear interpolation. The generated results outperform other generative models in terms of MAE and MPJPE. Moreover, because the skeleton is split into five parts, each part can be given its own input, yielding new motions with different styles.
    In addition, the human motions estimated by existing 3D human pose estimation techniques such as OpenPose and MediaPipe are prone to irregular jitter or to estimation errors caused by occlusion. By removing the erroneous segments and regenerating them with our method, we can obtain smoother skeleton motions without obvious anomalies. Since the model for each body part is independent, a specific part can be denoised without affecting the other parts.

    Traditionally, to create a 3D character animation, animators draw keyframes and then fill in the frames between two keyframes to obtain a complete animation, so each animation clip requires a great deal of time and labor. A common way to connect two different animations is linear interpolation; this method is convenient, but the connecting clip between the two animations tends to be monotonous. We aim to produce a transition clip between two arbitrary actions using a deep neural network, so that new animations can be obtained at a lower cost by stitching together multiple animation clips.
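    For concreteness, the linear-interpolation baseline mentioned above can be sketched in a few lines of NumPy; the function name and array shapes are illustrative assumptions, not taken from the thesis:

        import numpy as np

        def lerp_transition(pose_a, pose_b, num_frames):
            # pose_a, pose_b: (num_joints, 3) 3D joint positions of the two
            # keyframes. Returns (num_frames, num_joints, 3), blending
            # linearly from pose_a to pose_b.
            t = np.linspace(0.0, 1.0, num_frames)[:, None, None]
            return (1.0 - t) * pose_a + t * pose_b

    Every in-between frame is a fixed convex combination of the two keyframes, which is why such transitions look monotonous.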
    Creating 3D human skeleton datasets requires a lot of equipment and labor, which limits the amount and variety of the data and results in a lack of diversity in the actions a model can learn. In this thesis, we obtain the global coordinates of characters in a 3D scene with Unity3D, augment the data by adjusting angles and applying linear interpolation, and use the result as our training dataset. We use a BiLSTM VAE model to learn the interpolation clips, and we divide the human skeleton into five parts to allow more possible combinations of interpolated actions.
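    The abstract does not spell out the architecture, so the following PyTorch sketch only illustrates the general shape of a BiLSTM VAE for a single body part; the layer sizes, names, and mean-pooling choice are all assumptions:

        import torch
        import torch.nn as nn

        class BiLSTMVAE(nn.Module):
            # Hypothetical sizes: joint_dim = 15 could be five joints x 3 coords.
            def __init__(self, joint_dim=15, hidden=128, latent=32):
                super().__init__()
                # Bidirectional LSTM encoder reads the whole motion clip.
                self.encoder = nn.LSTM(joint_dim, hidden, batch_first=True,
                                       bidirectional=True)
                self.to_mu = nn.Linear(2 * hidden, latent)
                self.to_logvar = nn.Linear(2 * hidden, latent)
                # Decoder maps the latent code back to a pose sequence.
                self.decoder = nn.LSTM(latent, hidden, batch_first=True)
                self.out = nn.Linear(hidden, joint_dim)

            def forward(self, x):              # x: (batch, frames, joint_dim)
                h, _ = self.encoder(x)
                summary = h.mean(dim=1)        # pool encoder states over time
                mu, logvar = self.to_mu(summary), self.to_logvar(summary)
                # Reparameterization trick: sample z from N(mu, sigma^2).
                z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
                z_seq = z.unsqueeze(1).repeat(1, x.size(1), 1)
                d, _ = self.decoder(z_seq)
                return self.out(d), mu, logvar

    Under the five-part design, one such model per body part would be trained independently.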
    Finally, we validate the effectiveness of our method using public human motion datasets. The experiments demonstrate that our model produces interpolation clips for input motions, creating new actions that differ from plain linear interpolation. The results achieve better MAE and MPJPE than other generative models on Human3.6M. Furthermore, since the human pose is divided into five parts, we can give a different input to each body part, resulting in new actions with diverse styles.
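    MAE and MPJPE are standard error metrics; a minimal sketch of how they are typically computed over 3D joint sequences (the array shapes are assumptions):

        import numpy as np

        def mae(pred, target):
            # Mean absolute error over all coordinates.
            # pred, target: (frames, joints, 3).
            return np.mean(np.abs(pred - target))

        def mpjpe(pred, target):
            # Mean per-joint position error: average Euclidean distance
            # between predicted and ground-truth joint positions.
            return np.mean(np.linalg.norm(pred - target, axis=-1))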
    In addition, human motions obtained with 3D human pose estimation technologies such as OpenPose and MediaPipe usually exhibit irregular jitter or estimation errors due to occlusion. By removing the outliers and regenerating them with our method, we obtain human motions that are smoother and free of obvious abnormalities. Since the model for each body part is independent, we can denoise specific parts without affecting the others.
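    As a rough illustration of the removal step, jittery frames can be flagged with a simple per-frame displacement threshold before the model regenerates the gap; the threshold value and array shapes below are hypothetical, not the thesis's actual criterion:

        import numpy as np

        def find_jitter_frames(motion, thresh=0.1):
            # motion: (frames, joints, 3). Flags frames whose largest
            # per-joint displacement from the previous frame exceeds thresh.
            step = np.linalg.norm(np.diff(motion, axis=0), axis=-1)
            bad = step.max(axis=1) > thresh
            return np.flatnonzero(np.concatenate(([False], bad)))

    The flagged frames would then be dropped and re-synthesized by the interpolation model for the affected body part only.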

    Chinese Abstract i
    Abstract iii
    Acknowledgements v
    Contents vi
    List of Tables viii
    List of Figures ix
    1 Introduction 1
    2 Related Works 4
    2.1 Human Motion Skeleton Datasets 4
    2.2 Conditioned Motion Generation 5
    2.3 Human Motion Interpolation 6
    2.4 3D Pose Estimation 7
    3 Method 8
    3.1 Network Architecture 8
    3.2 Loss Functions 10
    3.3 Animated Dataset 11
    3.4 Data Preprocessing 13
    4 Experiment Results 15
    4.1 Datasets 15
    4.1.1 Human3.6M 15
    4.1.2 Animated Dataset 15
    4.2 Evaluation Metrics 16
    4.3 Results 17
    4.3.1 Results of Interpolation 17
    4.3.2 Results of Denoising 19
    5 Conclusions and Future Works 21
    5.1 Conclusion 21
    5.2 Future Works 21
    References 22

    [1] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence, 36(7):1325–1339, 2013.
    [2] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172, 2019.
    [3] Brian F Allen and Petros Faloutsos. Evolved controllers for simulated locomotion. In International Workshop on Motion in Games, pages 219–230. Springer, 2009.
    [4] Lucas Kovar, Michael Gleicher, and Frédéric Pighin. Motion graphs. In ACM SIGGRAPH 2008 classes, pages 1–10. 2008.
    [5] Edilson De Aguiar, Carsten Stoll, Christian Theobalt, Naveed Ahmed, Hans-Peter Seidel, and Sebastian Thrun. Performance capture from sparse multi-view video. In ACM SIGGRAPH 2008 papers, pages 1–10. 2008.
    [6] Janzaib Masood, Abdul Samad, Zulkafil Abbas, and Latif Khan. Evolution of locomotion controllers for snake robots. In 2016 2nd International Conference on Robotics and Artificial Intelligence (ICRAI), pages 164–169, 2016.
    [7] Keith Grochow, Steven L Martin, Aaron Hertzmann, and Zoran Popović. Style-based inverse kinematics. In ACM SIGGRAPH 2004 Papers, pages 522–531. 2004.
    [8] Gengdai Liu, Zhigeng Pan, and Ling Li. Motion synthesis using style-editable inverse kinematics. In International Workshop on Intelligent Virtual Agents, pages 118–124. Springer, 2009.
    [9] Jehee Lee, Jinxiang Chai, Paul SA Reitsma, Jessica K Hodgins, and Nancy S Pollard. Interactive control of avatars animated with human motion data. In Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pages 491–500, 2002.
    [10] Jack M Wang, David J Fleet, and Aaron Hertzmann. Gaussian process dynamical models for human motion. IEEE transactions on pattern analysis and machine intelligence, 30(2):283–298, 2007.
    [11] Chuan Guo, Xinxin Zuo, Sen Wang, Shihao Zou, Qingyao Sun, Annan Deng, Minglun Gong, and Li Cheng. Action2Motion: Conditioned Generation of 3D Human Motions, pages 2021–2029. Association for Computing Machinery, 2020.
    [12] Dario Pavllo, David Grangier, and Michael Auli. Quaternet: A quaternion-based recurrent model for human motion. arXiv preprint arXiv:1805.06485, 2018.
    [13] Hyemin Ahn, Timothy Ha, Yunho Choi, Hwiyeon Yoo, and Songhwai Oh. Text2action: Generative adversarial synthesis from language to action. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 5915–5920. IEEE, 2018.
    [14] Chaitanya Ahuja and Louis-Philippe Morency. Language2pose: Natural language grounded pose forecasting. In 2019 International Conference on 3D Vision (3DV), pages 719–728. IEEE, 2019.
    [15] Hsin-Ying Lee, Xiaodong Yang, Ming-Yu Liu, Ting-Chun Wang, Yu-Ding Lu, Ming-Hsuan Yang, and Jan Kautz. Dancing to music. Advances in Neural Information Processing Systems, 32, 2019.
    [16] Yongyi Tang, Lin Ma, Wei Liu, and Weishi Zheng. Long-term human motion prediction by modeling motion context and enhancing motion dynamic. arXiv preprint arXiv:1805.02513, 2018.
    [17] Alejandro Hernandez, Jurgen Gall, and Francesc Moreno-Noguer. Human motion prediction via spatio-temporal inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7134–7143, 2019.
    [18] Félix G Harvey and Christopher Pal. Recurrent transition networks for character locomotion. In SIGGRAPH Asia 2018 Technical Briefs, pages 1–4. 2018.
    [19] Manuel Kaufmann, Emre Aksan, Jie Song, Fabrizio Pece, Remo Ziegler, and Otmar Hilliges. Convolutional autoencoders for human motion infilling. In 2020 International Conference on 3D Vision (3DV), pages 918–927. IEEE, 2020.
    [20] Yinglin Duan, Tianyang Shi, Zhengxia Zou, Yenan Lin, Zhehui Qian, Bohan Zhang, and Yi Yuan. Single-shot motion completion with transformer. arXiv preprint arXiv:2103.00776, 2021.
    [21] Xinchen Yan, Akash Rastogi, Ruben Villegas, Kalyan Sunkavalli, Eli Shechtman, Sunil Hadap, Ersin Yumer, and Honglak Lee. Mt-vae: Learning motion transformations to generate multimodal human dynamics. In Proceedings of the European conference on computer vision (ECCV), pages 265–281, 2018.
    [22] Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. Recurrent network models for human dynamics. In Proceedings of the IEEE international conference on computer vision, pages 4346–4354, 2015.
    [23] Partha Ghosh, Jie Song, Emre Aksan, and Otmar Hilliges. Learning human motion models for long-term predictions. In 2017 International Conference on 3D Vision (3DV), pages 458–466. IEEE, 2017.
    [24] Julieta Martinez, Michael J Black, and Javier Romero. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2891–2900, 2017.
    [25] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27, 2014.
    [26] Wei Mao, Miaomiao Liu, Mathieu Salzmann, and Hongdong Li. Learning trajectory dependencies for human motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9489–9497, 2019.
    [27] Chi Zhou, Zhangjiong Lai, Suzhen Wang, Lincheng Li, Xiaohan Sun, and Yu Ding. Learning a deep motion interpolation network for human skeleton animations. Computer Animation and Virtual Worlds, 32(3-4):e2003, 2021.
    [28] Jiaman Li, Ruben Villegas, Duygu Ceylan, Jimei Yang, Zhengfei Kuang, Hao Li, and Yajie Zhao. Task-generic hierarchical human motion prior using vaes. In 2021 International Conference on 3D Vision (3DV), pages 771–781. IEEE, 2021.
    [29] Yujun Cai, Yiwei Wang, Yiheng Zhu, Tat-Jen Cham, Jianfei Cai, Junsong Yuan, Jun Liu, Chuanxia Zheng, Sijie Yan, Henghui Ding, et al. A unified 3d human motion synthesis model via conditional variational auto-encoder. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11645–11655, 2021.
    [30] Wei Mao, Miaomiao Liu, and Mathieu Salzmann. History repeats itself: Human motion prediction via motion attention. In European Conference on Computer Vision, pages 474–489. Springer, 2020.
    [31] Hanbyul Joo, Hao Liu, Lei Tan, Lin Gui, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. Panoptic studio: A massively multiview system for social motion capture. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
    [32] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1010–1019, 2016.
    [33] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7291–7299, 2017.
    [34] Yujun Cai, Liuhao Ge, Jun Liu, Jianfei Cai, Tat-Jen Cham, Junsong Yuan, and Nadia Magnenat Thalmann. Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2272–2281, 2019.
    [35] Jun Liu, Henghui Ding, Amir Shahroudy, Ling-Yu Duan, Xudong Jiang, Gang Wang, and Alex C Kot. Feature boosting network for 3d pose estimation. IEEE transactions on pattern analysis and machine intelligence, 42(2):494–501, 2019.
    [36] Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE international conference on computer vision, pages 2640–2649, 2017.
    [37] Yujun Cai, Liuhao Ge, Jianfei Cai, and Junsong Yuan. Weakly-supervised 3d hand pose estimation from monocular rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 666–682, 2018.
    [38] Joela F Gauss, Christoph Brandin, Andreas Heberle, and Welf Löwe. Smoothing skeleton avatar visualizations using signal processing technology. SN Computer Science, 2(6):1–17, 2021.

    Full-text availability: on campus 2023-08-30; off campus 2023-08-30