
Graduate student: 王別濬 (Wang, Bie-Jyun)
Thesis title: CVAE Text-to-Motion Generation with Key Frame Disentanglement (基於關鍵幀解耦的 CVAE 文本到動作生成)
Advisor: 蘇文鈺 (Su, Wen-Yu)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2024
Academic year of graduation: 112 (ROC calendar)
Language: English
Number of pages: 34
Keywords (Chinese): 條件變分自編碼器, 關鍵幀, 人體動作生成, 文本到動作
Keywords (English): Conditional Variational Autoencoder, Key Frame, Human Motion Generation, Text-to-Motion
  • Generating realistic and diverse human motion is a challenging task in computer animation. In this thesis, we propose a motion generation approach that combines a Conditional Variational Autoencoder (CVAE) with a key frame Transformer and a key frame reinforcement module. The method identifies and reinforces the key frames within a motion sequence so that the generated motion aligns with the input text label. We also introduce part-based processing that divides the human skeleton into hands, torso, and feet, allowing the model to capture the fine-grained movements specific to each body part. The generated motions show improved coherence, expressiveness, and correspondence to the given labels.
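
    The part-based processing mentioned in the abstract can be illustrated with a minimal sketch. This record does not give the thesis's skeleton layout, so the 7-joint toy skeleton, the joint indices in PART_GROUPS, and the helper name split_by_part are all hypothetical; the sketch only shows the idea of routing hands, torso, and feet into separate sub-sequences, not the author's implementation.

```python
from typing import Dict, List, Sequence, Tuple

Joint = Tuple[float, float, float]   # (x, y, z) position of one joint
Frame = Sequence[Joint]              # all joints at one time step

# Hypothetical joint indices for a toy 7-joint skeleton (assumption:
# the real skeleton and grouping used in the thesis are not stated here).
PART_GROUPS: Dict[str, List[int]] = {
    "hands": [0, 1],
    "torso": [2, 3, 4],
    "feet": [5, 6],
}


def split_by_part(
    motion: Sequence[Frame],
    groups: Dict[str, List[int]] = PART_GROUPS,
) -> Dict[str, List[List[Joint]]]:
    """Split a motion (frames x joints) into one sub-sequence per body part."""
    return {
        part: [[frame[j] for j in indices] for frame in motion]
        for part, indices in groups.items()
    }


# Two frames of the toy skeleton; joint j at time t sits at (t, j, 0).
motion = [[(float(t), float(j), 0.0) for j in range(7)] for t in range(2)]
parts = split_by_part(motion)
```

    Each entry of parts keeps the original frame count, so a per-part encoder could consume the sub-sequences in parallel before their features are merged again.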

    Abstract (Chinese)
    Abstract
    Acknowledgments
    Contents
    List of Tables
    List of Figures
    List of Symbols
    1 Introduction
    2 Related works
    2.1 Conditional Motion Generation
    2.2 Text Embedding
    2.3 PoseFormer
    2.3.1 Spatial Transformer Module
    2.3.2 Temporal Transformer Module
    3 Method
    3.1 Data and its preprocessing
    3.2 Model Architecture
    3.2.1 Transformer-based CVAE Model
    3.2.2 Key frame detection
    3.2.3 Key frame module
    3.3 Loss Function
    4 Experiment
    4.1 Datasets
    4.2 Result
    4.3 Conclusion and future work
    References


    Full-text availability: on campus, open immediately; off campus, open immediately