| Author: | 李萱 Lee, Hsuan |
|---|---|
| Thesis title: | Symbolic Human Action Synthesis via Transformer Encoder (基於 Transformer Encoder 使用符號合成人類動作) |
| Advisor: | 蘇文鈺 Su, Wen-Yu |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering |
| Year of publication: | 2022 |
| Graduating academic year: | 110 |
| Language: | English |
| Pages: | 22 |
| Keywords (Chinese): | 3D motion synthesis, Transformer Encoder, symbolic representation |
| Keywords (English): | 3D human action synthesis, Transformer Encoder, symbolic representation |
Our work develops a symbolic representation of the human body pose. We divide the human pose into five body parts, and each part's posture is assigned a symbol from a finite set, so different postures are represented by different symbols. The larger the symbol set, the more accurately it can represent all possible movements of a body part. With this representation, a sequence of symbols can be converted into tokens for a Transformer encoder, which is then applied to synthesize the desired human motion.
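The abstract does not spell out how a posture is mapped to its symbol. A common way to realize such a finite symbol set (an assumption here, not the thesis's confirmed method) is nearest-centroid quantization against a per-part codebook; the sketch below assigns each body part's pose vector to the index of its closest codebook entry, yielding one symbol per part:

```python
import numpy as np

def assign_symbol(part_pose, codebook):
    """Map one body part's pose vector to the index of its nearest
    codebook entry -- the 'symbol' for that part.

    part_pose: (D,) flattened joint coordinates of one body part.
    codebook:  (K, D) array of K representative part poses (the symbol set).
    """
    dists = np.linalg.norm(codebook - part_pose, axis=1)
    return int(np.argmin(dists))

def pose_to_token(part_poses, codebooks):
    """Convert a full-body pose (one vector per body part) into a tuple of
    symbols -- five parts in the thesis setup."""
    return tuple(assign_symbol(p, cb) for p, cb in zip(part_poses, codebooks))

# Toy example: 5 body parts, each with a 6-dim pose and a 4-entry codebook.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(4, 6)) for _ in range(5)]
pose = [cb[2] + 0.01 * rng.normal(size=6) for cb in codebooks]  # near entry 2
print(pose_to_token(pose, codebooks))  # each part maps to symbol 2
```

A pose sequence quantized this way becomes a token sequence that a Transformer encoder can consume directly.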
The Transformer encoder is trained as follows: we collect large human action datasets and convert each pose in an action into tokens. The input to the Transformer encoder is the token sequence of a human action, and the training target is the ground truth of that action. By recombining tokens in different ways, we can synthesize human actions that do not appear in the training dataset.
The size of the token set affects the quality of the synthesized actions. We use the HumanAct12 and Human3.6M datasets to evaluate the errors introduced by the encoder at different token-set sizes.
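The trade-off measured here can be illustrated with a stand-in quantizer (plain k-means, not the thesis's actual pipeline): as the token set grows, the mean distance from each pose to its nearest token shrinks, which is the mechanism behind the error trend evaluated above.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; the centroids play the role of the token set."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(
            np.linalg.norm(points[:, None] - centroids[None], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster empties
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids

def quantization_error(points, centroids):
    """Mean distance from each pose to its nearest token -- a proxy for the
    reconstruction error measured in the thesis."""
    d = np.linalg.norm(points[:, None] - centroids[None], axis=2)
    return float(d.min(axis=1).mean())

rng = np.random.default_rng(1)
poses = rng.normal(size=(200, 3))  # synthetic stand-in for body-part poses
errors = {k: quantization_error(poses, kmeans(poses, k)) for k in (4, 16, 64)}
print(errors)  # error shrinks as the token set grows
```

The data here is synthetic; on real pose data the same monotone trend is what makes the token-set size a quality knob for the downstream encoder.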
On-campus access: not publicly available