| Graduate Student: | 蔡佳樺 Tsai, Jia-Hua |
|---|---|
| Thesis Title: | 應用基於跨模態注意力的多模態融合於靜態影像動作辨識 Multimodal Fusion with Cross-Modal Attention for Action Recognition in Still Images |
| Advisor: | 朱威達 Chu, Wei-Ta |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Publication Year: | 2022 |
| Graduation Academic Year: | 110 (2021–2022) |
| Language: | English |
| Pages: | 25 |
| Keywords (Chinese): | 動作辨識, 跨模態注意力, 特徵融合 |
| Keywords (English): | Action Recognition, Feature Fusion, Cross-Modal Attention |
We propose a cross-modal attention module that combines information from different contextual cues and different modalities to achieve better action recognition in still images. The proposed framework uses features extracted from the entire image, features from the detected human region, and skeleton data produced by an off-the-shelf human pose estimation model. The core idea of the cross-modal attention module comes from the Transformer: one contextual cue (modality) serves as the query vector of the attention mechanism while another serves as the key vector, so that feature vectors from different cues (modalities) interact to yield a better action representation and, ultimately, better recognition performance. In this thesis, we show that the proposed framework outperforms existing methods without requiring additional training data, and we present ablation results under different settings.
We propose a cross-modal attention module that combines information from different cues and different modalities to achieve state-of-the-art action recognition in still images. Feature maps are extracted from the entire image, from the detected human bounding box, and from the detected human skeleton. The main idea of cross-modal attention stems from the Transformer structure: we design the interaction between the query vector derived from one cue/modality and the key vector derived from another. Feature maps from different cues/modalities thus cross-reference each other, so that better representations can be obtained and better performance achieved. We show that the proposed framework outperforms state-of-the-art systems without requiring extra training data. We also conduct ablation studies to investigate how different settings affect the final results.
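The abstract describes a Transformer-style attention in which the query comes from one cue/modality and the key from another. Below is a minimal PyTorch sketch of that idea; the layer sizes, the single-head formulation, the feature shapes, and the choice to draw values from the same cue as the keys are illustrative assumptions, not the thesis's exact design.

```python
# A minimal sketch of cross-modal attention between two cues/modalities,
# assuming PyTorch. All dimensions and the value projection are
# illustrative assumptions, not the architecture from the thesis.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAttention(nn.Module):
    """Attend from one cue/modality (queries) to another (keys/values)."""

    def __init__(self, dim_q: int, dim_kv: int, dim_attn: int = 256):
        super().__init__()
        self.proj_q = nn.Linear(dim_q, dim_attn)   # queries from cue A
        self.proj_k = nn.Linear(dim_kv, dim_attn)  # keys from cue B
        self.proj_v = nn.Linear(dim_kv, dim_attn)  # values from cue B (assumed)
        self.scale = dim_attn ** -0.5

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a: (batch, n_a, dim_q), feat_b: (batch, n_b, dim_kv)
        q = self.proj_q(feat_a)
        k = self.proj_k(feat_b)
        v = self.proj_v(feat_b)
        # Scaled dot-product attention, as in the Transformer.
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # Cue-A tokens re-expressed using information from cue B.
        return attn @ v


# Hypothetical usage with the three cues named in the abstract.
img_feat = torch.randn(2, 49, 2048)   # e.g. flattened 7x7 whole-image feature map
box_feat = torch.randn(2, 49, 2048)   # features from the detected human box
pose_feat = torch.randn(2, 17, 128)   # e.g. 17 skeleton joint embeddings

box_attends_pose = CrossModalAttention(dim_q=2048, dim_kv=128)
fused = box_attends_pose(box_feat, pose_feat)  # shape: (2, 49, 256)
```

One plausible way to fuse all three cues, consistent with the description, would be to apply such a module pairwise (e.g., box features attending to pose features and vice versa) and concatenate the resulting representations before the classifier; the actual pairing scheme used in the thesis is not specified here.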