| Author: | Wang, Zhen-Lun (王振倫) |
|---|---|
| Thesis Title: | A Feature Fusion Attention Network for Video Summarization |
| Advisor: | Tai, Shen-Chuan (戴顯權) |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Electrical Engineering |
| Year of Publication: | 2022 |
| Graduation Academic Year: | 110 |
| Language: | English |
| Number of Pages: | 57 |
| Keywords: | video summarization, attention mechanism, co-attention mechanism, feature fusion |
The purpose of video summarization is to identify the most representative or important frames in an input video. This is a challenging task in computer vision because it requires a proper understanding of higher-level features.
Some existing methods rely only on per-frame features extracted from the input video, and therefore ignore the information carried by motion in the video. Other methods predict importance scores from separate frame and motion features, which prevents the information from the two sources from being properly integrated. To address these problems, the proposed method uses an attention-based architecture that fuses frame features and motion features: a co-attention mechanism combines features from the two different spaces so that the streams can interact, and a local attention mechanism is added so that the model can capture features over different temporal ranges. According to the experimental results, the method achieves competitive performance against state-of-the-art approaches on SumMe and TVSum, the benchmark datasets commonly used for video summarization.
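The co-attention and local attention ideas described above can be sketched as follows. This is a minimal illustrative example, not the thesis's actual network: the function names, the concatenation-based fusion, and the banded mask used as a stand-in for local attention are all assumptions for the sketch. Each stream (frame features, motion features) queries the other, and the two attended contexts are fused.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(frame_feats, motion_feats, window=None):
    """Cross-attend the frame stream to the motion stream and vice versa,
    then fuse the two attended streams by concatenation.

    frame_feats, motion_feats: (T, d) arrays for T frames.
    window: if given, restrict attention to a +/- window temporal band
            (a simple stand-in for a local attention mechanism).
    Returns a (T, 2*d) fused feature sequence.
    """
    T, d = frame_feats.shape
    scale = np.sqrt(d)

    def attend(queries, keys_values):
        scores = queries @ keys_values.T / scale         # (T, T) similarities
        if window is not None:
            idx = np.arange(T)
            band = np.abs(idx[:, None] - idx[None, :]) <= window
            scores = np.where(band, scores, -np.inf)     # mask distant frames
        return softmax(scores, axis=-1) @ keys_values    # (T, d) context

    motion_context = attend(frame_feats, motion_feats)   # frames query motion
    frame_context = attend(motion_feats, frame_feats)    # motion queries frames
    return np.concatenate([motion_context, frame_context], axis=-1)
```

In a real model the fused sequence would then be fed to a scoring head that predicts per-frame importance; restricting `window` forces each frame to attend only to its temporal neighborhood, which mirrors the role of the local attention branch.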