
Author: 江啓睿 (Jiang, Ci-Ruei)
Title: 基於特徵擷取的衝突影片推理 (Video Reasoning for Conflict Events by Feature Extraction)
Advisor: 鄭憲宗 (Cheng, Sheng-Tzong)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2018
Graduation Academic Year: 106
Language: English
Number of Pages: 41
Chinese Keywords: 影片推理, 影片理解, 多任務學習
Keywords: Video Reasoning, Video Understanding, Multi-task Learning
    With the rapid growth of multimedia data and advances in deep learning, it is now possible to train high-accuracy models for applications in many fields. In video understanding, tasks such as video classification, temporal action detection, and video summarization can already be performed. In daily life, social incidents occur frequently, and they often begin with a small conflict. If the conflict event and the danger it poses could be inferred from video, such incidents could be prevented at an early stage.
    In this study, we propose a Video and Audio Reasoning Network (VARN) that infers possible conflict events from video and audio features. To help the model generalize to other tasks, we also add a predictive network that estimates the risk posed by a conflict event. In addition, we apply multi-task learning so that the video and audio features generalize better to related tasks, and we propose several methods for fusing video and audio features to improve the model's reasoning performance.

    With the rapid growth of multimedia data and the improvement of deep learning technology, people have been able to train high-accuracy models and apply them to various fields. In terms of video understanding, tasks such as video classification, temporal action detection, and video summarization are now feasible. In daily life, many social incidents happen, and most of them start with a small conflict. If we can reason out the conflict and the danger it causes from the video, we can prevent social incidents from happening at an early stage.
    In this research, we present a Video and Audio Reasoning Network (VARN) that infers possible conflict events from video and audio features. In order to make the model generalize better to other tasks, we also add a predictive network to predict the risk of conflict events. Moreover, we use multi-task learning to make the video and audio features more generalizable to other related tasks. We also propose several methods to fuse the video features and the audio features so that the reasoning performance of the model is better.
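    To make the fusion and multi-task design described above concrete, the following is a minimal sketch, assuming PyTorch; the concatenation-based fusion, feature dimensions, and numbers of conflict and risk classes are illustrative placeholders, not the exact feature extractors or fusion methods described in Chapter 3 of the thesis.

# Minimal sketch of the VARN idea from the abstract, assuming PyTorch.
# Backbone outputs, dimensions, and class counts are hypothetical placeholders.
import torch
import torch.nn as nn

class VARNSketch(nn.Module):
    def __init__(self, video_dim=1024, audio_dim=512, hidden_dim=512,
                 num_conflict_classes=10, num_risk_levels=3):
        super().__init__()
        # Fusion network: concatenate video and audio features, then project.
        self.fusion = nn.Sequential(
            nn.Linear(video_dim + audio_dim, hidden_dim),
            nn.ReLU(),
        )
        # Reasoning head: infers the conflict event category.
        self.reasoning_head = nn.Linear(hidden_dim, num_conflict_classes)
        # Predictive head: estimates the risk level of the conflict (auxiliary task).
        self.risk_head = nn.Linear(hidden_dim, num_risk_levels)

    def forward(self, video_feat, audio_feat):
        fused = self.fusion(torch.cat([video_feat, audio_feat], dim=-1))
        return self.reasoning_head(fused), self.risk_head(fused)

# Multi-task training step: the shared fused representation is optimized on both tasks.
model = VARNSketch()
video_feat = torch.randn(4, 1024)   # placeholder features from a video backbone
audio_feat = torch.randn(4, 512)    # placeholder features from an audio backbone
conflict_labels = torch.randint(0, 10, (4,))
risk_labels = torch.randint(0, 3, (4,))

conflict_logits, risk_logits = model(video_feat, audio_feat)
criterion = nn.CrossEntropyLoss()
loss = criterion(conflict_logits, conflict_labels) + criterion(risk_logits, risk_labels)
loss.backward()

    In this sketch, the fused representation is trained jointly on the conflict-reasoning loss and the auxiliary risk-prediction loss, which reflects the multi-task arrangement the abstract describes.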

    摘要 (Chinese Abstract) I
    Abstract II
    ACKNOWLEDGEMENT III
    LIST OF CONTENTS IV
    LIST OF FIGURES V
    LIST OF TABLES VI
    Chapter 1. Introduction & Motivation 1
      1.1 Introduction 1
      1.2 Motivation 2
      1.3 Thesis Overview 5
    Chapter 2. Background & Related Work 7
      2.1 Convolutional Neural Networks 7
      2.2 Video Understanding 9
      2.3 Audio Understanding 14
      2.4 Multi-task Learning 15
      2.5 Video Dataset 17
    Chapter 3. Model Design and Approach 19
      3.1 Problem Description 19
      3.2 Model Design 20
      3.3 Feature Extraction 22
      3.4 Fusion Network 25
      3.5 Reasoning Network and Predicted Network 26
    Chapter 4. Experiment 29
      4.1 System Implementation 29
      4.2 Experiment Environment and Settings 31
      4.3 Experiment Result 33
    Chapter 5. Conclusion & Future Work 37
    Reference 38


    On-campus access: available from 2021-07-25
    Off-campus access: not available
    The electronic thesis has not been authorized for public release; for the print copy, please consult the library catalog.