Graduate Student: 林承冠 Lin, Cheng-Guan
Thesis Title: 基於注意力特徵融合與慢快取樣網絡之動物動作偵測 (Animal Action Detection Using Modified Slow and Fast Sample Rate Networks with Attentional Feature Fusion)
Advisor: 朱威達 Chu, Wei-Ta
Degree: 碩士 Master
Department: 敏求智慧運算學院 - 智慧科技系統碩士學位學程 (MS Degree Program on Intelligent Technology Systems)
Year of Publication: 2023
Graduation Academic Year: 111 (ROC academic year, 2022-2023)
Language: Chinese
Number of Pages: 61
Keywords (Chinese): 特徵融合、注意力機制、動作偵測、動物行為偵測、殘差神經網絡、三維卷積神經網路
Keywords (English): Feature Fusion, Attention Mechanism, Action Detection, Animal Behavior Detection, Residual Neural Network, 3D CNN
To mitigate the adverse impact of public construction projects on the ecological environment, the Technical Division of the Public Construction Commission, Executive Yuan, issued the "Public Construction Ecological Consideration Notes" on April 25, 2017. Ecological checks under these notes require months of monitoring across the construction cycle, during which the species and actions appearing in large volumes of surveillance footage must be viewed and identified manually, which is time-consuming. This thesis therefore proposes using action detection to extract features from the footage and analyze animal behavior, easing the tedious manual work of reviewing large amounts of surveillance video during ecological checks. We modify the slow and fast sample rate network architecture, which was originally designed for human data, analyze a dataset of wild animals from Taiwan's mountainous areas provided by Hanlin Ecological Consulting Co., compare it with the AVA human dataset, and propose improvements suited to animal data. Experimental results show three gains. Because the fixed camera viewpoint causes occlusion, increasing the number of samples in the temporal domain reduces cases where animals are occluded while entering or leaving the frame, improving mAP by 0.79%. Because the background is cluttered and contains moving objects, a modified hybrid attention module directs the network toward the action features of foreground animals, improving mAP by up to 3.14%. Because animal actions span long durations, a modified self-attention module enhances the features to capture long-range dependencies, observing multiple temporal segments and action sequences to understand the context of the current action, improving mAP by up to 2.81%.
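The abstract describes a two-rate (slow/fast) sampling backbone whose fused features are re-weighted by a hybrid channel-and-spatial attention so the network attends to foreground animals rather than the cluttered background. The sketch below is a minimal, illustrative PyTorch rendition of that idea; all layer widths, strides, sampling ratios, and module names here are assumptions for illustration, not the thesis implementation, whose pathway depths, fusion points, and attention placement are specified in the full text.

```python
# Minimal sketch: a toy slow/fast two-pathway 3D CNN with a CBAM-style
# channel + spatial attention applied to the fused features.
# Shapes and hyperparameters are illustrative assumptions only.
import torch
import torch.nn as nn


class ChannelSpatialAttention(nn.Module):
    """CBAM-like hybrid attention over 3D (T, H, W) feature maps."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Channel attention: squeeze over T*H*W, excite per channel.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatio-temporal attention computed from pooled channel statistics.
        self.spatial_conv = nn.Conv3d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, t, h, w = x.shape
        avg = x.mean(dim=(2, 3, 4))                       # (B, C)
        mx = x.amax(dim=(2, 3, 4))                        # (B, C)
        ca = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx))
        x = x * ca.view(b, c, 1, 1, 1)                    # re-weight channels
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)  # (B, 2, T, H, W)
        sa = torch.sigmoid(self.spatial_conv(pooled))     # (B, 1, T, H, W)
        return x * sa                                     # re-weight positions


class TwoRateBackbone(nn.Module):
    """Toy slow/fast two-pathway network with attention on the fused features."""
    def __init__(self, num_classes=10, alpha=4):
        super().__init__()
        self.alpha = alpha  # the fast pathway sees alpha x more frames
        self.slow = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3)),
            nn.BatchNorm3d(32), nn.ReLU(inplace=True),
        )
        self.fast = nn.Sequential(
            nn.Conv3d(3, 8, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(8), nn.ReLU(inplace=True),
        )
        self.attention = ChannelSpatialAttention(32 + 8)
        self.head = nn.Linear(32 + 8, num_classes)

    def forward(self, clip):
        # clip: (B, 3, T, H, W). The slow path keeps every alpha-th frame.
        slow = self.slow(clip[:, :, :: self.alpha])       # (B, 32, T/alpha, H', W')
        fast = self.fast(clip)                            # (B,  8, T,       H', W')
        fast = fast[:, :, :: self.alpha]                  # align temporally before fusion
        fused = self.attention(torch.cat([slow, fast], dim=1))
        return self.head(fused.mean(dim=(2, 3, 4)))       # global average pool -> logits


if __name__ == "__main__":
    model = TwoRateBackbone(num_classes=10)
    dummy = torch.randn(2, 3, 16, 112, 112)               # 2 clips of 16 frames
    print(model(dummy).shape)                              # torch.Size([2, 10])
```

As in the SlowFast design the abstract builds on, the lightweight high-frame-rate pathway is meant to capture motion while the heavier low-frame-rate pathway captures appearance; the abstract's first improvement corresponds to raising the temporal sampling so animals entering or leaving the fixed camera view are less likely to be missed.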
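For the long-duration actions the abstract mentions, the self-attention enhancement resembles a non-local block, in which every space-time position attends to every other position so that context from distant frames informs the current prediction. The following is a small illustrative sketch of such a block; the embedding size, placement, and residual form are assumptions, not the thesis configuration.

```python
# Minimal sketch: an embedded-Gaussian non-local (self-attention) block over
# (B, C, T, H, W) features, for capturing long-range space-time dependencies.
# Sizes are illustrative assumptions only.
import torch
import torch.nn as nn


class NonLocalBlock3D(nn.Module):
    def __init__(self, channels, inner=None):
        super().__init__()
        inner = inner or channels // 2
        self.theta = nn.Conv3d(channels, inner, kernel_size=1)  # query projection
        self.phi = nn.Conv3d(channels, inner, kernel_size=1)    # key projection
        self.g = nn.Conv3d(channels, inner, kernel_size=1)      # value projection
        self.out = nn.Conv3d(inner, channels, kernel_size=1)    # project back to C
        self.scale = inner ** -0.5

    def forward(self, x):
        b, c, t, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)    # (B, N, inner), N = T*H*W
        k = self.phi(x).flatten(2)                      # (B, inner, N)
        v = self.g(x).flatten(2).transpose(1, 2)        # (B, N, inner)
        attn = torch.softmax(q @ k * self.scale, dim=-1)  # (B, N, N): all-pairs attention
        y = (attn @ v).transpose(1, 2).reshape(b, -1, t, h, w)
        return x + self.out(y)                          # residual keeps the original features


if __name__ == "__main__":
    block = NonLocalBlock3D(channels=40)
    feats = torch.randn(2, 40, 4, 14, 14)
    print(block(feats).shape)                           # torch.Size([2, 40, 4, 14, 14])
```

A block like this would typically sit after a mid-level stage of a backbone such as the one sketched above, where the space-time feature map is small enough for the all-pairs attention matrix to remain affordable.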