
Author: Lin, Cheng-Guan (林承冠)
Title: Animal Action Detection Using Modified Slow and Fast Sample Rate Networks with Attentional Feature Fusion (基於注意力特徵融合與慢快取樣網絡之動物動作偵測)
Advisor: Chu, Wei-Ta (朱威達)
Degree: Master
Department: MS Degree Program on Intelligent Technology Systems, Miin Wu School of Computing
Year of Publication: 2023
Graduation Academic Year: 111 (2022–2023)
Language: Chinese
Pages: 61
Keywords: Feature Fusion, Attention Mechanism, Action Detection, Animal Behavior Detection, Residual Neural Network, 3D CNN
To reduce the negative impact of public construction projects on the ecological environment, the Technical Division of the Public Construction Commission, Executive Yuan, issued the Public Construction Ecological Consideration Notes on April 25, 2017. Ecological consideration requires monitoring over several months of the construction cycle, during which manually reviewing surveillance footage to identify species and their actions is extremely time-consuming. This thesis therefore proposes an action detection approach that extracts features from video to analyze animal behavior, easing the tedious manual work of analyzing large volumes of surveillance footage during ecological consideration. We improve the slow and fast sample rate network architecture, which was originally designed for human data, analyze a dataset of wild animals in Taiwan's mountains provided by Hanlin Ecological Consulting Co., compare it with the AVA human dataset, and propose improvements suited to animal data. The experiments show three results. Because the fixed camera view causes occlusion, increasing the number of samples in the temporal domain effectively reduces missed detections when animals are occluded while entering or leaving the frame, improving mAP by 0.79%. Because the background is complex and contains moving objects, a modified hybrid attention makes the network focus on the action features of foreground animals, improving mAP by up to 3.14%. Because animal actions span long durations, a modified self-attention enhances features to capture long-range dependencies, observing multiple temporal segments and action sequences to understand the context of the current action, improving mAP by up to 2.81%.
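
The slow and fast sample rate network the abstract builds on (SlowFast) feeds the same clip to two pathways at different temporal resolutions, and the proposed change increases the number of temporally sampled frames. Below is a minimal sketch of the sampling step, assuming PyTorch and the original SlowFast defaults (speed ratio α = 8, temporal stride τ = 16); the function name and values are illustrative, not the thesis's actual modified sample counts, which are not publicly available.

```python
import torch

def sample_slow_fast(video: torch.Tensor, alpha: int = 8, tau: int = 16):
    """Split one clip into slow and fast pathway inputs by temporal striding.

    video: (C, T, H, W) tensor, e.g. T = 64 raw frames.
    The slow pathway keeps every tau-th frame (low frame rate, high channel
    capacity); the fast pathway keeps every (tau // alpha)-th frame (high
    frame rate, lightweight channels). Sampling the fast pathway more densely,
    as the abstract describes, shortens the gap between sampled frames, so an
    animal briefly visible while entering or leaving the fixed camera view is
    less likely to fall entirely between samples.
    """
    slow = video[:, ::tau]                    # e.g. 4 frames out of 64
    fast = video[:, ::max(tau // alpha, 1)]   # e.g. 32 frames out of 64
    return slow, fast

clip = torch.randn(3, 64, 224, 224)           # dummy RGB clip
slow, fast = sample_slow_fast(clip)
print(slow.shape, fast.shape)  # torch.Size([3, 4, 224, 224]) torch.Size([3, 32, 224, 224])
```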

To mitigate the adverse impact of public construction projects on the ecological environment, the Technical Division of the Public Construction Commission, Executive Yuan, published the Public Construction Ecological Consideration Notes on April 25, 2017. Ecological consideration requires months of surveillance during the construction of public works, during which species and actions in a large volume of surveillance footage must be manually viewed and recognized. This thesis therefore proposes using action detection methods to extract features from video footage and analyze wildlife actions, streamlining the labor-intensive manual analysis of surveillance videos during ecological consideration. We improve the original slow and fast sample rate network architecture, which was designed mainly for human data, analyze a dataset of Taiwanese mountain wildlife provided by Hanlin Ecological Consulting Co., compare it with the AVA human dataset, and propose improvements suited to animal data. The experimental results show that, because the fixed camera angle causes occlusion, increasing the number of samples in the temporal domain effectively handles animals being occluded while entering and leaving the frame, improving mAP by 0.79%. Because the background is complex and contains dynamic objects, a modified hybrid attention makes the network focus on foreground animal motion features, improving mAP by up to 3.14%. Because animal actions last a long time, a modified self-attention enhances features to capture long-range dependencies, observing multiple temporal segments and action sequences to understand the current action context, improving mAP by up to 2.81%.
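
The two attention improvements described above correspond to well-known building blocks: a hybrid channel-plus-spatial attention in the style of CBAM for suppressing dynamic background clutter, and a non-local self-attention block for long-range space-time dependencies. The sketch below is a generic rendering of those blocks for 5D video features, assuming PyTorch; the module names, reduction ratio, and shapes are illustrative and not the thesis's modified versions, whose details are not publicly available.

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """'What' to attend to: reweight channels from pooled global statistics."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):                       # x: (N, C, T, H, W)
        avg = x.mean(dim=(2, 3, 4))             # (N, C) average-pooled descriptor
        mx = x.amax(dim=(2, 3, 4))              # (N, C) max-pooled descriptor
        w = torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        return x * w[:, :, None, None, None]

class SpatialAttention3D(nn.Module):
    """'Where' to attend to: highlight foreground positions such as the animal."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv3d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                       # x: (N, C, T, H, W)
        avg = x.mean(dim=1, keepdim=True)       # (N, 1, T, H, W)
        mx = x.amax(dim=1, keepdim=True)        # (N, 1, T, H, W)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

class NonLocal3D(nn.Module):
    """Self-attention over all space-time positions, capturing the long-range
    dependencies of slow animal actions across multiple temporal segments."""
    def __init__(self, channels: int):
        super().__init__()
        inner = channels // 2
        self.theta = nn.Conv3d(channels, inner, 1)
        self.phi = nn.Conv3d(channels, inner, 1)
        self.g = nn.Conv3d(channels, inner, 1)
        self.out = nn.Conv3d(inner, channels, 1)

    def forward(self, x):                       # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (N, THW, C/2)
        k = self.phi(x).flatten(2)                     # (N, C/2, THW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (N, THW, C/2)
        attn = torch.softmax(q @ k, dim=-1)            # pairwise similarities
        y = (attn @ v).transpose(1, 2).reshape(n, -1, t, h, w)
        return x + self.out(y)                          # residual connection

feat = torch.randn(2, 64, 8, 14, 14)                    # dummy backbone features
feat = SpatialAttention3D()(ChannelAttention3D(64)(feat))
feat = NonLocal3D(64)(feat)
print(feat.shape)                                       # torch.Size([2, 64, 8, 14, 14])
```

In a SlowFast-style detector, blocks like these would typically be inserted after selected residual stages of the backbone or at the points where the two pathways' features are fused; where exactly the thesis places its modified versions cannot be confirmed from this record.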

Abstract
Summary
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1  Introduction
    1.1  Research Background and Motivation
    1.2  Research Objectives
    1.3  Thesis Organization
Chapter 2  Literature Review
    2.1  Overview of the Task Domain
    2.2  2D and 3D Convolutional and Pooling Layers
        2.2.1  2D and 3D Convolutional Layers
        2.2.2  2D and 3D Pooling Layers
    2.3  Development of Action Detection in Deep Learning
        2.3.1  One-Stream Approaches
        2.3.2  Two-Stream Approaches
    2.4  Slow and Fast Sample Rate Networks
Chapter 3  Methodology
    3.1  Datasets
        3.1.1  Human Action Dataset
        3.1.2  Taiwanese Wildlife Dataset
    3.2  Slow and Fast Sample Rate Networks
    3.3  Animal Action Detection Using Modified Slow and Fast Sample Rate Networks with Attentional Feature Fusion
        3.3.1  Attention Mechanisms
        3.3.2  Modified Spatial- and Temporal-Domain Sample Counts
        3.3.3  Modified Feature Fusion for Feature Enhancement
        3.3.4  Modified Backbone Feature Enhancement
        3.3.5  Summary of the Modified Network
Chapter 4  Experimental Results
    4.1  Experimental Setup
    4.2  Dataset Construction and Action Categorization
        4.2.1  Dataset Construction Steps and Format
        4.2.2  Action Categories of the Human Action Dataset
        4.2.3  Statistics of the Animal Dataset
    4.3  Experiments
        4.3.1  Experiments on Modified Spatial- and Temporal-Domain Sample Counts
        4.3.2  Experiments on Modified Feature Fusion for Feature Enhancement
        4.3.3  Experiments on Modified Backbone Feature Enhancement
        4.3.4  Experiments on Combined Feature Fusion and Backbone Feature Enhancement
Chapter 5  Conclusion
Chapter 6  References

Available on campus: 2028-08-01
Available off campus: 2028-08-01
The electronic thesis has not been authorized for public release; please consult the library catalog for the print copy.