
Student: Shih, Mu-Jan (施穆然)
Thesis Title: End-to-End Spatio-Temporal Real-Time Action Detection and Group Activity Recognition in Volleyball Match (排球比賽中的端到端時空即時動作檢測與群體活動識別)
Advisor: Hsu, Yi-Yu (徐禕佑)
Degree: Master
Department: MS Degree Program on Intelligent Technology Systems, 敏求智慧運算學院 (Miin Wu School of Computing)
Year of Publication: 2024
Graduation Academic Year: 112 (ROC calendar)
Language: Chinese
Number of Pages: 73
Keywords: real-time object detection, action detection, group activity recognition, DETR, Transformer
Access counts: 53 views, 30 downloads
    Abstract: Real-time object detection models have shown great potential in fields such as autonomous driving and intelligent surveillance, yet they struggle to capture fast player movements and group activities in ball sports. Traditional object detection models are designed for static or slowly moving objects and are ill-suited to the rapidly changing environment of fast-paced ball games; they often fail to capture subtle, dynamic motion sequences, especially when player actions must be analyzed in real time. To address these limitations, this study proposes a real-time action detection model based on the DETR (DEtection TRansformer) architecture. By exploiting the Transformer's self-attention mechanism, the model processes global information across image sequences, improving dynamic tracking and object recognition efficiency. Compared with the conventional YOLO (You Only Look Once) architecture, the DETR model removes the NMS (non-maximum suppression) post-processing step, reducing latency variance during inference and strengthening real-time performance. In addition to detecting individual actions, the model also predicts group activities, increasing its value for multi-player sports such as volleyball. Experimental results show that the model accurately captures and analyzes each player's actions in real time during volleyball matches. The detection results can further drive automatic video editing, automatically segmenting and assembling clips of specific match actions to give spectators a more polished viewing experience. These findings not only broaden the application scope of the DETR architecture but also provide technical groundwork and application prospects for real-time action detection.

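    To make the no-NMS point above concrete, the short PyTorch sketch below shows how a DETR-style detector's per-frame outputs can be decoded with a confidence threshold alone, in contrast to YOLO-style pipelines that must suppress overlapping candidate boxes. This is a minimal illustrative sketch, not the thesis's actual implementation; the function name, the 100-query count, the nine action classes, and the 0.5 threshold are all assumed for illustration.

        import torch

        def decode_detr_outputs(class_logits, pred_boxes, score_thresh=0.5):
            # class_logits: (num_queries, num_classes + 1) per-query logits,
            #               where the last column is the "no object" class.
            # pred_boxes:   (num_queries, 4) normalized (cx, cy, w, h) boxes.
            # Returns kept labels, scores, and boxes. No NMS step is applied,
            # because each object query is trained to claim at most one object.
            probs = class_logits.softmax(-1)[:, :-1]   # drop the "no object" column
            scores, labels = probs.max(-1)              # best class per query
            keep = scores > score_thresh                # plain thresholding only
            return labels[keep], scores[keep], pred_boxes[keep]

        # Toy usage with random tensors standing in for one frame's outputs
        # (100 object queries and 9 player-action classes are illustrative numbers).
        logits = torch.randn(100, 9 + 1)
        boxes = torch.rand(100, 4)
        labels, scores, kept = decode_detr_outputs(logits, boxes)
        print(labels.shape, scores.shape, kept.shape)

    Because the fixed set of query predictions is used directly, the per-frame decoding cost is constant, which is one reason a DETR-style pipeline can have more stable inference latency than an NMS-based pipeline whose cost varies with the number of candidate boxes.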
    Table of Contents:
    Chinese Abstract
    Abstract
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1  Introduction
      1.1 Foreword
      1.2 Research Motivation
      1.3 Research Contributions
    Chapter 2  Background
      2.1 Object Detection
        2.1.1 Two-Stage Detectors
        2.1.2 One-Stage Detectors
        2.1.3 Faster R-CNN
        2.1.4 YOLO
        2.1.5 DETR
      2.2 Action Detection
        2.2.1 3D Convolutional Neural Networks (3D-CNN)
        2.2.2 Two-Stream Networks
        2.2.3 Temporal Segment Networks (TSN)
        2.2.4 Transformer-Based Video Understanding
      2.3 Graph Neural Networks
    Chapter 3  Related Work
      3.1 RT-DETR
      3.2 Deformable DETR
      3.3 DINO
      3.4 X3D
      3.5 VideoMAE
      3.6 RepVGG
    Chapter 4  Models for Group Activity Detection
      4.1 GroupFormer
      4.2 COMPOSER
      4.3 DIN
      4.4 Tamura
      4.5 GAR
    Chapter 5  Experimental Methods
      5.1 Dataset
      5.2 Data Preprocessing
        5.2.1 Group Activity Recognition Preprocessing
        5.2.2 Bounding Box and Image Preprocessing
      5.3 Evaluation Methods
        5.3.1 Cosine Embedding Loss
        5.3.2 Cross Entropy Loss
        5.3.3 Mean Average Precision (mAP)
        5.3.4 Mean Class Accuracy (MCA)
      5.4 Research Objectives and Background
      5.5 Group Activity and Key Event Recognition
      5.6 Bounding-Box-Assisted Group Activity Recognition
      5.7 Predicting Player Actions with a Real-Time Object Detection Model
      5.8 Real-Time Action Detection Architecture
      5.9 Action Detection and Group Activity Recognition
      5.10 I3D-Based Multi-Frame Bounding Box Prediction
      5.11 Conclusion
      5.12 Contributions
    References


    Full text availability: On campus: immediately available. Off campus: immediately available.