| Graduate Student: | 吳坤融 (Wu, Kun-Rong) |
|---|---|
| Thesis Title: | 尺度混合偵測轉換器之效能提升與收斂加速 (Performance Improvement and Convergence Acceleration for Multi-scale Hybrid Detection Transformers) |
| Advisor: | 楊家輝 (Yang, Jar-Ferr) |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Institute of Computer & Communication Engineering |
| Publication Year: | 2025 |
| Graduation Academic Year: | 113 |
| Language: | English |
| Pages: | 52 |
| Keywords: | deep learning, object detection, detection transformer, convergence acceleration, vision transformer |
This thesis proposes a hybrid object detection framework that combines multi-scale convolutional features with the frozen semantic output of a vision transformer, addressing the slow convergence and weak small-object performance of transformer-based detectors. Specifically, geometric features extracted by a ConvNeXt backbone are fused with deep semantic features from the last layer of a frozen DINOv2 vision transformer; the two sources are unified through channel alignment and spatial resampling and then fed to a deformable encoder and an enhanced decoder. The decoder further introduces a dynamic query offset (DQO), which applies residual updates to the reference points so that their positions remain stable from layer to layer, improving localization consistency. In addition, a group-shared key-value mechanism in the attention module reduces redundant projections across heads, lowering memory usage and inference latency with minimal performance loss.
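To make the fusion step and the residual reference-point update concrete, two minimal PyTorch-style sketches follow. They are illustrative reconstructions rather than the thesis implementation: the class names (HybridFeatureFusion, DynamicQueryOffset), the additive fusion, the choice of bilinear resampling, and all parameter names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridFeatureFusion(nn.Module):
    """Fuse multi-scale CNN features with one frozen ViT feature map.

    Assumed inputs: cnn_feats is a list of (B, C_i, H_i, W_i) maps from a
    ConvNeXt-like backbone; vit_feat is a (B, C_vit, H_v, W_v) map obtained by
    reshaping the frozen DINOv2 last-layer patch tokens onto their patch grid.
    """

    def __init__(self, cnn_channels, vit_channels, d_model=256):
        super().__init__()
        # Channel alignment: 1x1 convolutions map every source to d_model channels.
        self.cnn_proj = nn.ModuleList([nn.Conv2d(c, d_model, 1) for c in cnn_channels])
        self.vit_proj = nn.Conv2d(vit_channels, d_model, 1)

    def forward(self, cnn_feats, vit_feat):
        vit_sem = self.vit_proj(vit_feat)                      # (B, d_model, H_v, W_v)
        fused = []
        for proj, feat in zip(self.cnn_proj, cnn_feats):
            geo = proj(feat)                                   # (B, d_model, H_i, W_i)
            # Spatial resampling: bring the ViT semantics to this CNN scale.
            sem = F.interpolate(vit_sem, size=geo.shape[-2:],
                                mode="bilinear", align_corners=False)
            fused.append(geo + sem)                            # simple additive fusion
        return fused                                           # multi-scale input for the deformable encoder


class DynamicQueryOffset(nn.Module):
    """Residual update of normalized reference points inside one decoder layer."""

    def __init__(self, d_model=256, step=0.1):
        super().__init__()
        self.offset_head = nn.Linear(d_model, 2)  # predicts (dx, dy) per query
        self.step = step                          # bounds the per-layer movement

    def forward(self, ref_points, queries):
        # ref_points: (B, N, 2) in [0, 1]; queries: (B, N, d_model)
        delta = torch.tanh(self.offset_head(queries)) * self.step
        return (ref_points + delta).clamp(0.0, 1.0)  # small residual step, stays on the image
```

Bounding the per-layer offset keeps each correction small, which is one simple way to read the residual-form stabilization of the reference points described above.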
Experimental results on the COCO 2017 dataset show that, even with a limited number of training epochs and a small batch size, the proposed method converges faster and achieves higher small-object AP than the original Deformable DETR. The frozen ViT branch provides stable semantic guidance, DQO improves localization stability, and the group-shared key-value mechanism saves computation and memory. Together, these three components form a resource-efficient detection framework that combines competitive accuracy with improved inference efficiency.
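The shared key-value idea can be sketched in the same hedged way: the module below follows the general pattern of sharing K/V projections across groups of attention heads (as in grouped-query attention), so the class name, num_kv_groups, and the broadcasting details are assumptions rather than the thesis's exact design.

```python
import torch
import torch.nn as nn


class GroupSharedKVAttention(nn.Module):
    """Multi-head attention whose key/value projections are shared within head groups.

    With num_heads = 8 and num_kv_groups = 2, only two K and two V projections are
    kept instead of eight, shrinking projection weights and the cached K/V tensors.
    """

    def __init__(self, d_model=256, num_heads=8, num_kv_groups=2):
        super().__init__()
        assert num_heads % num_kv_groups == 0
        self.h, self.g = num_heads, num_kv_groups
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # K and V are projected once per group, not once per head.
        self.k_proj = nn.Linear(d_model, self.g * self.d_head)
        self.v_proj = nn.Linear(d_model, self.g * self.d_head)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, key, value):
        B, Lq, _ = query.shape
        Lk = key.shape[1]
        q = self.q_proj(query).view(B, Lq, self.h, self.d_head).transpose(1, 2)  # (B, H, Lq, d)
        k = self.k_proj(key).view(B, Lk, self.g, self.d_head).transpose(1, 2)    # (B, G, Lk, d)
        v = self.v_proj(value).view(B, Lk, self.g, self.d_head).transpose(1, 2)  # (B, G, Lk, d)
        # Broadcast each group's K/V to the heads it serves.
        k = k.repeat_interleave(self.h // self.g, dim=1)                          # (B, H, Lk, d)
        v = v.repeat_interleave(self.h // self.g, dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, Lq, self.h * self.d_head)
        return self.out_proj(out)
```

Projecting K and V once per group rather than once per head cuts the K/V projection weights and the cached key/value tensors by a factor of num_heads / num_kv_groups, which is the kind of memory and latency saving the abstract points to.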