| Graduate Student: | 許少綸 Hsu, Shao-Lun |
|---|---|
| Thesis Title: | MERTIS: Modular Enhancements for Real-Time Instance Segmentation with Self-Supervised Visual Priors |
| Advisor: | 楊家輝 Yang, Jar-Ferr |
| Degree: | Master |
| Department: | Institute of Computer & Communication Engineering, College of Electrical Engineering and Computer Science |
| Publication Year: | 2025 |
| Academic Year: | 113 |
| Language: | English |
| Pages: | 54 |
| Keywords: | deep learning, computer vision, instance image segmentation, vision transformer |
Instance segmentation, which aims to simultaneously detect and delineate individual object instances within an image, plays a critical role in many real-time vision applications such as autonomous driving, robotics, and augmented reality. However, achieving high segmentation accuracy under strict latency and resource constraints remains a significant challenge. In this work, we propose an efficient instance segmentation framework that improves performance through systematic micro-level architectural refinements and the incorporation of self-supervised visual priors. Our method focuses on optimizing lightweight modules within the segmentation pipeline to reduce computational overhead while maintaining precise and fine-grained mask predictions. Furthermore, a ResNet-50 backbone pretrained using the self-supervised DINO approach is utilized to enhance semantic representation without relying on extra labeled data. Experiments conducted on the COCO dataset demonstrate that the proposed framework achieves a strong balance between inference speed and segmentation quality, making it suitable for real-time applications in resource-constrained environments.