| Author: | 侯廷錡 Hou, Ting-Chi |
|---|---|
| Thesis Title: | 用於指稱分割之基於對比式語言-圖像預訓練的雙向多模態注意力 CLIP-based Bi-directional Multimodal Attention for Referring Image Segmentation |
| Advisor: | 楊家輝 Yang, Jar-Ferr |
| Degree: | Master |
| Department: | 敏求智慧運算學院 Miin Wu School of Computing - MS Degree Program on Intelligent Technology Systems |
| Year of Publication: | 2025 |
| Academic Year of Graduation: | 113 |
| Language: | English |
| Pages: | 57 |
| Keywords (Chinese): | 機器人、指稱圖像分割、對比式語言與圖像預訓練、開放語意、多模態對齊 |
| Keywords (English): | robotics, referring image segmentation, CLIP, open-vocabulary, multimodal alignment |
As service robots, industrial robots, and other robotic systems increasingly enter real-world environments, the development of visual recognition technology has become especially critical. Image segmentation, in particular, is widely used for object recognition, navigation, and manipulation. However, in real robotic applications, mainstream segmentation methods rely mainly on predefined class labels and cannot flexibly handle the diverse or novel objects that arise in robotic tasks. Moreover, most existing methods only support the setting in which one description corresponds to a single target, making them ill-suited to practical situations where multiple targets, or no target at all, may be present. Developing segmentation techniques that support open-vocabulary descriptions, many-to-many matching, and zero-target prediction is therefore essential for improving the autonomous perception capability of robots.
This work proposes a novel referring image segmentation model whose core is a bi-directional deep fusion neck that effectively aligns and fuses language features with multi-scale visual features. The model further incorporates semantic heatmaps produced by contrastive language-image pre-training (CLIP) to guide precise matching between language and visual features, substantially improving the comprehension of complex natural-language referring expressions. The method supports open-vocabulary expressions and handles images with no target, a single target, or multiple targets (zero/one/many), demonstrating strong flexibility and generalization. Experimental results show that, compared with recent state-of-the-art methods, the model has a clear advantage in zero-case recognition while achieving competitive overall segmentation accuracy.
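As an illustration only, the following minimal PyTorch sketch shows how a bi-directional cross-attention fusion block of the kind described above could be structured. The class name, dimensionality, and layer choices are assumptions for the sketch and do not reproduce the thesis implementation.

```python
# Hypothetical sketch of a bi-directional cross-attention fusion block,
# not the thesis implementation: names and dimensions are assumptions.
import torch
import torch.nn as nn


class BiDirectionalFusionBlock(nn.Module):
    """Updates language and visual tokens by attending in both directions."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # visual tokens attend to language tokens (language -> vision)
        self.v_from_l = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # language tokens attend to visual tokens (vision -> language)
        self.l_from_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_l = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor):
        # vis:  (B, H*W, C) flattened visual tokens from one feature scale
        # lang: (B, T, C)   word-level language tokens
        v_upd, _ = self.v_from_l(query=vis, key=lang, value=lang)
        l_upd, _ = self.l_from_v(query=lang, key=vis, value=vis)
        vis = self.norm_v(vis + v_upd)    # residual update of visual tokens
        lang = self.norm_l(lang + l_upd)  # residual update of language tokens
        return vis, lang


if __name__ == "__main__":
    block = BiDirectionalFusionBlock(dim=256)
    vis = torch.randn(2, 32 * 32, 256)   # one visual scale, flattened
    lang = torch.randn(2, 20, 256)       # 20 word tokens
    vis, lang = block(vis, lang)
    print(vis.shape, lang.shape)
```

In a multi-scale setting, a block like this would typically be applied per feature scale, with the updated language tokens carried from one scale to the next.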
With the widespread adoption of service and industrial robots in real-world scenarios, robust visual recognition has become increasingly important. Image segmentation is essential for tasks such as object recognition, navigation, and manipulation. However, most mainstream segmentation methods rely on fixed class labels and support only a one-to-one mapping between descriptions and targets, making them insufficient for diverse and open-ended robotic applications. In practice, robots often face scenes with multiple targets, no target at all, or previously unseen objects, which highlights the need for segmentation techniques that support open-vocabulary expressions and flexible zero/one/many target prediction.
We propose a novel referring image segmentation model that employs a bi-directional deep fusion neck to align and integrate language features with multi-scale visual features. By leveraging semantic heatmaps derived from a pre-trained CLIP model, our approach achieves precise language-vision matching and improved comprehension of complex referring expressions. The method generalizes well, and experimental results show clear advantages in zero-case recognition together with competitive overall segmentation performance compared with recent state-of-the-art methods.
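For illustration, the sketch below shows one plausible way a CLIP-based semantic heatmap could be computed, assuming dense patch embeddings have already been projected into the CLIP joint embedding space. The function name, normalization scheme, and tensor shapes are assumptions for the sketch rather than the thesis's actual procedure.

```python
# Minimal sketch of a CLIP-style semantic heatmap, assuming dense patch
# embeddings are already extracted from the image encoder (that extraction
# step is model-specific and omitted here).
import torch
import torch.nn.functional as F


def semantic_heatmap(patch_emb, text_emb, grid_hw):
    """Cosine similarity between each patch embedding and the sentence embedding.

    patch_emb: (B, N, C) dense visual embeddings in the CLIP joint space
    text_emb:  (B, C)    sentence-level embedding from the CLIP text encoder
    grid_hw:   (H, W)    spatial grid such that N == H * W
    returns:   (B, H, W) heatmap normalized to [0, 1] per image
    """
    patch_emb = F.normalize(patch_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = torch.einsum("bnc,bc->bn", patch_emb, text_emb)  # cosine similarity, (B, N)
    # per-image min-max normalization so the map can act as a soft spatial prior
    mn = sim.min(dim=1, keepdim=True).values
    mx = sim.max(dim=1, keepdim=True).values
    sim = (sim - mn) / (mx - mn + 1e-6)
    h, w = grid_hw
    return sim.view(sim.size(0), h, w)


if __name__ == "__main__":
    heat = semantic_heatmap(torch.randn(2, 14 * 14, 512), torch.randn(2, 512), (14, 14))
    print(heat.shape)  # torch.Size([2, 14, 14])
```

A heatmap of this form can then be used to modulate or gate the visual features before segmentation decoding, which is the role the semantic heatmap plays in the approach described above.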