| Graduate Student: | 鄭煜醴 Cheng, Yu-Li |
|---|---|
| Thesis Title: | 基於跨模態提示工程的影像分割改良方法-以CLIPSeg為例 An Improved Method for Image Segmentation Based on Cross-Modal Prompt Engineering: A Case Study of CLIPSeg |
| Advisor: | 賀保羅 Horton, Paul |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2025 |
| Academic Year of Graduation: | 113 |
| Language: | English |
| Number of Pages: | 50 |
| Keywords (Chinese): | 深度學習、提示工程、影像處理、影像分割、多模態模型、PyTorch |
| Keywords (English): | Deep Learning, Prompt Engineering, Image Processing, Image Segmentation, Multimodal Model, PyTorch |
In recent years, vision-language pretraining (VLP) models have shown outstanding performance on zero-shot image understanding tasks. Among them, CLIPSeg is a semantic segmentation model that extends the CLIP architecture and guides its decoder with static textual prompts. Its prompt fusion strategy, however, is relatively simplistic and lacks the ability to adapt to diverse visual contexts.
This study proposes a novel Cross-Modal Prompt Encoder (CMPE) to strengthen the semantic alignment and contextual adaptability of multimodal segmentation models, taking CLIPSeg as the case study. The CMPE module defines a set of learnable query vectors that actively interact with image features through a cross-modal attention mechanism to dynamically extract semantic prompts; an Adaptive Prompt Fusion mechanism then merges the resulting visual prompts with the textual prompts, improving the model's grasp of the task context.
Extensive experiments on the PASCAL VOC 2012 dataset show that introducing the CMPE module effectively improves segmentation accuracy. Ablation studies examine the prompt format (sentence vs. single word), the number of query vectors, a comparison with the Context Optimization (CoOp) method, and the prompt fusion strategy (addition vs. concatenation) to verify each component's contribution to model performance.
Overall, this study confirms that learning image-aware prompts through cross-modal semantic queries can effectively enhance the semantic understanding of multimodal segmentation models, offering a promising design direction and implementation framework for applying prompt learning to vision-language tasks.
In recent years, Vision-Language Pretraining (VLP) models have demonstrated outstanding performance in zero-shot image understanding tasks. CLIPSeg, an extension of the CLIP architecture, applies static textual prompts to guide a segmentation decoder. However, its prompt fusion mechanism remains relatively simple and lacks adaptability to diverse visual contexts.
This thesis proposes a novel Cross-Modal Prompt Encoder (CMPE) to enhance the semantic alignment and contextual adaptability of multimodal segmentation models, using CLIPSeg as a base framework. The CMPE module introduces learnable query tokens that interact with image features via a cross-attention mechanism to extract dynamic visual prompts. These are then integrated with textual prompts through an Adaptive Prompt Fusion strategy, resulting in context-aware prompt representations for improved segmentation.
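To make the mechanism above concrete, the following PyTorch sketch shows one way such a module could be wired together: learnable query tokens attend to image patch features through cross-attention, and the pooled result is merged with the text embedding by a learned gate. The dimensions, the mean pooling, and the gated fusion are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn

class CrossModalPromptEncoder(nn.Module):
    """Minimal sketch of a CMPE-style module (sizes and fusion details are assumptions)."""

    def __init__(self, embed_dim: int = 512, num_queries: int = 4, num_heads: int = 8):
        super().__init__()
        # Learnable query tokens that will attend to the image features.
        self.queries = nn.Parameter(torch.randn(num_queries, embed_dim) * 0.02)
        # Cross-attention: queries attend to the image patch tokens.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        # Adaptive fusion: a learned gate decides how much visual prompt to mix
        # into the textual prompt (one possible reading of "adaptive prompt fusion").
        self.gate = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim), nn.Sigmoid())

    def forward(self, image_tokens: torch.Tensor, text_prompt: torch.Tensor) -> torch.Tensor:
        # image_tokens: (B, N_patches, D) from a frozen CLIP image encoder.
        # text_prompt:  (B, D) embedding from the CLIP text encoder.
        batch = image_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)        # (B, Q, D)
        visual_prompt, _ = self.cross_attn(q, image_tokens, image_tokens)
        visual_prompt = self.norm(visual_prompt).mean(dim=1)       # (B, D) pooled visual prompt
        g = self.gate(torch.cat([visual_prompt, text_prompt], dim=-1))
        return g * visual_prompt + (1.0 - g) * text_prompt         # context-aware prompt (B, D)
```

In CLIPSeg terms, the returned vector would take the place of the static conditional embedding that modulates the decoder (for example via FiLM-style conditioning), though how the fused prompt is consumed downstream is likewise an assumption of this sketch.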
Extensive experiments conducted on the PASCAL VOC 2012 dataset demonstrate that incorporating the CMPE module leads to improved segmentation accuracy. An ablation study further evaluates the impact of key design factors, including the prompt type (sentence vs. single token), the number of query tokens, a comparison with Context Optimization (CoOp)-based prompt learning, and the fusion strategy (addition vs. concatenation), thereby validating the effectiveness of each component.
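As a point of reference for the fusion ablation, the two strategies compared above can be sketched as follows; the projection used to bring the concatenated prompt back to the shared embedding dimension is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class AdditiveFusion(nn.Module):
    """Element-wise addition of visual and textual prompts (matching dimensions assumed)."""
    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        return visual + text                                   # (B, D)

class ConcatFusion(nn.Module):
    """Concatenation followed by a linear projection back to the prompt dimension."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([visual, text], dim=-1))    # (B, D)
```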
Overall, the proposed CMPE design highlights the potential of active cross-modal prompt learning in advancing multimodal segmentation, offering a promising framework for future research in vision-language tasks.
J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
H. M. Bong, R. Zhang, A. Robillard, R. de Azambuja, and G. Beltrame, “PEACE: Prompt engineering automation for CLIPSeg enhancement in aerial robotics,” arXiv preprint arXiv:2310.00085, 2023.
H. Caesar, J. Uijlings, and V. Ferrari, “COCO-Stuff: Thing and stuff classes in context,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1209–1218.
M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes challenge: A retrospective,” International Journal of Computer Vision, vol. 111, pp. 98–136, 2015.
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” International Conference on Machine Learning, PMLR, 2021, pp. 4904–4916.
S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg, “ReferItGame: Referring to objects in photographs of natural scenes,” Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 787–798.
M. U. Khattak, H. Rasheed, M. Maaz, S. Khan, and F. S. Khan, “MaPLe: Multi-modal prompt learning,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19113–19122.
P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language-image pretraining with frozen image encoders and large language models,” International Conference on Machine Learning, PMLR, 2023, pp. 19730–19742.
H. Lin, X. Cheng, X. Wu, and D. Shen, “CAT: Cross attention in vision transformer,” 2022 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2022, pp. 1–6.
I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
T. Lüddecke and A. Ecker, “Image segmentation using text and image prompts,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7086–7096.
F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully convolutional neural networks for volumetric medical image segmentation,” 2016 Fourth International Conference on 3D Vision (3DV), IEEE, 2016, pp. 565–571.
A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” 2017.
E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, 2018.
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” Proceedings of Machine Learning Research, vol. 139, M. Meila and T. Zhang, Eds., pp. 8748–8763, 18–24 Jul 2021. [Online]. Available: https://proceedings.mlr.press/v139/radford21a.html.
C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
Y. Rao, W. Zhao, G. Chen, Y. Tang, Z. Zhu, G. Huang, J. Zhou, and J. Lu, “DenseCLIP: Language-guided dense prediction with context-aware prompting,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18082–18091.
C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. Jorge Cardoso, “Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations,” Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, September 14, Proceedings 3, Springer, 2017, pp. 240–248.
C. Wu, Z. Lin, S. Cohen, T. Bui, and S. Maji, “PhraseCut: Language-based image segmentation in the wild,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10216–10225.
L. Yuan, D. Chen, Y.-L. Chen, N. Codella, X. Dai, J. Gao, H. Hu, X. Huang, B. Li, C. Li, et al., “Florence: A new foundation model for computer vision,” arXiv preprint arXiv:2111.11432, 2021.
S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al., “OPT: Open pre-trained transformer language models,” arXiv preprint arXiv:2205.01068, 2022.
B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ADE20K dataset,” International Journal of Computer Vision, vol. 127, pp. 302–321, 2019.
K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Conditional prompt learning for vision-language models,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16816–16825.
K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022.