
Author: Cheng, Yu-Li (鄭煜醴)
Title: An Improved Method for Image Segmentation Based on Cross-Modal Prompt Engineering: A Case Study of CLIPSeg (基於跨模態提示工程的影像分割改良方法-以CLIPSeg為例)
Advisor: Horton, Paul (賀保羅)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2025
Graduating Academic Year: 113
Language: English
Pages: 50
Keywords: Deep Learning, Prompt Engineering, Image Processing, Image Segmentation, Multimodal Model, PyTorch

    In recent years, Vision-Language Pretraining (VLP) models have demonstrated outstanding performance in zero-shot image understanding tasks. CLIPSeg, an extension of the CLIP architecture, applies static textual prompts to guide a segmentation decoder. However, its prompt fusion mechanism remains relatively simple and lacks adaptability to diverse visual contexts.
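    For context, the static-prompt behaviour described above can be reproduced with the publicly released CLIPSeg weights. The minimal sketch below assumes the Hugging Face Transformers port of CLIPSeg (CLIPSegProcessor, CLIPSegForImageSegmentation, and the CIDAS/clipseg-rd64-refined checkpoint) rather than the code used in this thesis; the file name and prompt strings are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

# Public CLIPSeg checkpoint (Hugging Face port; not the thesis implementation).
processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("example.jpg").convert("RGB")   # any RGB test image (placeholder path)
prompts = ["a dog", "a bicycle"]                   # one static text prompt per query

inputs = processor(text=prompts, images=[image] * len(prompts),
                   padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One low-resolution logit map per prompt; apply a sigmoid and a threshold to get masks.
masks = torch.sigmoid(outputs.logits) > 0.5
```

    Because the text prompt is encoded once and reused unchanged for every image, the decoder receives the same conditioning regardless of image content, which is precisely the limitation the proposed CMPE targets.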

    This thesis proposes a novel Cross-Modal Prompt Encoder (CMPE) to enhance the semantic alignment and contextual adaptability of multimodal segmentation models, using CLIPSeg as a base framework. The CMPE module introduces learnable query tokens that interact with image features via a cross-attention mechanism to extract dynamic visual prompts. These are then integrated with textual prompts through an Adaptive Prompt Fusion strategy, resulting in context-aware prompt representations for improved segmentation.
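    To make the mechanism concrete, the following is a minimal PyTorch sketch of a CMPE-style module written against the description above; the dimensions, module names, and the mean-pooling over query outputs are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

class CrossModalPromptEncoder(nn.Module):
    """Sketch of a CMPE-style prompt encoder (illustrative, hypothetical sizes)."""

    def __init__(self, dim: int = 512, num_queries: int = 4,
                 num_heads: int = 8, fusion: str = "add"):
        super().__init__()
        # Learnable query tokens that actively probe the image for context.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        # Cross-attention: queries attend over the image patch features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.fusion = fusion
        if fusion == "concat":
            # Project the concatenated [visual ; text] prompt back to model width.
            self.proj = nn.Linear(2 * dim, dim)

    def forward(self, image_feats: torch.Tensor, text_prompt: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, N_patches, dim) from the frozen CLIP image encoder
        # text_prompt: (B, dim) pooled CLIP text embedding of the class prompt
        batch = image_feats.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)       # (B, Q, dim)
        visual_prompt, _ = self.cross_attn(queries, image_feats, image_feats)
        visual_prompt = self.norm(visual_prompt).mean(dim=1)            # (B, dim)
        if self.fusion == "add":
            return visual_prompt + text_prompt                          # additive fusion
        return self.proj(torch.cat([visual_prompt, text_prompt], dim=-1))

# Example with hypothetical sizes: 2 images, 196 CLIP patch tokens, width 512.
cmpe = CrossModalPromptEncoder()
fused = cmpe(torch.randn(2, 196, 512), torch.randn(2, 512))             # -> (2, 512)
```

    In CLIPSeg the conditional embedding modulates the decoder through FiLM layers; a fused, image-aware prompt such as the one returned above could be substituted for that static conditional embedding.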

    Extensive experiments on the PASCAL VOC 2012 dataset demonstrate that incorporating the CMPE module improves segmentation accuracy. An ablation study further evaluates the impact of key design factors, including prompt type (sentence vs. token), the number of query tokens, a comparison with Context Optimization (CoOp)-based prompt learning, and the fusion strategy (addition vs. concatenation), thereby validating the effectiveness of each component.
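    The CoOp-style baseline referenced in the ablation replaces the hand-written prompt template with continuous context vectors learned end-to-end. The sketch below is a deliberately simplified illustration of that idea (it omits the SOS/EOS handling and positional details of the real CLIP text encoder); the names and sizes are assumptions, not the thesis code.

```python
import torch
import torch.nn as nn

class CoOpStylePrompt(nn.Module):
    """Simplified CoOp-style learnable text context (illustration only)."""

    def __init__(self, ctx_len: int = 16, dim: int = 512):
        super().__init__()
        # Continuous "words" that replace a template such as "a photo of a {class}".
        self.ctx = nn.Parameter(torch.empty(ctx_len, dim))
        nn.init.normal_(self.ctx, std=0.02)

    def forward(self, class_token_embeds: torch.Tensor) -> torch.Tensor:
        # class_token_embeds: (num_classes, L_cls, dim) token embeddings of class names
        n = class_token_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n, -1, -1)              # (n, ctx_len, dim)
        # Prepend the learned context to every class-name embedding sequence,
        # then pass the result through the (frozen) CLIP text encoder.
        return torch.cat([ctx, class_token_embeds], dim=1)         # (n, ctx_len + L_cls, dim)
```

    Unlike the CMPE sketched earlier, these context vectors are fixed for all images once training ends, which is the main contrast with the image-conditioned prompts examined in the ablation.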

    Overall, the proposed CMPE design highlights the potential of active cross-modal prompt learning in advancing multimodal segmentation, offering a promising framework for future research in vision-language tasks.

    Table of Contents:
    Abstract (Chinese)
    Abstract
    Acknowledgements
    Contents
    List of Tables
    List of Figures
    Nomenclature
    1 Introduction
      1.1 Background
      1.2 Motivation and Problem Statement
      1.3 Proposed Solution
      1.4 Contributions
      1.5 Research Goal
    2 Related Work
      2.1 Vision-Language Pretraining and CLIP
      2.2 Prompt-Based Vision Segmentation
      2.3 Prompt Engineering in VLP Models
      2.4 Q-Former and Semantic Querying
      2.5 Visual Prompt Fusion and Semantic Alignment
    3 Methods
      3.1 Overall Architecture
      3.2 Cross-Modal Prompt Encoder (CMPE)
        3.2.1 Learnable Query Tokens
        3.2.2 Cross-Attention with Image Features
        3.2.3 Prompt Fusion Strategy
        3.2.4 Decoder Integration
      3.3 Loss Function
      3.4 Implementation Details
    4 Experiments
      4.1 Experiment Setup
      4.2 Baseline Comparison
      4.3 Effect of Dice Loss
      4.4 Ablation Study
        4.4.1 Prompt Design: Sentence vs. Token
        4.4.2 Effect of Query Token Number
        4.4.3 Comparison with CoOp-based Prompt Learning
        4.4.4 Fusion Strategy: Addition vs. Concatenation
      4.5 Qualitative Analysis
    5 Discussion
      5.1 Limitations
      5.2 Future Work
    6 Conclusion
    Bibliography

    J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.

    H. M. Bong, R. Zhang, A. Robillard, R. de Azambuja, and G. Beltrame, “PEACE: Prompt engineering automation for CLIPSeg enhancement in aerial robotics,” arXiv preprint arXiv:2310.00085, 2023.

    H. Caesar, J. Uijlings, and V. Ferrari, “COCO-Stuff: Thing and stuff classes in context,” Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1209–1218.

    M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes challenge: A retrospective,” International journal of computer vision, vol. 111, pp. 98–136, 2015.

    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

    C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” International conference on machine learning, PMLR, 2021, pp. 4904–4916.

    S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg, “ReferItGame: Referring to objects in photographs of natural scenes,” Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 787–798.

    M. U. Khattak, H. Rasheed, M. Maaz, S. Khan, and F. S. Khan, “MaPLe: Multi-modal prompt learning,” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 19113–19122.

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” Advances in neural information processing systems, vol. 33, pp. 9459–9474, 2020.

    J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” International conference on machine learning, PMLR, 2023, pp. 19730–19742.

    H. Lin, X. Cheng, X. Wu, and D. Shen, “CAT: Cross attention in vision transformer,” 2022 IEEE international conference on multimedia and expo (ICME), IEEE, 2022, pp. 1–6.

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.

    T. Lüddecke and A. Ecker, “Image segmentation using text and image prompts,” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 7086–7096.

    F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully convolutional neural networks for volumetric medical image segmentation,” 2016 fourth international conference on 3D vision (3DV), IEEE, 2016, pp. 565–571.

    A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” 2017.

    E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” Proceedings of the AAAI conference on artificial intelligence, vol. 32, 2018.

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” Proceedings of Machine Learning Research, vol. 139, M. Meila and T. Zhang, Eds., pp. 8748–8763, 18–24 Jul 2021. [Online]. Available: https://proceedings.mlr.press/v139/radford21a.html.

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020.

    Y. Rao, W. Zhao, G. Chen, Y. Tang, Z. Zhu, G. Huang, J. Zhou, and J. Lu, “DenseCLIP: Language-guided dense prediction with context-aware prompting,” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 18082–18091.

    C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. Jorge Cardoso, “Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations,” Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, September 14, Proceedings 3, Springer, 2017, pp. 240–248.

    C. Wu, Z. Lin, S. Cohen, T. Bui, and S. Maji, “PhraseCut: Language-based image segmentation in the wild,” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10216–10225.

    L. Yuan, D. Chen, Y.-L. Chen, N. Codella, X. Dai, J. Gao, H. Hu, X. Huang, B. Li, C. Li, et al., “Florence: A new foundation model for computer vision,” arXiv preprint arXiv:2111.11432, 2021.

    S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al., “OPT: Open pre-trained transformer language models,” arXiv preprint arXiv:2205.01068, 2022.

    B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ADE20K dataset,” International Journal of Computer Vision, vol. 127, pp. 302–321, 2019.

    K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Conditional prompt learning for vision-language models,” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 816–16 825.

    K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022.
