
Author: 李易庭 (Li, Yi-Ting)
Title (Chinese): GViG:基於提示的語言建模用於視覺問答的生成式視覺定位
Title (English): GViG: Generative Visual Grounding using Prompt-based Language Modeling for Visual Question Answering
Advisor: 高宏宇 (Kao, Hung-Yu)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of Publication: 2023
Academic Year of Graduation: 111
Language: English
Number of Pages: 53
Keywords (Chinese): 跨模態機器學習、視覺問答、視覺定位、可解釋性、提示微調
Keywords (English): Cross-modal Machine Learning, Visual Question Answering, Visual Grounding, Interpretability, Prompt Tuning

Abstract: This study presents a novel approach to the visual grounding task in visual question answering by integrating two key modules: a prompt tuning module and a visual grounding (VG) module. Our method recasts visual grounding as a language modeling task, unifying the textual and visual inputs, outputs, and objective function, and thereby avoiding the complexity of traditional visual grounding pipelines. In addition, our model uses prompt tuning to draw on the strength of large pre-trained VQA models, taking their predictions as helpful hints that improve its understanding of the task while reducing the high computational cost of conventional approaches. Our model performs strongly on the highly competitive WSDM 2023 Toloka VQA dataset, demonstrating its robustness and effectiveness. Notably, this performance was achieved with a single Nvidia RTX 3090 GPU, underscoring its efficiency compared with larger models that are two to three times its size and are trained on 582 times as much data. Detailed case studies further reveal the model's capabilities, particularly its ability to dynamically link textual and visual information and to shift its attention according to the given question or hints, demonstrating its flexibility, adaptability, and interpretability.
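To make the formulation in the abstract concrete, the following is a minimal sketch, assuming a Pix2Seq-style coordinate quantization (as surveyed in Chapter 2.3) and a hypothetical prompt template: the question and the answers predicted by pre-trained VQA models are concatenated into a single text prompt, and the target bounding box is serialized into discrete location tokens so that grounding can be trained with an ordinary language-modeling (cross-entropy) objective. The function names, template wording, and 1000-bin quantization below are illustrative assumptions, not the thesis implementation.

```python
# Minimal sketch (not the thesis code): serializing a visual-grounding query
# for a text-generation model, following the prompt-as-hint and
# coordinates-as-tokens ideas described in the abstract. All names, the prompt
# template, and the number of coordinate bins are illustrative assumptions.

from typing import List, Tuple

NUM_BINS = 1000  # assumed quantization granularity for coordinate tokens


def build_prompt(question: str, hints: List[str]) -> str:
    """Combine the question with answers predicted by pre-trained VQA models.

    The hint answers act as guidance ("prompts") for the grounding model;
    the exact template is an assumption, not the one used in the thesis.
    """
    hint_text = " ".join(f"hint: {h}" for h in hints)
    return f"question: {question} {hint_text} Which region does the text describe?"


def bbox_to_tokens(bbox: Tuple[float, float, float, float],
                   width: int, height: int) -> List[str]:
    """Quantize (x0, y0, x1, y1) into discrete location tokens.

    This mirrors a Pix2Seq-style sequence construction: the box becomes four
    tokens, so grounding reduces to next-token prediction under cross-entropy.
    """
    x0, y0, x1, y1 = bbox
    norm = [x0 / width, y0 / height, x1 / width, y1 / height]
    return [f"<bin_{min(int(v * NUM_BINS), NUM_BINS - 1)}>" for v in norm]


def tokens_to_bbox(tokens: List[str], width: int, height: int) -> List[float]:
    """Invert the quantization at inference time (bin centers are assumed)."""
    bins = [int(t.strip("<>").split("_")[1]) for t in tokens]
    scale = [width, height, width, height]
    return [(b + 0.5) / NUM_BINS * s for b, s in zip(bins, scale)]


if __name__ == "__main__":
    prompt = build_prompt(
        "What can be used to cut the paper?",
        hints=["scissors", "a pair of scissors"],  # e.g. answers from VQA models
    )
    target = bbox_to_tokens((120.0, 48.0, 300.0, 210.0), width=640, height=480)
    print(prompt)
    print(target)                              # e.g. ['<bin_187>', '<bin_100>', ...]
    print(tokens_to_bbox(target, 640, 480))    # recovered box coordinates
```

In this framing, adding, removing, or degrading hints only changes the text prompt, which is what the prompt study in Chapter 5 varies.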

Table of Contents:
摘要 (Abstract in Chinese)
Abstract
誌謝 (Acknowledgements)
Table of Contents
List of Tables
List of Figures
Chapter 1. Introduction
  1.1 The WSDM 2023 Toloka VQA Dataset Benchmark
  1.2 Two Types of Pipelines in the VG Task
    1.2.1 The Two-stage Pipeline Method
    1.2.2 The One-stage Pipeline Method
  1.3 Our Work and Contribution
Chapter 2. Related Work
  2.1 Prompt Tuning in Low-Resource Multi-modality Field: FewVLM
    2.1.1 Encoder-decoder Visual-language Model
    2.1.2 Pre-training Objective
    2.1.3 Low-resource Adaptation for Downstream Tasks
  2.2 The Input Format of SimVLM
  2.3 Pix2Seq Object Detection Framework
    2.3.1 Sequence Construction
    2.3.2 Architecture
    2.3.3 Objective
    2.3.4 Inference
Chapter 3. Methodology
  3.1 Architecture
  3.2 Motivation
  3.3 Data Pre-processing
  3.4 Prompt Tuning Module
  3.5 VG Module
    3.5.1 I/O
    3.5.2 Language Modeling Objective
  3.6 Conditional Trie-based Search (CTS)
Chapter 4. Experiment
  4.1 Dataset Description
  4.2 Implementation Details
  4.3 WSDM 2023 Toloka VQA Dataset Benchmark
    4.3.1 Discussion of Methods Employed by Different Teams
Chapter 5. Analysis
  5.1 Prompt Study
    5.1.1 Influence of the Number of Hints
    5.1.2 The Influence of the Quality of Hints
    5.1.3 The Influence of Instruction
  5.2 Ablation Study
  5.3 Interpretable Attention
    5.3.1 Text-to-Image Visualization
    5.3.2 Image-to-Text Visualization
    5.3.3 The Influence of Hints
    5.3.4 Same Image, Different Questions
    5.3.5 Unrealistic Images and Differentiating Interrogative Pronouns
    5.3.6 More Examples
Chapter 6. Discussion and Conclusion
References


Full-text access: On campus: available immediately; Off campus: available immediately