| Author: | 曾敬珊 Tseng, Ching-Shan |
|---|---|
| Thesis Title: | 關聯感知圖像說明用於可解釋之圖像問答 Relation-Aware Image Captioning for Explainable Visual Question Answering |
| Advisor: | 高宏宇 Kao, Hung-Yu |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering |
| Year of Publication: | 2021 |
| Academic Year: | 109 |
| Language: | English |
| Number of Pages: | 59 |
| Keywords: | Visual Question Answering, Image Captioning, Explainable AI, Cross-Modality Learning, Multi-Task Learning |
Visual Question Answering (VQA) is a task that spans computer vision and natural language processing. In this work, we approach it from three angles. The first concerns the extraction of visual features. In existing research, convolutional neural networks (CNNs) are widely used to extract local image features, and some studies use object detection models (such as Faster R-CNN) to select candidate regions containing objects as higher-level visual features. However, these approaches cannot handle questions that involve the relations and interactions among multiple objects. The second concerns the model's understanding of the connection between text and images. Mature techniques already exist for image understanding and for text understanding individually; to solve the VQA task, however, a model must also learn how to fuse information from the two modalities. The third concerns model explainability. Most existing VQA models do not consider explainability, so when a model produces a wrong answer it is difficult to explain why. Some studies treat signals such as attention-weight distributions as evidence that a model has learned to interpret the task correctly, but attention weights alone can only suggest whether the model identifies the keywords in the question or the image features most relevant to it; they cannot explain how the model understands and uses those features. To address these issues, we propose a new model architecture that builds a relation graph from the relative positions of the candidate regions in an image and uses a relation encoder based on graph neural networks to produce visual features that carry relational information. We also introduce image captioning for multi-task learning: the visual features emphasized by the VQA task are used to generate captions that serve as explanations of the model's predictions, and the generated captions in turn assist the prediction model. Experiments show that our model generates meaningful explanations conditioned on the question and image, and that adding the relation encoder and using the generated explanations to assist prediction each improve performance on different types of questions.
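To make the relation-graph idea above concrete, the following is a minimal sketch, not the thesis implementation: it classifies each ordered pair of detected regions into a coarse spatial relation from their bounding boxes and runs one relation-conditioned graph layer over the region features. The label set, function names (`spatial_relation_graph`, `RelationEncoder`), and hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of a spatial relation graph + relation-aware encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

def spatial_relation_graph(boxes, iou_thresh=0.5):
    """Classify each ordered region pair into a coarse spatial relation.

    boxes: (N, 4) tensor of [x1, y1, x2, y2].
    Returns an (N, N) long tensor of labels:
    0 = no edge (self), 1 = overlapping, 2 = subject left of object, 3 = right of.
    This label set is a simplification for illustration only.
    """
    N = boxes.size(0)
    rel = torch.zeros(N, N, dtype=torch.long)
    cx = (boxes[:, 0] + boxes[:, 2]) / 2  # box centers along x
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            # intersection-over-union between boxes i and j
            x1 = torch.max(boxes[i, 0], boxes[j, 0]); y1 = torch.max(boxes[i, 1], boxes[j, 1])
            x2 = torch.min(boxes[i, 2], boxes[j, 2]); y2 = torch.min(boxes[i, 3], boxes[j, 3])
            inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            area_j = (boxes[j, 2] - boxes[j, 0]) * (boxes[j, 3] - boxes[j, 1])
            iou = inter / (area_i + area_j - inter + 1e-6)
            if iou > iou_thresh:
                rel[i, j] = 1
            elif cx[i] < cx[j]:
                rel[i, j] = 2
            else:
                rel[i, j] = 3
    return rel

class RelationEncoder(nn.Module):
    """One relation-conditioned graph layer: each region aggregates its
    neighbours' features through a per-relation linear transform."""
    def __init__(self, dim, num_relations=4):
        super().__init__()
        self.rel_transforms = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_relations)])
        self.self_transform = nn.Linear(dim, dim)

    def forward(self, feats, rel):
        # feats: (N, dim) region features from the detector; rel: (N, N) relation labels.
        out = self.self_transform(feats)
        for r, transform in enumerate(self.rel_transforms):
            if r == 0:
                continue  # label 0 means "no edge"
            mask = (rel == r).float()                      # adjacency for relation r
            deg = mask.sum(dim=1, keepdim=True).clamp(min=1)
            out = out + mask @ transform(feats) / deg      # mean over neighbours of type r
        return F.relu(out)  # relation-aware region features
```

In practice the relation vocabulary and the graph layer would follow the thesis's design (e.g., richer geometric categories and attention-based aggregation); the sketch only shows how relative box positions can be turned into edges that a graph encoder consumes.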
Visual Question Answering (VQA) is a task that requires both computer vision and natural language processing. In this work, we address three issues. The first is the extraction of visual features. In recent years, convolutional neural networks (CNNs) have been widely used to extract local image features, and some studies leverage object detection models (such as Faster R-CNN) to select proposal regions containing objects as higher-level visual features. However, these methods ignore the correlations and interactions between multiple objects. The second issue is understanding the connection between text and images. Techniques for textual and visual understanding are individually mature, yet a plausible VQA model must also be able to fuse cross-modal information. The third issue is explainability. Most VQA models are not explainable, which means it is difficult to explain why a model returns a correct or wrong answer. Some research uses information such as attention weights as evidence that a model interprets its inputs correctly. To address these problems, we propose a new model architecture. Our model constructs a relation graph according to the relative positions between region pairs and then produces relation-aware visual features with a relation encoder. To generate explanations of the predicted results, we introduce an image captioning module and a multi-task training process.
Meanwhile, the generated captions are injected into the predictor to assist cross-modal understanding. Experiments show that our model generates meaningful answers and explanations conditioned on the questions and images, and that the relation encoder and the caption-attended predictor each improve performance on different types of questions.
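The multi-task setup described above can be sketched as follows, again as an assumption-laden illustration rather than the thesis's actual code: a caption decoder is trained jointly with the answer classifier, and a summary of the caption hidden states is fed back into the answer head. The class and function names (`CaptionAttendedVQA`, `multitask_loss`), the elementwise fusion, and the loss weighting are all placeholders.

```python
# Hypothetical sketch of joint VQA + captioning training with caption injection.
import torch
import torch.nn as nn

class CaptionAttendedVQA(nn.Module):
    def __init__(self, dim, vocab_size, num_answers):
        super().__init__()
        self.caption_decoder = nn.GRU(dim, dim, batch_first=True)  # stand-in caption decoder
        self.word_embed = nn.Embedding(vocab_size, dim)
        self.word_head = nn.Linear(dim, vocab_size)
        self.answer_head = nn.Linear(dim * 3, num_answers)

    def forward(self, question_feat, visual_feat, caption_tokens):
        # question_feat, visual_feat: (B, dim); visual_feat is the pooled
        # relation-aware representation. caption_tokens: (B, T) ground-truth caption ids.
        fused = question_feat * visual_feat                         # simple elementwise fusion
        cap_inputs = self.word_embed(caption_tokens[:, :-1])        # teacher forcing
        cap_hidden, _ = self.caption_decoder(cap_inputs, fused.unsqueeze(0))
        caption_logits = self.word_head(cap_hidden)                 # (B, T-1, vocab)

        # Inject a summary of the generated caption into the answer predictor.
        caption_summary = cap_hidden.mean(dim=1)                    # (B, dim)
        answer_logits = self.answer_head(
            torch.cat([question_feat, visual_feat, caption_summary], dim=-1))
        return answer_logits, caption_logits

def multitask_loss(answer_logits, answer_target, caption_logits, caption_tokens, alpha=1.0):
    # Weighted sum of the answer-classification and caption-generation losses;
    # the weighting here is a placeholder, not the scheme used in the thesis.
    vqa_loss = nn.functional.cross_entropy(answer_logits, answer_target)
    cap_loss = nn.functional.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_tokens[:, 1:].reshape(-1))
    return vqa_loss + alpha * cap_loss
```

The key design choice this illustrates is that the caption branch is not a detached post-hoc explainer: its hidden states re-enter the answer predictor, so the explanation and the answer are produced from a shared, jointly optimized representation.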