
Graduate Student: 曾敬珊 (Tseng, Ching-Shan)
Thesis Title: 關聯感知圖像說明用於可解釋之圖像問答 (Relation-Aware Image Captioning for Explainable Visual Question Answering)
Advisor: 高宏宇 (Kao, Hung-Yu)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Graduation Academic Year: 109
Language: English
Number of Pages: 59
Keywords (Chinese): 視覺問答 (Visual Question Answering), 圖像說明 (Image Captioning), 可解釋人工智慧 (Explainable AI), 跨情態學習 (Cross-Modality Learning), 多任務學習 (Multi-Task Learning)
Keywords (English): Visual Question Answering, Image Captioning, Explainable AI, Cross-Modality Learning, Multi-Task Learning
  • Abstract (Chinese): Visual Question Answering (VQA) is a task that spans computer vision and natural language processing. In this work, we approach the problem from three angles. The first concerns the extraction of visual features. In current research, convolutional neural networks (CNNs) are widely used to extract local image features, and some studies use object detection models (such as Faster R-CNN) to select region proposals containing objects as higher-level visual features. However, these approaches cannot handle questions that involve the relations and interactions among multiple objects. The second concerns the model's understanding of the connection between text and images. Mature techniques already exist for both image understanding and text understanding; to solve the VQA task, however, a model must also learn how to fuse information from these two modalities. The third concerns explainability. Most existing VQA models do not take explainability into account, so when a model produces a wrong answer it is difficult to explain why. Some studies treat information such as attention weight distributions as evidence that the model has learned to interpret the task correctly, but attention weights alone can only suggest whether the model picks up the keywords in the question or the image features most relevant to it; they still cannot explain how the model understands and uses those features. To address these issues, we propose a new model architecture that builds a relation graph from the relative positions of the region proposals in an image and uses a graph-network-based relation encoder to produce relation-aware visual features. We also introduce image captioning for multi-task learning: the visual features emphasized by VQA are used to generate captions that serve as explanations of the model's predictions, and the generated captions are in turn used to assist the prediction model. Experiments show that our model can generate meaningful explanations from the question and image, and that adding the relation encoder and using the generated explanations to assist prediction each improve performance on different types of questions.
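    To make the relation-graph construction concrete, the following Python sketch shows one way a spatial relation graph could be assembled from detected bounding boxes. The edge labels used here (covers, inside, and eight directional bins based on the angle between box centres) are illustrative assumptions for this sketch; the thesis defines its own relation categories in Section 3.4.1.

        import numpy as np

        def spatial_relation_graph(boxes):
            """Build a toy spatial relation graph from [x1, y1, x2, y2] boxes.

            Edge labels (covers / inside / directional bins) are illustrative,
            not the thesis's exact relation categories.
            """
            n = len(boxes)
            centers = [((x1 + x2) / 2.0, (y1 + y2) / 2.0) for x1, y1, x2, y2 in boxes]
            edges = {}  # (i, j) -> relation label
            for i in range(n):
                for j in range(n):
                    if i == j:
                        continue
                    xi1, yi1, xi2, yi2 = boxes[i]
                    xj1, yj1, xj2, yj2 = boxes[j]
                    if xi1 <= xj1 and yi1 <= yj1 and xi2 >= xj2 and yi2 >= yj2:
                        edges[(i, j)] = "covers"   # region i fully contains region j
                    elif xj1 <= xi1 and yj1 <= yi1 and xj2 >= xi2 and yj2 >= yi2:
                        edges[(i, j)] = "inside"   # region i lies inside region j
                    else:
                        # otherwise label the edge by the direction from i to j,
                        # using eight 45-degree bins of the centre-to-centre angle
                        dx = centers[j][0] - centers[i][0]
                        dy = centers[j][1] - centers[i][1]
                        angle = np.degrees(np.arctan2(dy, dx)) % 360.0
                        edges[(i, j)] = "dir_%d" % int(angle // 45)
            return edges

    For example, spatial_relation_graph([[0, 0, 100, 100], [20, 30, 60, 80]]) labels the edge from the first box to the second as "covers" and the reverse edge as "inside".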

    Abstract (English): Visual Question Answering (VQA) is a task that requires both computer vision and natural language processing. In this work, we address three issues. The first is the extraction of visual features. In recent years, convolutional neural networks (CNNs) have been widely used to extract local image features, and some research leverages object detection models (such as Faster R-CNN) to select region proposals containing objects as high-level visual features. However, these methods ignore the correlations and interactions between multiple objects. The second issue is understanding the connection between text and images. Although techniques for textual and visual understanding are now mature, a capable VQA model must also be able to fuse cross-modal information. The third issue is explainability. Most VQA models are not explainable, which means it is difficult to explain why a model returns a correct or wrong answer. Some research uses information such as attention weights as evidence that a model interprets its inputs correctly. To address these problems, we propose a new model architecture. Our model constructs a relation graph according to the relative positions between region pairs and then produces relation-aware visual features with a relation encoder. To generate explanations of the predicted results, we introduce an image captioning module and adopt a multi-task training process.
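    As a rough illustration of how such a relation encoder could consume the graph, the PyTorch layer below applies graph-attention-style message passing over region features, adding a learned scalar bias per relation type to the attention logits. The dimensions, the scalar-bias scheme, the residual update, and the reserved self-loop relation are all assumptions of this sketch rather than the thesis's exact GAT-based design (Section 3.4.2).

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class RelationEncoderLayer(nn.Module):
            """One graph-attention-style layer over region features (simplified sketch)."""

            def __init__(self, dim=512, num_relations=12):
                super().__init__()
                self.q = nn.Linear(dim, dim)
                self.k = nn.Linear(dim, dim)
                self.v = nn.Linear(dim, dim)
                self.rel_bias = nn.Embedding(num_relations, 1)  # scalar bias per relation type

            def forward(self, regions, rel_ids, adj_mask):
                # regions: (N, dim) region features
                # rel_ids: (N, N) long tensor of relation-type indices
                #          (diagonal assumed to index a reserved "self" relation)
                # adj_mask: (N, N) bool, True where an edge exists in the relation graph
                n = regions.size(0)
                adj_mask = adj_mask | torch.eye(n, dtype=torch.bool, device=regions.device)
                logits = self.q(regions) @ self.k(regions).t() / regions.size(-1) ** 0.5
                logits = logits + self.rel_bias(rel_ids).squeeze(-1)   # relation-aware bias
                logits = logits.masked_fill(~adj_mask, float("-inf"))  # attend along edges only
                attn = F.softmax(logits, dim=-1)
                return regions + attn @ self.v(regions)  # relation-aware region features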
    In addition to serving as explanations, the generated captions are injected back into the predictor to assist cross-modal understanding. Experiments show that our model generates meaningful answers and explanations for the given questions and images. Moreover, the relation encoder and the caption-attended predictor each bring improvements on different types of questions.
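    A minimal sketch of how the generated caption might be injected into the answer predictor and how the two tasks might be trained jointly is given below, again in PyTorch. The single cross-attention fusion step, the answer-vocabulary size, the binary-cross-entropy answer loss over soft VQA-v2 answer scores, and the caption-loss weight alpha are illustrative assumptions, not the thesis's exact configuration.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class CaptionAttendedPredictor(nn.Module):
            """Answer prediction that attends over generated caption tokens (sketch)."""

            def __init__(self, dim=512, num_answers=3129):  # 3129: a commonly used VQA-v2 answer vocab size
                super().__init__()
                self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
                self.classifier = nn.Sequential(
                    nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_answers))

            def forward(self, fused_qv, caption_emb):
                # fused_qv: (B, 1, dim) joint question-image feature
                # caption_emb: (B, T, dim) embeddings of the generated caption tokens
                cap_ctx, _ = self.attn(fused_qv, caption_emb, caption_emb)
                joint = torch.cat([fused_qv, cap_ctx], dim=-1).squeeze(1)
                return self.classifier(joint)  # answer logits

        def multitask_loss(answer_logits, answer_scores, caption_logits, caption_ids, alpha=1.0):
            # VQA-v2 provides soft answer scores, hence BCE on the answer logits;
            # alpha and the pad id 0 are assumed values for this sketch.
            vqa = F.binary_cross_entropy_with_logits(answer_logits, answer_scores)
            cap = F.cross_entropy(caption_logits.reshape(-1, caption_logits.size(-1)),
                                  caption_ids.reshape(-1), ignore_index=0)
            return vqa + alpha * cap

    A fixed weight between the two losses is only one simple choice; the point of the sketch is that the caption generator and the answer predictor share visual features and are optimized under one joint objective.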

    Table of Contents:
    摘要 (Chinese abstract) i
    Abstract ii
    Acknowledgements iii
    Table of Contents iv
    List of Tables vi
    List of Figures vii
    Chapter 1. Introduction 1
      1.1 Background 1
      1.2 Motivation 2
      1.3 Our work 4
    Chapter 2. Related Work 6
      2.1 Visual Question Answering 6
      2.2 Graph Networks 8
        2.2.1 Graph networks for vision-and-language tasks 9
        2.2.2 Graph networks for VQA 10
      2.3 Using textual descriptions to aid VQA 10
      2.4 Explainable AI 12
    Chapter 3. Methodology 15
      3.1 Overview 15
      3.2 Feature extraction 17
      3.3 Cross-modal attention 17
      3.4 Relation encoder 19
        3.4.1 Spatial relation graph 19
        3.4.2 GAT-based relation encoder 23
      3.5 Caption generator 26
      3.6 Caption-attended predictor 28
    Chapter 4. Experiment Result 31
      4.1 Dataset 31
        4.1.1 VQA-v2 dataset 32
        4.1.2 VQA-E dataset 32
      4.2 Evaluation metrics 34
        4.2.1 VQA prediction 34
        4.2.2 Explanation generation 35
      4.3 Implementation Details 37
      4.4 Performance comparison 38
        4.4.1 Pre-train stage 38
        4.4.2 Fine-tune stage 39
    Chapter 5. Analysis 41
      5.1 Explainability 41
      5.2 Effect of relation encoder 43
      5.3 Effect of caption-attended predictor 48
    Chapter 6. Conclusion 55
    References 56


    Full-text availability: on campus 2022-10-31; off campus 2022-10-31