| Author | 林穎憶 (Lin, Ying-Yi) |
|---|---|
| Thesis Title | 使用對抗學習語言模型改善圖片段落描述之流暢度 (Improving the Fluency of Generated Image Paragraph Descriptions Using Adversarial Learned Language Model) |
| Advisor | 劉任修 (Liu, Ren-Shiou) |
| Degree | Master |
| Department | Institute of Information Management, College of Management |
| Year of Publication | 2022 |
| Graduation Academic Year | 110 |
| Language | Chinese |
| Number of Pages | 49 |
| Keywords (Chinese) | 圖片段落描述生成, 深度學習, 對抗架構 |
| Keywords (English) | Image Paragraph Captioning, Deep Learning, Adversarial Learned Inference |
In the early days of computing, computers could handle only arithmetic or a limited set of instructions because of their size and hardware constraints. They gradually evolved to process text and images and to browse the Internet, and today machine learning applications give computers the ability to imitate human learning. Within machine learning, image recognition and natural language processing have long been popular research topics. A subfield called image paragraph captioning combines the two: where models once only classified the objects in an input image, they can now generate descriptions of the objects, scenery, and actions it contains. Although research on single-sentence image captioning has matured, a single sentence, because of its length and limitations, can hardly capture a semantically rich image in full. Image paragraph captioning arose in response to this problem; its goal is to generate a paragraph that describes an image with more precise and richer sentences. Research on image paragraph captioning is still scarce, and studies that specifically improve the fluency of the generated descriptions are rarer still.

This thesis builds its generator on the image paragraph captioning architecture of Krause et al. (2017) and adapts the adversarial framework of Liang et al. (2017) and the hybrid discriminator of Park et al. (2019) by adding a coherence discriminator to the hybrid discriminator, so that paragraph generation considers not only the semantics between the words of each sentence but also the fluency of every sentence.
Machine learning has given computers the ability to learn, and within this field image recognition and natural language processing remain popular research topics. Although research on generating single-sentence descriptions of images has matured, a single sentence, because of its length and limitations, can hardly describe a semantically rich image in full. Image paragraph captioning arose in response to this problem; its goal is to generate a paragraph that describes the image more accurately and richly.
In this paper, we use the paragraph captioning model of Krause et al. (2017) as the base for the generator, adapt the adversarial framework of Liang et al. (2017) and the hybrid discriminator of Park et al. (2019), and add a coherence discriminator to the hybrid discriminator so that paragraph generation considers not only the semantic relations between the words of a sentence but also the fluency of each sentence.
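To make the generator–discriminator setup described above more concrete, the following is a minimal PyTorch-style sketch of a hybrid discriminator with a coherence component: it scores a generated paragraph on whether it matches the image and on whether consecutive sentences form fluent transitions. All module names and dimensions (e.g., `HybridDiscriminator`, `img_dim`, the LSTM encoders, the equal weighting of the two critics) are illustrative assumptions for this sketch, not the architecture actually implemented in the thesis or in Krause et al. (2017), Liang et al. (2017), or Park et al. (2019).

```python
import torch
import torch.nn as nn


class HybridDiscriminator(nn.Module):
    """Illustrative hybrid discriminator with a coherence critic.

    All dimensions and module choices are assumptions for this sketch,
    not the thesis's actual architecture.
    """

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Encodes each sentence of the paragraph into a single vector.
        self.sent_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Encodes the sequence of sentence vectors into a paragraph vector.
        self.para_rnn = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        # Visual-consistency critic: does the paragraph match the image?
        self.visual_head = nn.Sequential(
            nn.Linear(hidden_dim + img_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))
        # Coherence critic: do adjacent sentences form fluent transitions?
        self.coherence_head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))

    def forward(self, sentences, img_feat):
        # sentences: (batch, n_sent, n_words) padded token ids
        # img_feat:  (batch, img_dim) pooled image feature
        b, n_sent, n_words = sentences.shape
        emb = self.embed(sentences.reshape(b * n_sent, n_words))
        _, (h, _) = self.sent_rnn(emb)                  # h: (1, b*n_sent, hidden)
        sent_vecs = h.squeeze(0).reshape(b, n_sent, -1)

        para_out, _ = self.para_rnn(sent_vecs)
        para_vec = para_out[:, -1]                      # paragraph summary vector

        # Image/paragraph consistency score.
        visual_score = self.visual_head(torch.cat([para_vec, img_feat], dim=-1))

        # Average score over all adjacent sentence pairs as a fluency proxy
        # (assumes each paragraph has at least two sentences).
        pairs = torch.cat([sent_vecs[:, :-1], sent_vecs[:, 1:]], dim=-1)
        coherence_score = self.coherence_head(pairs).mean(dim=1)

        # Combine the two critics; equal weighting is an assumption.
        return visual_score + coherence_score


# Hypothetical usage with random data.
if __name__ == "__main__":
    disc = HybridDiscriminator(vocab_size=10000)
    tokens = torch.randint(0, 10000, (2, 5, 12))  # 2 paragraphs, 5 sentences, 12 words
    img = torch.randn(2, 2048)
    print(disc(tokens, img).shape)                # torch.Size([2, 1])
```

In an adversarial setup of this kind, the generator would be trained to raise the discriminator's score on its own paragraphs, while the discriminator is trained to separate human-written paragraphs from generated ones; the coherence critic is what pushes the generator toward fluent sentence-to-sentence transitions rather than only image-relevant content.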
Agarwal, A. and Lavie, A. (2008). Meteor, m-bleu and m-ter: Evaluation metrics for high-correlation with human rankings of machine translation output. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 115–118.
Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155.
Chatterjee, M. and Schwing, A. G. (2018). Diverse and coherent paragraph generation from images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 729–744.
Che, W., Xiong, R., Fan, X., and Zhao, D. (2018). Paragraph generation network with visual relationship detection. MM 2018 - Proceedings of the 2018 ACM Multimedia Conference, 2:1435–1443.
Chen, H., Ding, G., Lin, Z., Guo, Y., Shan, C., and Han, J. (2021). Image captioning with memorized knowledge. Cognitive Computation, 13(4):807–820.
Dai, B., Fidler, S., Urtasun, R., and Lin, D. (2017). Towards diverse and natural image descriptions via a conditional gan. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2989–2998.
Gao, L., Li, X., Song, J., and Shen, H. T. (2019). Hierarchical lstms with adaptive attention for visual captioning. IEEE transactions on pattern analysis and machine intelligence, 42(5):1112–1131.
Girshick, R. (2015). Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Hossain, M. Z., Sohel, F., Shiratuddin, M. F., and Laga, H. (2019). A comprehensive survey of deep learning for image captioning. ACM Comput. Surv., 51(6).
Johnson, J., Karpathy, A., and Fei-Fei, L. (2016). Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Krause, J., Johnson, J., Krishna, R., and Fei-Fei, L. (2017). A hierarchical approach for generating descriptive image paragraphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., Bernstein, M., and Fei-Fei, L. (2016). Visual genome: Connecting language and vision using crowdsourced dense image annotations.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2323.
Li, R., Liang, H., Shi, Y., Feng, F., and Wang, X. (2020). Dual-cnn: A convolutional language decoder for paragraph image captioning. Neurocomputing, 396.
Liang, X., Hu, Z., Zhang, H., Gan, C., and Xing, E. P. (2017). Recurrent topic-transition gan for visual paragraph generation. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 3382–3391.
Mao, J., Xu, W., Yang, Y., Wang, J., and Yuille, A. L. (2015). Deep captioning with multimodal recurrent neural networks (m-rnn). In 2015 International Conference for Learning Representations(ICLR).
Pan, J.-Y., Yang, H.-J., Duygulu, P., and Faloutsos, C. (2004). Automatic image captioning. volume 3, pages 1987–1990.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
Park, J. S., Rohrbach, M., Darrell, T., and Rohrbach, A. (2019). Adversarial inference for multi-sentence video description. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6591–6601.
Qin, Y., Du, J., Zhang, Y., and Lu, H. (2019). Look back and predict forward in image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28.
Rennie, S. J., Marcheret, E., Mroueh, Y., Ross, J., and Goel, V. (2017). Self-critical sequence training for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7008–7024.
Scharenborg, O., Besacier, L., Black, A., Hasegawa-Johnson, M., Metze, F., Neubig, G., Stüker, S., Godard, P., Müller, M., Ondel, L., Palaskar, S., Arthur, P., Ciannella, F., Du, M., Larsen, E., Merkx, D., Riad, R., Wang, L., and Dupoux, E. (2020). Speech technology for unwritten languages. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:964–975.
Shetty, R., Rohrbach, M., Hendricks, L., Fritz, M., and Schiele, B. (2017). Speaking the same language: Matching machine to human captions by adversarial training. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 4155–4164.
Socher, R., Karpathy, A., Le, Q. V., Manning, C. D., and Ng, A. (2014). Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218.
Song, L., Liu, J., Qian, B., and Chen, Y. (2019). Connecting language to images: A progressive attention-guided network for simultaneous image captioning and language grounding. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):8885–8892.
Stefanini, M., Cornia, M., Baraldi, L., Cascianelli, S., Fiameni, G., and Cucchiara, R. (2021). From Show to Tell: A Survey on Image Captioning. pages 1–22.
Vedantam, R., Lawrence Zitnick, C., and Parikh, D. (2015). Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575.
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: A neural image caption generator. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3156–3164.
Wang, J., Pan, Y., Yao, T., Tang, J., and Mei, T. (2019a). Convolutional auto-encoding of sentence topics for image paragraph generation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI'19, pages 940–946.
Wang, W., Chen, Z., and Hu, H. (2019b). Hierarchical attention network for image captioning. Proceedings of the AAAI Conference on Artificial Intelligence, 33:8957– 8964.
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339–356.
Xu, C., Li, Y., Li, C., Ao, X., Yang, M., and Tian, J. (2020). Interactive key-value memory-augmented attention for image paragraph captioning. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3132–3142.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning (ICML).
Yang, L., Tang, K., Yang, J., and Li, L.-J. (2017). Dense captioning with joint inference and visual context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2193–2202.
Yang, Z., Yuan, Y., Wu, Y., Salakhutdinov, R., and Cohen, W. W. (2016). Encode, re- view, and decode: Reviewer module for caption generation. ArXiv, abs/1605.07912.