| Author: | 陳洛翔 Chen, Lo-Hsiang |
|---|---|
| Thesis title: | 2D-MapFormer: 2D-Map Transformer 於視聽場景感知對話與推理 / 2D-MapFormer: 2D-Map Transformer for Audio-Visual Scene-Aware Dialogue and Reasoning |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of publication: | 2023 |
| Academic year of graduation: | 111 |
| Language: | English |
| Number of pages: | 78 |
| Keywords (Chinese): | 對話系統、多模態學習、影片時刻檢索、多目標學習 |
| Keywords (English): | Dialogue systems, Multimodal learning, Video moment retrieval, Multi-objective learning |
As artificial intelligence technology continues to advance, techniques such as image recognition, multimodal understanding, and dialogue systems have steadily matured, fueling a surge of research on AI assistants. At this early stage of research, our goal is to enable machines to understand their surroundings and communicate with humans in natural language. Furthermore, to better understand how such machines reason and to broaden their applications, we want them not only to answer questions in natural language but also to provide evidence segments that justify their answers.
To date, no existing model has used evidence segments and textual labels jointly to train a system that can both locate evidence segments and answer questions. In this thesis, we therefore propose the 2D-MapFormer architecture, which combines the 2D-Map's ability to locate evidence segments with a Transformer encoder-decoder for text generation. By training on evidence segments and textual labels together, the two objectives reinforce each other.
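To make the architecture concrete, the sketch below illustrates the core 2D-Map idea (in the spirit of 2D-TAN) in PyTorch: a video of N clips yields an N×N map whose entry (i, j) scores the candidate segment spanning clips i through j, so the highest-scoring entry can be read off as the predicted evidence segment, and its pooled features could in turn condition a Transformer decoder. This is a minimal sketch under assumed names (`Toy2DMapScorer`, `score_head`, the feature shapes), not the thesis's actual implementation, and it shows only the map-scoring half of 2D-MapFormer.

```python
# Minimal, illustrative 2D temporal map scorer (assumed names, not the thesis code).
import torch
import torch.nn as nn

class Toy2DMapScorer(nn.Module):
    """Scores every candidate segment (i, j) of a clip sequence."""

    def __init__(self, d_model: int):
        super().__init__()
        self.score_head = nn.Linear(d_model, 1)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, num_clips, d_model), e.g. pre-extracted video/audio features.
        B, N, D = clip_feats.shape
        # Prefix sums allow O(1) average-pooling of any span i..j.
        csum = torch.cumsum(clip_feats, dim=1)
        csum = torch.cat([clip_feats.new_zeros(B, 1, D), csum], dim=1)  # (B, N+1, D)

        scores = clip_feats.new_full((B, N, N), float("-inf"))
        for i in range(N):
            for j in range(i, N):
                span_feat = (csum[:, j + 1] - csum[:, i]) / (j - i + 1)
                scores[:, i, j] = self.score_head(span_feat).squeeze(-1)
        return scores  # entry (i, j): how likely clips i..j are the evidence segment


if __name__ == "__main__":
    scorer = Toy2DMapScorer(d_model=16)
    feats = torch.randn(2, 8, 16)               # 2 videos, 8 clips each
    score_map = scorer(feats)                   # (2, 8, 8)
    flat = score_map.flatten(1).argmax(dim=-1)  # best segment per video
    best = [divmod(idx, 8) for idx in flat.tolist()]
    print(best)                                  # list of (start_clip, end_clip)
```

In a joint setup like the one described here, the score map could be supervised with the ground-truth evidence span while the decoder is supervised with the textual answer, which is the mutually reinforcing effect the abstract refers to.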
We evaluate text generation with BLEU-4, METEOR, ROUGE-L, and CIDEr, and evidence segment retrieval with IoU-1 and IoU-2. The experimental results demonstrate the effectiveness of introducing the 2D-Map into the Transformer model and the benefit of training with both evidence segments and textual labels, enabling the model to weigh the importance of individual segments and generate effective responses.
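For reference, the retrieval metrics above are built on the temporal IoU between a predicted evidence span and a ground-truth span: their overlap on the time axis divided by their union. The helper below is an illustrative assumption, not the thesis's evaluation code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start_sec, end_sec) spans."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))       # overlapping duration
    union = (pe - ps) + (ge - gs) - inter             # combined duration
    return inter / union if union > 0 else 0.0

# Example: predicted evidence 4.0-9.0 s vs. ground truth 5.0-10.0 s -> 4 / 6 ≈ 0.67
print(temporal_iou((4.0, 9.0), (5.0, 10.0)))
```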