Student: Li, Zong-Lin (李宗霖)
Thesis title: MapLlama: Using Multi-view Data Conversion and Fine-tuned Large Language Model for MapQA
Advisor: Chu, Wei-Ta (朱威達)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2024
Academic year of graduation: 112 (ROC calendar)
Language: English
Pages: 47
Keywords: Map question answering, Large language model, Multi-view model, Low-rank adaptation
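    Maps are a common tool for conveying information and are widely used in fields such as news, weather forecasting, and census reporting. People often turn visualized data into maps and answer geospatial questions about them, which requires the ability to analyze map objects and to perform mathematical reasoning. To address this challenge, we propose a two-stage map question-answering system. First, a map image is converted into structured map data by a model based on a Residual Neural Network (ResNet) and a Bidirectional Long Short-Term Memory network (BiLSTM). Then, based on this structured map data, a large language model (LLM) answers the given question, accomplishing map-related question answering. We use an LLM to generate a large number of questions that resemble real-world ones, and answer them through multi-view conversion and the LLM. Experimental results show that the LLM performs strongly on highly varied questions. We demonstrate that the proposed method is competitive: its performance is more stable, and in some aspects it even surpasses current state-of-the-art techniques.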
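As a concrete illustration of the second stage, here is one way structured map data could be serialized into a text table and wrapped in a prompt for an LLM. It is a minimal sketch that assumes choropleth-style output from the first stage (a title, legend bins, and the bin assigned to each region); the `MapData` class, the `map_to_table` and `build_prompt` helpers, and the prompt wording are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass
class MapData:
    """Hypothetical stage-1 output: a title, legend bins, and the bin assigned to each region."""
    title: str
    legend: list[str]
    regions: dict[str, str]  # region name -> legend bin


def map_to_table(data: MapData) -> str:
    """Serialize the structured map data as a pipe-separated table."""
    lines = [
        f"Title: {data.title}",
        f"Legend: {', '.join(data.legend)}",
        "Region | Value",
    ]
    for region in sorted(data.regions):  # sorted so the layout is deterministic
        lines.append(f"{region} | {data.regions[region]}")
    return "\n".join(lines)


def build_prompt(data: MapData, question: str) -> str:
    """Wrap the serialized table and the question into a single LLM prompt."""
    return (
        "Answer the question using only the map data below.\n\n"
        f"{map_to_table(data)}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    # Illustrative placeholder data, not from the MapQA dataset.
    demo = MapData(
        title="Example map",
        legend=["low", "mid", "high"],
        regions={"Region C": "mid", "Region A": "low", "Region B": "high"},
    )
    print(build_prompt(demo, "Which regions fall in the high bin?"))
```

Keeping the serialization deterministic (sorted regions, fixed header) means the same map always yields the same prompt, which helps if the LLM is later fine-tuned on such prompts.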

    Maps are a common medium for conveying rich information and are widely used in news, weather forecasts, and census research. People present data on maps and ask spatial questions about them; answering such questions usually requires analyzing the map and reasoning over its contents. In this work, we propose a two-stage map question-answering system (MapQA). First, map images are converted into structured map data by a model based on Residual Neural Networks (ResNet) and Bidirectional Long Short-Term Memory networks (BiLSTM). Then, a large language model (LLM) answers the given questions based on this structured map data. We use an LLM to generate a large number of diverse questions and answer them through multi-view transformations and the LLM. Experimental results show that the LLM performs strongly on diverse questions. We demonstrate that the proposed method is competitive, delivering more stable performance and, in some aspects, surpassing current state-of-the-art techniques.
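This record does not describe how the multi-view transformation (Multi-view Map Data Extraction, MMDE) builds its views. A minimal sketch, assuming that a "view" is simply an alternative textual arrangement of the same region-to-bin data, might render it by region, grouped by legend bin, and as per-bin counts, so that lookup, filtering, and counting questions each have a convenient layout. All function names here are hypothetical.

```python
from collections import defaultdict


def region_view(regions: dict[str, str]) -> str:
    """View 1 (hypothetical): one line per region, sorted alphabetically."""
    return "\n".join(f"{r}: {v}" for r, v in sorted(regions.items()))


def bin_view(regions: dict[str, str], legend: list[str]) -> str:
    """View 2 (hypothetical): regions grouped under each legend bin, in legend order."""
    groups: dict[str, list[str]] = defaultdict(list)
    for region, value in regions.items():
        groups[value].append(region)
    return "\n".join(
        f"{label}: {', '.join(sorted(groups[label])) or '(none)'}" for label in legend
    )


def count_view(regions: dict[str, str], legend: list[str]) -> str:
    """View 3 (hypothetical): number of regions in each legend bin."""
    counts = {label: 0 for label in legend}
    for value in regions.values():
        counts[value] = counts.get(value, 0) + 1
    return "\n".join(f"{label}: {counts[label]}" for label in legend)


def multi_view(regions: dict[str, str], legend: list[str]) -> str:
    """Concatenate every view into one context block for the LLM."""
    return "\n\n".join([
        "[By region]\n" + region_view(regions),
        "[By legend bin]\n" + bin_view(regions, legend),
        "[Counts per bin]\n" + count_view(regions, legend),
    ])


if __name__ == "__main__":
    # Illustrative placeholder data only.
    print(multi_view({"C": "high", "A": "low", "B": "high"}, ["low", "mid", "high"]))
```

The presumed benefit is that the LLM can read an answer off the matching view (for example, a count) instead of tallying rows itself, at the cost of a longer prompt.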

    Chinese Abstract
    Abstract
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1. Introduction
      1.1. Motivation
      1.2. Overview
      1.3. Map to Table
      1.4. Question Enhancement
      1.5. Fine-tuning Large Language Models
      1.6. Contributions
      1.7. Organization
    Chapter 2. Related Works
      2.1. Multi-view Model
      2.2. Chart Question Answering
      2.3. Large Language Models
      2.4. Table Question Answering
      2.5. Parameter-Efficient Fine-Tuning (PEFT)
        2.5.1. Adapter
        2.5.2. Low-Rank Adaptation (LoRA)
        2.5.3. Quantized Low-Rank Adaptation (QLoRA)
    Chapter 3. Method
      3.1. Map Question Answering
        3.1.1. MapQA dataset
        3.1.2. Two-stage models in the MapQA paper
        3.1.3. Performance of the MapQA paper
      3.2. Method
        3.2.1. Diverse Question
        3.2.2. Multi-view Map Data Extraction (MMDE)
        3.2.3. MapLlama
    Chapter 4. Experiment
      4.1. Task Description
      4.2. Dataset Descriptions
        4.2.1. MapQA Dataset
      4.3. Evaluation Metric
      4.4. Training Details
      4.5. Performance of the Map-to-Table task
      4.6. Performance of Question Answering
      4.7. Performance of question answering task on diverse questions
      4.8. Ablation Study
      4.9. Visualization of a map in a different view
    Chapter 5. Conclusion
      5.1. Conclusion
      5.2. Future Works
    References
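Low-rank adaptation appears in the keywords and in Section 2.5.2. In standard LoRA, a frozen weight W0 is augmented with a trainable update (alpha / r) · B A of rank r, where B starts at zero so training begins exactly from the pretrained model, and only A and B are updated. The NumPy sketch below shows that forward pass for a single linear layer; the class name, rank r = 4, scaling alpha = 8, and initialization scale are illustrative assumptions, not MapLlama's actual settings, which this record does not give.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer W0 plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, w0: np.ndarray, r: int = 4, alpha: float = 8.0, seed: int = 0):
        d_out, d_in = w0.shape
        rng = np.random.default_rng(seed)
        self.w0 = w0                                    # pretrained weight, kept frozen
        self.a = rng.normal(0.0, 0.01, size=(r, d_in))  # down-projection, random init
        self.b = np.zeros((d_out, r))                   # up-projection, zero init
        self.scale = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # h = W0 x + (alpha / r) * B (A x), with x of shape (batch, d_in)
        return x @ self.w0.T + self.scale * (x @ self.a.T) @ self.b.T

    def merged_weight(self) -> np.ndarray:
        """Fold the update into W0 so inference costs the same as the base layer."""
        return self.w0 + self.scale * self.b @ self.a

    def trainable_params(self) -> int:
        """Only A and B are trained; W0 is not counted."""
        return self.a.size + self.b.size


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w0 = rng.normal(size=(64, 32))
    layer = LoRALinear(w0, r=4, alpha=8.0)
    x = rng.normal(size=(3, 32))
    print(np.allclose(layer(x), x @ w0.T))  # B is zero at init, so the output matches W0
    print(layer.trainable_params(), w0.size)
```

For a 64 × 32 layer with r = 4, only 4·32 + 64·4 = 384 parameters are trainable instead of 2,048, which is the source of LoRA's memory savings during fine-tuning.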

