Author: Uboncharoen, Ployphailin (吳璐夢)
Title: A Workflow Framework for Prompt Selection and Evaluation in Large Language Models: Enhancing Academic Paper Analysis
Advisor: Harrison, John (哈里森約翰)
Degree: Master
Department: International Master's Program in Interdisciplinary Sustainability Studies
Year of Publication: 2026
Graduation Academic Year: 114 (ROC calendar)
Language: English
Number of Pages: 124
Keywords: Large Language Models (LLMs), Prompt Engineering, Academic Paper Summarization, Workflow Framework
Abstract:

    The rapid growth of academic publications in the fields of environmental and sustainability studies has created increasing challenges for students and researchers in comprehending large volumes of complex scholarly texts. Large Language Models (LLMs) have shown strong potential for supporting academic paper summarization; however, practical applications still face significant limitations, including hallucination (generating content not present in the source), inconsistent output structure, and insufficient alignment with source content. These issues are often attributed to the use of unstructured or underspecified prompts.
    This study proposes a workflow framework for prompt selection and evaluation, focusing on the design of an All-in-One (AIO) structured prompt that integrates the model's role, task instructions, output constraints, and a fixed output schema into a single unified prompt. This approach aims to improve output consistency, reliability, and reproducibility in LLM-based academic summarization.
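
    To make the AIO design concrete, the sketch below assembles such a prompt in Python. It is a minimal illustration only: the role text, constraints, and schema fields are assumptions chosen for demonstration, not the verbatim prompt used in this study, and the resulting string can be passed to any chat-style LLM.

        import json

        # Components of an All-in-One (AIO) structured prompt.
        # All wording below is illustrative, not the study's actual prompt.
        ROLE = ("You are an expert research assistant summarizing "
                "peer-reviewed environmental science papers.")
        TASK = ("Summarize the paper below strictly from its own content. "
                "Do not add information that does not appear in the text.")
        CONSTRAINTS = [
            "Use formal academic English.",
            "Keep each field under 100 words.",
            "Return valid JSON only, matching the schema exactly.",
        ]
        # Fixed output schema: every run must return the same fields.
        SCHEMA = {
            "title": "string",
            "objective": "string",
            "methods": "string",
            "key_findings": "string",
            "conclusion": "string",
        }

        def build_aio_prompt(paper_text: str) -> str:
            """Merge role, task, constraints, and schema into one unified prompt."""
            constraints = "\n".join(f"- {c}" for c in CONSTRAINTS)
            schema = json.dumps(SCHEMA, indent=2)
            return (f"{ROLE}\n\nTask: {TASK}\n\nConstraints:\n{constraints}\n\n"
                    f"Output schema (JSON):\n{schema}\n\nPaper:\n{paper_text}")

        print(build_aio_prompt("<full text of one article>"))

    Because the schema is fixed inside the prompt itself, every invocation is expected to yield the same set of fields, which is what makes the outputs directly comparable across the 100-article dataset.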
    An experimental evaluation was conducted using a dataset of 100 peer-reviewed environmental science articles published between 2016 and 2025. Three approaches were compared: a baseline prompt, the proposed AIO prompt, and an alternative workflow-based prompt. Output quality was quantitatively assessed using ROUGE and BERTScore metrics, followed by paired statistical significance testing.
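
    The evaluation protocol can be reproduced roughly as sketched below, assuming the rouge-score and bert-score Python packages and SciPy. The Wilcoxon signed-rank test stands in for the paired significance test, whose exact form the abstract does not specify, and three toy documents stand in for the 100-article dataset.

        from rouge_score import rouge_scorer          # pip install rouge-score
        from bert_score import score as bert_score    # pip install bert-score
        from scipy.stats import wilcoxon              # pip install scipy

        # One entry per article in practice; toy strings shown here.
        references = [
            "Rising temperatures reduce wetland biodiversity.",
            "Microplastics accumulate in river sediments.",
            "Reforestation increases regional carbon storage.",
        ]
        baseline_out = [
            "Temperatures rise and wetlands change.",
            "Plastics are found in rivers.",
            "Planting trees stores carbon.",
        ]
        aio_out = [
            "Rising temperatures reduce biodiversity in wetlands.",
            "Microplastics accumulate in sediments of rivers.",
            "Reforestation increases carbon storage regionally.",
        ]

        rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                         use_stemmer=True)

        def rouge_f1(candidates, refs, metric):
            """Per-document ROUGE F1 scores for one system."""
            return [rouge.score(ref, cand)[metric].fmeasure
                    for ref, cand in zip(refs, candidates)]

        base_r1 = rouge_f1(baseline_out, references, "rouge1")
        aio_r1 = rouge_f1(aio_out, references, "rouge1")

        # BERTScore returns per-document precision/recall/F1 tensors.
        _, _, base_bs = bert_score(baseline_out, references, lang="en")
        _, _, aio_bs = bert_score(aio_out, references, lang="en")

        # Paired test over per-document score differences (the full 100-document
        # sample is needed for a meaningful p-value; three documents show shape only).
        _, p_r1 = wilcoxon(aio_r1, base_r1)
        _, p_bs = wilcoxon(aio_bs.tolist(), base_bs.tolist())
        print(f"ROUGE-1: AIO {sum(aio_r1)/len(aio_r1):.3f} vs baseline "
              f"{sum(base_r1)/len(base_r1):.3f} (Wilcoxon p = {p_r1:.3f})")
        print(f"BERTScore F1: AIO {aio_bs.mean().item():.3f} vs baseline "
              f"{base_bs.mean().item():.3f} (Wilcoxon p = {p_bs:.3f})")

    The paired design matters here: because all three prompting approaches summarize the same articles, per-document score differences control for article difficulty rather than comparing unrelated samples.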
    The results demonstrate that the AIO structured prompt significantly outperforms both the baseline prompt and the alternative workflow-based prompt in terms of lexical overlap, semantic similarity, and structural consistency. These findings highlight the critical role of prompt engineering and workflow design in controlling LLM behavior for academic paper analysis. This research provides a practical and replicable framework for integrating LLMs into academic workflows while maintaining methodological rigor and evaluability.

Table of Contents

ABSTRACT ii
摘要 (Chinese Abstract) iii
ACKNOWLEDGEMENTS iv
LIST OF TABLES ix
LIST OF FIGURES x
LIST OF SYMBOLS AND ABBREVIATIONS xi
CHAPTER 1 INTRODUCTION 12
  1.1 Introduction 12
  1.2 Research Motivation 13
  1.3 Research Purpose 14
  1.4 Background of the Case 15
  1.5 Research Framework 17
CHAPTER 2 LITERATURE REVIEW 19
  2.1 Introduction to Large Language Models (LLMs) 19
    2.1.1 Historical Development of NLP and Deep Learning 19
    2.1.2 Emergence of LLMs 20
    2.1.3 Applications of LLMs in Academic Research 21
    2.1.4 Challenges and Limitations of LLMs 22
  2.2 Prompt Engineering: Concepts and Approaches 23
    2.2.1 Definition and Importance of Prompt Engineering 23
    2.2.2 Types of Prompts 23
    2.2.3 Baseline Prompts in Prior Research 25
    2.2.4 Integrated and All-in-One Prompts 25
    2.2.5 Prompt Engineering Workflows in Prior Studies 26
    2.2.6 Limitations of Existing Prompt-Based Workflows 28
  2.3 Academic Paper Analysis with LLMs and NLP 29
    2.3.1 Metadata Extraction from Academic Articles 29
    2.3.2 Summarization of Academic Papers 30
  2.4 Data Preprocessing for LLMs 30
    2.4.1 The Importance of Data Preprocessing for LLMs 30
    2.4.2 Techniques and Procedures of Data Preprocessing for LLMs 31
  2.5 Evaluation Metrics for Summarization and Extraction 34
    2.5.1 ROUGE Metrics 34
    2.5.2 BERTScore 34
    2.5.3 Alternative Metrics 35
    2.5.4 Automated vs. Human Evaluation 36
  2.6 Applications in Environmental Science Research 37
    2.6.1 Challenges in Environmental Science Literature 37
    2.6.2 Use of NLP and ML in Environmental Science 38
    2.6.3 Importance of Reproducible Workflows 39
  2.7 Synthesis 40
CHAPTER 3 RESEARCH DESIGN 42
  3.1 Research Design 42
  3.2 Research Hypotheses 43
  3.3 Data Collection 44
    3.3.1 Inclusion Criteria 44
    3.3.2 Exclusion Criteria 45
    3.3.3 Sampling 45
  3.4 User Survey as a Data Collection Method 47
    3.4.1 Participants 47
    3.4.2 Survey Instrument 47
    3.4.3 Data Collection Procedure 48
    3.4.4 Survey Data Analysis 48
    3.4.5 Role of the Survey within the Research Framework 49
  3.5 Data Preprocessing 49
    3.5.1 Overview of Workflow 49
    3.5.2 PDF to Text Extraction 49
    3.5.3 Text Cleaning and Normalization 50
  3.6 Text Extraction Quality Assessment 50
  3.7 Prompting Workflow 52
    3.7.1 Input Processing 53
    3.7.2 Prompt Construction 54
    3.7.3 Model Invocation 57
    3.7.4 Output Structuring and Evaluation 57
  3.8 Model Selection 60
  3.9 Data Analysis 61
CHAPTER 4 RESEARCH FINDINGS AND DISCUSSION 64
  4.1 Overview of Experimental Results 64
  4.2 Quantitative Results 64
    4.2.1 BERTScore Performance 64
    4.2.2 ROUGE-1 Performance 67
    4.2.3 ROUGE-2 Performance 70
    4.2.4 ROUGE-L Performance 73
  4.3 Statistical Significance Testing 76
  4.4 Hypothesis Testing Results 78
    4.4.1 Results for Hypothesis H1 78
    4.4.2 Results for Hypothesis H2 79
  4.5 Survey Results: Baseline Prompting Behaviors 80
    4.5.1 Respondent Profile 80
    4.5.2 Baseline Prompting Behaviors 80
    4.5.3 Accuracy and Content Expectations 81
    4.5.4 Use of Structured Prompts and Interest in All-in-One Prompts 81
    4.5.5 Design Implications for the Proposed Workflow 81
  4.6 Discussion 82
CHAPTER 5 CONCLUSION AND SUGGESTIONS 84
  5.1 Conclusions 84
  5.2 Contributions of the Study 85
  5.3 Limitations 85
  5.4 Recommendations for Future Research 85
REFERENCES 86
APPENDIX 91

