| Graduate Student: | 吳璐夢 Uboncharoen, Ployphailin |
|---|---|
| Thesis Title: | A Workflow Framework for Prompt Selection and Evaluation in Large Language Models: Enhancing Academic Paper Analysis |
| Advisor: | 哈里森約翰 Harrison, John |
| Degree: | Master's |
| Department: | Other - International Master's Program in Interdisciplinary Sustainability Studies |
| Year of Publication: | 2026 |
| Academic Year of Graduation: | 114 |
| Language: | English |
| Number of Pages: | 124 |
| Keywords: | Large Language Models (LLMs), Prompt Engineering, Academic Paper Summarization, Workflow Framework |
The rapid growth of academic publications in the fields of environmental and sustainability studies has created increasing challenges for students and researchers in comprehending large volumes of complex scholarly texts. Large Language Models (LLMs) have shown strong potential for supporting academic paper summarization; however, practical applications still face significant limitations, including hallucination, inconsistent output structure, and insufficient alignment with source content. These issues are often attributed to the use of unstructured or underspecified prompts.
This study proposes a workflow framework for prompt selection and evaluation, centered on the design of an All-in-One (AIO) structured prompt that integrates the model role, task instructions, output constraints, and a fixed output schema into a single unified prompt. This approach aims to improve the consistency, reliability, and reproducibility of LLM-based academic summarization.
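To make the AIO design concrete, the sketch below assembles such a prompt in Python. It is a minimal illustration under stated assumptions, not the thesis's verbatim prompt: the role wording, constraints, schema fields, and the helper `build_aio_prompt` are hypothetical placeholders that simply follow the four components named above (role, task instructions, output constraints, fixed output schema).

```python
# Minimal illustrative sketch of an All-in-One (AIO) structured prompt.
# The role text, constraints, and JSON schema fields are hypothetical
# placeholders, not the prompt used in the thesis.

AIO_PROMPT_TEMPLATE = """\
ROLE: You are an expert reviewer of environmental science literature.

TASK: Summarize the article below for a graduate-student audience.

CONSTRAINTS:
- Use only information stated in the article; do not add outside facts.
- Keep the summary under 250 words.
- Return valid JSON matching the schema exactly; no extra keys or prose.

OUTPUT SCHEMA (JSON):
{{
  "title": "<article title>",
  "objective": "<one-sentence research objective>",
  "methods": "<2-3 sentence description of methods>",
  "key_findings": ["<finding 1>", "<finding 2>"],
  "limitations": "<one-sentence limitation>"
}}

ARTICLE:
{article_text}
"""

def build_aio_prompt(article_text: str) -> str:
    """Fill the single unified prompt with the source article text."""
    return AIO_PROMPT_TEMPLATE.format(article_text=article_text)
```

Because every run shares the same role, constraints, and schema, variation across outputs is limited to the article content itself, which is the mechanism behind the consistency and reproducibility claims above.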
An experimental evaluation was conducted using a dataset of 100 peer-reviewed environmental science articles published between 2016 and 2025. Three approaches were compared: a baseline prompt, the proposed AIO prompt, and an alternative workflow-based prompt. Output quality was quantitatively assessed using ROUGE and BERTScore metrics, followed by paired statistical significance testing.
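The sketch below shows one way such an evaluation could be wired together, assuming the `rouge-score`, `bert-score`, and `scipy` packages; the thesis does not name its exact tooling, and the helpers `rouge_l_f1` and `compare_prompts` are hypothetical. A Wilcoxon signed-rank test stands in here as one common choice of paired significance test, since it makes no normality assumption about the per-article score differences.

```python
# Sketch of the evaluation step: score each generated summary against its
# reference, then run a paired significance test between two prompt
# variants on the same set of articles. Package choices are assumptions.

from rouge_score import rouge_scorer
from bert_score import score as bert_score
from scipy.stats import wilcoxon

def rouge_l_f1(references, candidates):
    """ROUGE-L F1 for each (reference, candidate) pair."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return [
        scorer.score(ref, cand)["rougeL"].fmeasure
        for ref, cand in zip(references, candidates)
    ]

def compare_prompts(references, baseline_outputs, aio_outputs):
    """Paired comparison of two prompt variants on the same articles."""
    base = rouge_l_f1(references, baseline_outputs)
    aio = rouge_l_f1(references, aio_outputs)

    # Semantic similarity via BERTScore; returns precision, recall,
    # and F1 as torch tensors, one entry per summary.
    _, _, f1_base = bert_score(baseline_outputs, references, lang="en")
    _, _, f1_aio = bert_score(aio_outputs, references, lang="en")

    # Non-parametric paired test on per-article ROUGE-L scores;
    # the Wilcoxon signed-rank test avoids assuming normal deltas.
    stat, p_value = wilcoxon(aio, base)
    return {
        "rougeL_baseline_mean": sum(base) / len(base),
        "rougeL_aio_mean": sum(aio) / len(aio),
        "bertscore_baseline_mean": f1_base.mean().item(),
        "bertscore_aio_mean": f1_aio.mean().item(),
        "wilcoxon_p": p_value,
    }
```

Pairing the test on per-article scores, rather than comparing group means, controls for article-level difficulty: each article serves as its own baseline across the three prompting approaches.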
The results demonstrate that the AIO structured prompt significantly outperforms both the baseline approach and the alternative workflow in terms of lexical overlap, semantic similarity, and structural consistency. These findings highlight the critical role of prompt engineering and workflow design in controlling LLM behavior for academic paper analysis. The research provides a practical, replicable framework for integrating LLMs into academic workflows while maintaining methodological rigor and evaluability.