| Graduate Student: | 袁倫祥 Yuan, Lun-Hsiang |
|---|---|
| Thesis Title: | The In-depth Comparative Analysis of Four Large Language AI Models for Risk Assessment and Information Retrieval from Multi-Modality Prostate Cancer Work-up Reports |
| Advisor: | 周鼎贏 Chou, Dean |
| Degree: | Doctoral |
| Department: | College of Engineering - Department of BioMedical Engineering |
| Year of Publication: | 2025 |
| Graduating Academic Year: | 113 |
| Language: | English |
| Pages: | 182 |
| Keywords: | Prostate Cancer, Large Language Models, Artificial Intelligence, Information Retrieval, Risk Assessment, Clinical Decision Support System |
Background and Objectives:
Information retrieval (IR) and risk assessment (RA) from multi-modality imaging and pathology reports are essential to the management of prostate cancer (PC). This study aimed to evaluate the performance of four general-purpose large language models (LLMs) on IR and RA tasks.
Materials and Methods:
This study used simulated textual reports of computed tomography (CT), magnetic resonance imaging (MRI), bone scan, and biopsy pathology from patients with stage IV prostate cancer. We evaluated four LLMs (ChatGPT-3.5-turbo, Claude-3-opus, ChatGPT-4-turbo, and Gemini-pro-1.0) on three RA tasks (LATITUDE, CHAARTED, and TwNHI) and seven IR tasks. These tasks covered TNM staging and the detection and quantification of bone and visceral metastases, providing a comprehensive assessment of the LLMs' ability to handle diverse clinical data. The models were queried via API with zero-shot chain-of-thought prompting; performance was assessed using repeated single-turn queries and ensemble voting, with the consensus of three adjudicators serving as the gold standard and six evaluation metrics employed.
Preliminary Results:
Among the simulated reports of 350 stage IV PC patients, 115 (32.8%), 128 (36.5%), and 94 (27%) fell into the LATITUDE, CHAARTED, and TwNHI high-risk groups, respectively. Ensemble voting over three repeated single-turn queries consistently improved accuracy and was more likely than a single query to achieve non-inferior results. The four models differed only slightly on the IR tasks, with high accuracy (87.4%-94.2%) and consistency (ICC > 0.8) in TNM staging. On the RA tasks, however, the models differed significantly, ranking as follows: ChatGPT-4-turbo > Claude-3-opus > Gemini-pro-1.0 > ChatGPT-3.5-turbo. ChatGPT-4-turbo achieved the highest accuracy (90.1%, 90.7%, 91.6%) and consistency (ICC 0.86, 0.93, 0.76) across the three RA tasks; in addition, its high negative predictive value helps rule out high-risk patients.
Preliminary Conclusion:
ChatGPT-4-turbo demonstrated satisfactory accuracy and results on RA and IR tasks for stage IV PC, indicating its potential for clinical decision support. Nevertheless, the risk that misinterpretation could affect decision-making cannot be ignored, and further research is needed to validate these results in other cancers.
Background and Objectives
Accurate information retrieval (IR) and risk assessment (RA) utilizing pathology reports and multi-modality imaging are critical components in the management of prostate cancer (PC). This research assesses the performance of four general-purpose large language models (LLMs) on IR and RA tasks.
Material and Methods
This investigation leveraged synthetic textual reports generated from magnetic resonance imaging (MRI), computed tomography (CT), biopsy pathology, and bone scan data obtained from patients diagnosed with stage IV prostate cancer (PC). The efficacy of four large language models (LLMs) (ChatGPT-3.5-turbo, Claude-3-opus, ChatGPT-4-turbo, and Gemini-pro-1.0) was evaluated over seven information retrieval (IR) tasks and three risk assessment (RA) tasks (LATITUDE, CHAARTED, and TwNHI). These tasks encompassed TNM staging, along with the identification and measurement of both visceral and bone metastases, to comprehensively gauge the LLMs' capacity to process a wide array of clinical data types. The models were prompted via API using a zero-shot chain-of-thought approach. Performance was gauged using repeated single-query trials and ensemble voting; the consensus of a panel of three expert adjudicators served as the benchmark standard, and six distinct outcome metrics were employed for evaluation.
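The prompting and voting pipeline described above (zero-shot chain-of-thought, repeated single-turn queries, majority voting) can be sketched as follows. This is a minimal illustration, not the thesis's actual code: `query_llm`, its prompt wording, and the fixed placeholder return value are assumptions so the sketch runs offline.

```python
from collections import Counter

def query_llm(report_text: str, task: str) -> str:
    """Hypothetical stand-in for one single-turn API call to an LLM."""
    # Zero-shot chain-of-thought: no exemplars, only a reasoning trigger.
    prompt = (
        f"Report:\n{report_text}\n\n"
        f"Task: {task}\n"
        "Let's think step by step, then answer HIGH or LOW."
    )
    # ... in practice, send `prompt` to the model's API and parse the answer ...
    return "HIGH"  # fixed placeholder so the sketch is runnable

def ensemble_vote(report_text: str, task: str, n_queries: int = 3) -> str:
    """Repeat the same single-turn query and majority-vote the answers."""
    answers = [query_llm(report_text, task) for _ in range(n_queries)]
    return Counter(answers).most_common(1)[0][0]
```

With an odd number of repeats (three in the study), the majority vote always produces a unique binary answer, which is why repeated single-query trials can stabilize an otherwise stochastic model output.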
Preliminary Results
Analysis of simulated reports from 350 patients with stage IV PC indicated that 115 (32.8%), 128 (36.5%), and 94 (27%) were categorized as high-risk according to the LATITUDE, CHAARTED, and TwNHI criteria, respectively. Ensemble voting over three repeated single-query evaluations consistently improved accuracy and was more likely than a single query to yield non-inferior results. The four LLMs differed only marginally on the information retrieval (IR) tasks, with high consistency (ICC > 0.8) and accuracy (87.4%-94.2%) in determining TNM stage. However, risk assessment (RA) performance varied significantly among the models, ranking as follows: ChatGPT-4-turbo > Claude-3-opus > Gemini-pro-1.0 > ChatGPT-3.5-turbo. Notably, ChatGPT-4-turbo demonstrated the highest accuracy (90.1%, 90.7%, and 91.6%) and consistency (ICC 0.86, 0.93, and 0.76) across the three RA tasks. Its high negative predictive value is potentially valuable for ruling out a high-risk classification.
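The accuracy and negative-predictive-value (NPV) figures reported above follow the standard confusion-matrix definitions against the adjudicator consensus. A minimal sketch with hypothetical counts (illustrative only, not the thesis data):

```python
# Hypothetical confusion-matrix counts for one binary RA task,
# scored against the three-adjudicator consensus (gold standard).
tp = 90   # consensus high-risk, model predicted high-risk
fp = 8    # consensus low-risk,  model predicted high-risk
fn = 10   # consensus high-risk, model predicted low-risk
tn = 242  # consensus low-risk,  model predicted low-risk

accuracy = (tp + tn) / (tp + fp + fn + tn)
# A high NPV means a "low-risk" call reliably rules out high-risk status.
npv = tn / (tn + fn)

print(f"accuracy={accuracy:.3f}, NPV={npv:.3f}")  # accuracy=0.949, NPV=0.960
```

A high NPV is what makes the model useful as a screening aid: most patients it labels low-risk genuinely are, so clinician review can focus on the flagged cases.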
Preliminary Conclusion
ChatGPT-4-turbo has shown encouraging accuracy and effectiveness in information retrieval (IR) and risk assessment (RA) tasks for stage IV prostate cancer (PC), indicating its potential value in clinical decision support. However, the risk that misinterpreted information could alter clinical decisions must be considered. Further studies are required to confirm whether these results generalize to other types of cancer.