| Graduate student: | Chen, Yu-Da (陳昱達) |
|---|---|
| Thesis title: | MDS-UPDRS Finger Tapping Test Evaluation Comparing ChatGPT-4 and Two Neurologists |
| Advisor: | Torbjörn Nordling (吳馬丁) |
| Degree: | Master |
| Department: | Department of Mechanical Engineering, College of Engineering |
| Year of publication: | 2025 |
| Academic year of graduation: | 113 |
| Language: | English |
| Number of pages: | 166 |
| Keywords: | Parkinson's disease, finger-tapping test, severity assessment, Large Language Models, ChatGPT-4, Wasserstein Distance, intra-rater reliability, inter-rater reliability |
Introduction: Parkinson's disease (PD) is a progressive neurodegenerative disorder that significantly impacts global disability-adjusted life years; according to the Global Burden of Disease Study 2021, over 11 million people worldwide have been diagnosed with it. A key diagnostic marker of PD, bradykinesia, can be quantified using the Movement Disorder Society-Sponsored Revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS) Finger Tapping (FT) test. Bradykinesia is characterized by slowness in initiating and executing movements, which the FT test assesses through speed, amplitude, and rhythm. Currently, these scores are assigned by clinical neurologists, which is costly, time-consuming, and prone to subjectivity. General-purpose large language models (LLMs) set new records each quarter on various analytical and logical benchmarks, in some cases even surpassing human performance. It is therefore of interest to test how well LLMs can score the FT test.
Objective: This study aims to test how well a general-purpose LLM, ChatGPT-4 (version: gpt-4-turbo-2024-04-09), can score the FT test from instructions with and without examples, and to compare its performance with that of clinical neurologists.
Methods: The study enrolled 16 subjects with PD and recorded 53 finger-tapping test videos using smartphones (720p/240 FPS). These videos were independently evaluated by two neurologists to establish ground-truth scores and to assess human inter-rater reliability. Concurrently, ChatGPT-4 performed 10 independent severity evaluations of each video based on the MDS-UPDRS FT instruction and time-series trajectories of markers placed on the index finger and thumb, extracted from a single smartphone video using CoTracker. Four experiments were conducted to test the model's performance: the first used a 10-second segment of each recording starting 2.9-9.4 s (mean 4.0 s) after the start of the movement (Exp. A); the second added Python code identifying tapping features based on extrema (Exp. B); the third focused on the initial 10 taps to assess early movement characteristics (Exp. C); and the fourth included the four poorest-performing cases from Exp. C, together with the neurologists' scores, as examples to test whether few-shot learning improves model performance (Exp. D).
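The abstract does not reproduce the thesis's actual extrema-based feature code, but the idea in Exp. B can be sketched as follows. This is a minimal illustration, assuming a thumb-index fingertip distance signal and a hypothetical `tap_features` helper built on `scipy.signal.find_peaks`:

```python
import numpy as np
from scipy.signal import find_peaks

def tap_features(distance, fps=240):
    """Extract per-tap speed/amplitude/rhythm summaries from a
    thumb-index fingertip distance signal (hypothetical sketch)."""
    distance = np.asarray(distance, dtype=float)
    # Tap apertures are local maxima of the distance signal; a prominence
    # threshold relative to the signal range suppresses jitter extrema.
    prominence = 0.1 * (distance.max() - distance.min())
    maxima, _ = find_peaks(distance, prominence=prominence)

    amplitudes = distance[maxima]        # opening amplitude at each tap
    intervals = np.diff(maxima) / fps    # seconds between consecutive taps
    return {
        "n_taps": len(maxima),
        "mean_amplitude": float(amplitudes.mean()),
        "amplitude_decrement": float(amplitudes[-1] - amplitudes[0]),
        "mean_interval_s": float(intervals.mean()),
        # Coefficient of variation of inter-tap intervals as a rhythm measure.
        "rhythm_cv": float(intervals.std() / intervals.mean()),
    }
```

Such a summary (tap count, amplitude decrement, interval variability) is one plausible way to hand the model compact, clinically interpretable numbers instead of a raw 240 FPS trajectory.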
Results: In Exp. D, ChatGPT-4 achieved an intra-rater reliability of 0.45, measured with the Intraclass Correlation Coefficient ICC(2,1), surpassing the neurologists' inter-rater reliability of ICC(2,1) 0.37 and highlighting its potential for consistent evaluations. The influence of the experimental designs on performance was as follows: Exp. A, with a Wasserstein distance (WD) of 1.22, ICC(2,1) of 0.05, and accuracy of 18.2 %, was indistinguishable from random guessing. Exp. B incorporated Python-based feature detection and improved significantly, with WD 0.79, ICC(2,1) 0.16, and accuracy 22.7 %. Exp. C focused on the initial 10 taps, emphasizing early movements, and achieved significantly better accuracy: WD 0.58, ICC(2,1) 0.21, and accuracy 50.0 %. Exp. D employed few-shot learning with four examples and achieved WD 0.63, ICC(2,1) 0.45, and accuracy 61.1 % on the remaining test cases, demonstrating a substantial improvement in reliability through few-shot learning. The inter-rater reliability between ChatGPT-4 and the neurologists in Exp. D (ICC(3,k) = 0.85) is higher than the values reported between neurologists in several past studies. However, differences in subject characteristics and rater backgrounds across datasets should be taken into account when interpreting this comparison.
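The two headline metrics, ICC(2,1) and the Wasserstein distance, can be computed directly from a ratings matrix. A minimal sketch, assuming score arrays rather than the thesis's actual evaluation pipeline (the `icc_2_1` helper follows the Shrout and Fleiss two-way random-effects, absolute-agreement, single-rater definition):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    ratings: (n_subjects, k_raters) array (Shrout & Fleiss, 1979)."""
    Y = np.asarray(ratings, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    row_means = Y.mean(axis=1)   # per-subject means
    col_means = Y.mean(axis=0)   # per-rater means
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)   # between-subjects MS
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)   # between-raters MS
    sse = ((Y - row_means[:, None] - col_means[None, :] + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))                        # residual MS
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# WD between a model's score distribution and the ground truth (0-4 scale)
# is simply wasserstein_distance(model_scores, ground_truth_scores).
```

A systematic offset between two raters (one always scoring one point higher) illustrates why ICC(2,1) penalizes absolute disagreement even when the ranking is identical.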
Conclusion: Including specific examples in Exp. D substantially improved the intra-rater reliability and accuracy of ChatGPT-4. These results demonstrate that, when guided by well-designed inputs, ChatGPT-4 can produce scoring outcomes comparable to those of clinical neurologists.
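The few-shot setup of Exp. D amounts to prepending scored examples to the query. The sketch below is illustrative only; the `build_few_shot_prompt` helper and the prompt wording are hypothetical, not the thesis's actual prompt:

```python
def build_few_shot_prompt(examples, query_features, instruction):
    """Assemble a few-shot scoring prompt: the MDS-UPDRS FT instruction,
    then labelled examples (features + neurologist score), then the query.
    Hypothetical sketch; not the thesis's actual prompt."""
    parts = [instruction]
    for i, (features, score) in enumerate(examples, 1):
        parts.append(
            f"Example {i}:\nFeatures: {features}\nMDS-UPDRS FT score: {score}"
        )
    parts.append(
        f"Now rate this case:\nFeatures: {query_features}\nMDS-UPDRS FT score:"
    )
    return "\n\n".join(parts)
```

In Exp. D, the four examples would be the poorest-performing Exp. C cases paired with the neurologists' scores, with the remaining cases used as queries.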
Abadi, M. and Moore, D. R. (2022). Selection of circular proposals in building projects: an mcdm model for lifecycle circularity assessments using ahp. Buildings, 12(8):1110.
Abdo, W. F., Van De Warrenburg, B. P., Burn, D. J., Quinn, N. P., and Bloem, B. R. (2010). The clinical approach to movement disorders. Nature Reviews Neurology, 6(1):29–37.
Akoglu, H. (2018). User’s guide to correlation coefficients. Turkish Journal of Emergency Medicine, 18(3):91–93.
Alzubaidi, S. and Soori, P. K. (2012). Energy efficient lighting system design for hospitals diagnostic and treatment room—a case study. Journal of light & visual environment, 36(1): 23–31.
Andrés, A. M. and Marzo, P. F. (2004). Delta: A new measure of agreement between two raters. British journal of mathematical and statistical psychology, 57(1):1–19.
Ashyani, A., Lin, C.-L., Roman, E., Yeh, T., Kuo, T., Tsai, W.-F., Lin, Y., Tu, R., Su, A., Wang, C.-C., Tan, C.-H., and Nordling, T. E. M. (2022). Digitization of updrs upper limb motor examinations towards automated quantification of symptoms of parkinson’s disease. Manuscript in preparation.
Berardelli, A., Wenning, G., Antonini, A., Berg, D., Bloem, B., Bonifati, V., Brooks, D., Burn, D., Colosimo, C., Fanciulli, A., et al. (2013). Efns/mds-es recommendations for the diagnosis of parkinson's disease. European journal of neurology, 20(1):16–34.
Berg, E. C., Mascha, M., and Capehart, K. W. (2022). Judging reliability at wine and water competitions. Journal of Wine Economics, 17(4):311–328.
Bromm, K. N., Lang, I.-M., Twardzik, E. E., Antonakos, C. L., Dubowitz, T., and Colabianchi, N. (2020). Virtual audits of the urban streetscape: comparing the inter-rater reliability of gigapan® to google street view. International Journal of Health Geographics, 19:1–15.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
Butt, A. H., Rovini, E., Dolciotti, C., De Petris, G., Bongioanni, P., Carboncini, M., and Cavallo, F. (2018). Objective and automatic classification of parkinson disease with leap motion controller. Biomedical engineering online, 17:1–21.
Cao, Z., Simon, T., Wei, S.-E., and Sheikh, Y. (2017). Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7291–7299.
Caruccio, L., Cirillo, S., Polese, G., Solimando, G., Sundaramurthy, S., and Tortora, G. (2024). Can chatgpt provide intelligent diagnoses? a comparative study between predictive models and chatgpt to define a new medical diagnostic bot. Expert Systems with Applications, 235:121186.
Chan, Y. (2003). Biostatistics 104: correlational analysis. Singapore Med J, 44(12):614–619.
Chang, J. R. and Nordling, T. E. M. (2021). Skin feature point tracking using deep feature encodings. arXiv preprint.
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. (2023). Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1):37–46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological bulletin, 70(4):213.
Dave, T., Athaluri, S. A., and Singh, S. (2023). Chatgpt in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Frontiers in artificial intelligence, 6:1169595.
De Coninck, K., Hambly, K., Dickinson, J. W., and Passfield, L. (2018). Measuring the morphological characteristics of thoracolumbar fascia in ultrasound images: an inter-rater reliability study. BMC Musculoskeletal Disorders, 19:1–6.
De Raadt, A., Warrens, M. J., Bosker, R. J., and Kiers, H. A. (2019). Kappa coefficients for missing data. Educational and psychological measurement, 79(3):558–576.
De Raadt, A., Warrens, M. J., Bosker, R. J., and Kiers, H. A. (2021). A comparison of reliability coefficients for ordinal rating scales. Journal of Classification, pages 1–25.
Dong, Q., Li, L., Dai, D., Zheng, C., Ma, J., Li, R., Xia, H., Xu, J., Wu, Z., Liu, T., et al. (2022). A survey on in-context learning. arXiv preprint arXiv:2301.00234.
Fahn, S., Elton, R. L., and Members of the UPDRS Development Committee (1987). Unified Parkinson's Disease Rating Scale. In Recent Developments in Parkinson's Disease, pages 153–163.
Feng, G. C. (2013a). Factors affecting intercoder reliability: A monte carlo experiment. Quality & Quantity, 47:2959–2982.
Feng, G. C. (2013b). Underlying determinants driving agreement among coders. Quality & Quantity, 47(5):2983–2997.
Finn, R. H. (1970). A note on estimating the reliability of categorical data. Educational and psychological measurement, 30(1):71–76.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5):378.
Franceschini, T., Bontemps, J.-D., Perez, V., and Leban, J.-M. (2013). Divergence in latewood density response of norway spruce to temperature is not resolved by enlarged sets of climatic predictors and their non-linearities. Agricultural and forest meteorology, 180:132–141.
Fritz, E. S. (1974). Total diet comparison in fishes by spearman rank correlation coefficients. Copeia, pages 210–214.
Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., and Wang, H. (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.
Goetz, C. G., Poewe, W., Rascol, O., Sampaio, C., Stebbins, G. T., Counsell, C., Giladi, N., Holloway, R. G., Moore, C. G., Wenning, G. K., et al. (2004). Movement disorder society task force report on the hoehn and yahr staging scale: status and recommendations the movement disorder society task force on rating scales for parkinson’s disease. Movement disorders, 19(9):1020–1028.
Goetz, C. G., Stebbins, G. T., Chmura, T. A., Fahn, S., Poewe, W., and Tanner, C. M. (2010). Teaching program for the movement disorder society-sponsored revision of the unified parkinson's disease rating scale (mds-updrs). Movement disorders, 25(9):1190–1194.
Goetz, C. G., Tilley, B. C., Shaftman, S. R., Stebbins, G. T., Fahn, S., Martinez-Martin, P., Poewe, W., Sampaio, C., Stern, M. B., Dodel, R., et al. (2008). Movement disorder society-sponsored revision of the unified parkinson’s disease rating scale (mds-updrs): scale presentation and clinimetric testing results. Movement disorders: official journal of the Movement Disorder Society, 23(15):2129–2170.
Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1):29–48.
Hayes, A. F. and Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication methods and measures, 1(1):77–89.
Heo, M. (2008). Utility of weights for weighted kappa as a measure of interrater agreement on ordinal scale. Journal of Modern Applied Statistical Methods, 7(1):17.
Heye, K., Li, R., Bai, Q., St George, R. J., Rudd, K., Huang, G., Meinders, M. J., Bloem, B. R., and Alty, J. E. (2024). Validation of computer vision technology for analyzing bradykinesia in outpatient clinic videos of people with parkinson’s disease. Journal of the Neurological Sciences, 466:123271.
Hsu, Y.-C., Su, Y.-H., Cheng, B.-R., Sung, S.-F., Liu, J.-X., Hsu, H.-C., and Hsiung, P.-A. (2024). Movement disorder evaluation of parkinson’s disease severity based on deep neural network models. IEEE Access.
Islam, M. S., Rahman, W., Abdelkader, A., Lee, S., Yang, P. T., Purks, J. L., Adams, J. L., Schneider, R. B., Dorsey, E. R., and Hoque, E. (2023). Using ai to measure parkinson's disease severity at home. npj Digital Medicine, 6(1):156.
Joshi, A., Kale, S., Chandel, S., and Pal, D. K. (2015). Likert scale: Explored and explained. British journal of applied science & technology, 7(4):396–403.
Karaev, N., Rocco, I., Graham, B., Neverova, N., Vedaldi, A., and Rupprecht, C. (2023). Cotracker: It is better to track together. arXiv preprint arXiv:2307.07635.
Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2):81–93.
Kendall, M. G. (1945). The treatment of ties in ranking problems. Biometrika, 33(3):239–251.
Kenny, L., Azizi, Z., Moore, K., Alcock, M., Heywood, S., Johnsson, A., McGrath, K., Foley, M. J., Sweeney, B., O'Sullivan, S., et al. (2024). Inter-rater reliability of hand motor function assessment in parkinson's disease: Impact of clinician training. Clinical Parkinsonism & Related Disorders, page 100278.
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. (2022). Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213.
Koo, T. K. and Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of chiropractic medicine, 15(2):155–163.
Krippendorff, K. (2011). Computing krippendorff’s alpha-reliability.
Krippendorff, K. (2018). Content analysis: An introduction to its methodology. Sage publications.
Landis, J. R. and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174.
Lester, B., Al-Rfou, R., and Constant, N. (2021). The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.
Li, H., Shao, X., Zhang, C., and Qian, X. (2021). Automated assessment of parkinsonian finger-tapping tests through a vision-based fine-grained classification model. Neurocomputing, 441:260–271.
Li, Z., Lu, K., Cai, M., Liu, X., Wang, Y., and Yang, J. (2022). An automatic evaluation method for parkinson’s dyskinesia using finger tapping video for small samples. Journal of Medical and Biological Engineering, 42(3):351–363.
Lim, Z. W., Pushpanathan, K., Yew, S. M. E., Lai, Y., Sun, C.-H., Lam, J. S. H., Chen, D. Z., Goh, J. H. L., Tan, M. C. J., Sheng, B., et al. (2023). Benchmarking large language models' performances for myopia care: a comparative analysis of chatgpt-3.5, chatgpt-4.0, and google bard. EBioMedicine, 95.
Ling, H., Massey, L. A., Lees, A. J., Brown, P., and Day, B. L. (2012). Hypokinesia without decrement distinguishes progressive supranuclear palsy from parkinson’s disease. Brain, 135(4):1141–1153.
Liu, Y., Chen, J., Hu, C., Ma, Y., Ge, D., Miao, S., Xue, Y., and Li, L. (2019). Vision-based method for automatic quantification of parkinsonian bradykinesia. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 27(10):1952–1961.
Liu, Y., Yang, Z., Cai, M., Wang, Y., Liu, X., Tong, H., Peng, Y., Lou, Y., and Li, Z. (2024). Atst-net: A method to identify early symptoms in the upper and lower extremities of pd. Medical Engineering & Physics, 128:104171.
Lombard, M., Snyder-Duch, J., and Bracken, C. C. (2002). Content analysis in mass communication: Assessment and reporting of intercoder reliability. Human communication research, 28(4):587–604.
Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C.-L., Yong, M. G., Lee, J., et al. (2019). Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172.
Ma, Y., Tang, W., Feng, C., and Tu, X. M. (2008). Inference for kappas for longitudinal study data: applications to sexual health research. Biometrics, 64(3):781–789.
Martinez-Manzanera, O., Roosma, E., Beudel, M., Borgemeester, R., van Laar, T., and Maurits, N. M. (2015). A method for automatic and objective scoring of bradykinesia using orientation sensors and classification algorithms. IEEE Transactions on Biomedical Engineering, 63(5):1016–1024.
Martínez-Martín, P., Gil-Nagel, A., Gracia, L. M., Gómez, J. B., Martinez-Sarries, J., Bermejo, F., and Group, C. M. (1994). Unified parkinson’s disease rating scale characteristics and structure. Movement disorders, 9(1):76–83.
Marzi, G., Balzano, M., and Marchiori, D. (2024). K-alpha calculator–krippendorff’s alpha calculator: A user-friendly tool for computing krippendorff’s alpha inter-rater reliability coefficient. MethodsX, 12:102545.
McGraw, K. O. and Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological methods, 1(1):30.
McHugh, M. L. (2012). Interrater reliability: the kappa statistic. Biochemia medica, 22(3): 276–282.
Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., and Gao, J. (2024). Large language models: A survey. arXiv preprint arXiv:2402.06196.
Moore, I. S., Mount, S., Mathema, P., and Ranson, C. (2018). Application of the subsequent injury categorisation model for longitudinal injury surveillance in elite rugby and cricket: intersport comparisons and inter-rater reliability of coding. British journal of sports medicine, 52(17):1137–1142.
Morley, D. D. (2009). Spss macros for assessing the reliability and agreement of student evaluations of teaching. Assessment & Evaluation in Higher Education, 34(6):659–671.
Mukaka, M. M. (2012). A guide to appropriate use of correlation coefficient in medical research. Malawi medical journal, 24(3):69–71.
Münnix, M. C., Schäfer, R., and Grothe, O. (2014). Estimating correlation and covariance matrices by weighting of market similarity. Quantitative Finance, 14(5):931–939.
Nori, H., Lee, Y. T., Zhang, S., Carignan, D., Edgar, R., Fusi, N., King, N., Larson, J., Li, Y., Liu, W., et al. (2023). Can generalist foundation models outcompete special-purpose tuning? case study in medicine. arXiv preprint arXiv:2311.16452.
Nugraha, I. G. D. and Kosasih, D. (2022). Evaluation of computer engineering practicum based-on virtual reality application.
OpenAI (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
Park, K. W., Lee, E.-J., Lee, J. S., Jeong, J., Choi, N., Jo, S., Jung, M., Do, J. Y., Kang, D.-W., Lee, J.-G., et al. (2021). Machine learning–based automatic rating for cardinal symptoms of parkinson disease. Neurology, 96(13):e1761–e1769.
Parker, R. I., Vannest, K. J., and Davis, J. L. (2013). Reliability of multi-category rating scales. Journal of School Psychology, 51(2):217–229.
Post, B., Merkus, M. P., de Bie, R. M., de Haan, R. J., and Speelman, J. D. (2005). Unified parkinson’s disease rating scale motor examination: are ratings of nurses, residents in neurology, and movement disorders specialists interchangeable? Movement disorders: official journal of the Movement Disorder Society, 20(12):1577–1584.
Postuma, R. B., Berg, D., Stern, M., Poewe, W., Olanow, C. W., Oertel, W., Obeso, J., Marek, K., Litvan, I., Lang, A. E., et al. (2015). Mds clinical diagnostic criteria for parkinson’s disease. Movement Disorders, 30(12):1591–1601.
Sarker, I. H. (2022). Ai-based modeling: techniques, applications and research issues towards automation, intelligent and smart systems. SN Computer Science, 3(2):158.
Schapira, A. H., Chaudhuri, K. R., and Jenner, P. (2017). Non-motor features of parkinson disease. Nature Reviews Neuroscience, 18(7):435–450.
Schober, P., Boer, C., and Schwarte, L. A. (2018). Correlation coefficients: appropriate use and interpretation. Anesthesia & analgesia, 126(5):1763–1768.
Schuster, C. (2004). A note on the interpretation of weighted kappa and its relations to other rater agreement statistics for metric scales. Educational and psychological measurement, 64(2):243–253.
Sengupta, A., Jin, F., Zhang, R., and Cao, S. (2020). mm-pose: Real-time human skeletal posture estimation using mmwave radars and cnns. IEEE Sensors Journal, 20(17):10032–10044.
Shi, W.-P. and Nordling, T. E. M. (2024). Combining old school autoencoder with cotracker for improved skin feature tracking. In The 19th IEEE Conference on Industrial Electronics and Applications (ICIEA 2024), IEEE Conference on Industrial Electronics and Applications (ICIEA 2024), Kristiansand, Norway. IEEE.
Shi, Z. and Lipani, A. (2023). Dept: Decomposed prompt tuning for parameter-efficient fine-tuning. arXiv preprint arXiv:2309.05173.
Shin, J., Matsumoto, M., Maniruzzaman, M., Hasan, M. A. M., Hirooka, K., Hagihara, Y., Kotsuki, N., Inomata-Terada, S., Terao, Y., and Kobayashi, S. (2024). Classification of hand-movement disabilities in parkinson’s disease using a motion-capture device and machine learning. IEEE Access.
Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: uses in assessing rater reliability. Psychological bulletin, 86(2):420.
Sim, J. and Wright, C. C. (2005). The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Physical therapy, 85(3):257–268.
Singh, J. and Singh, D. (2024). Solving quantitative reasoning problems with language models. Advanced Engineering Informatics, 62:102799.
Singh, M., Prakash, P., Kaur, R., Sowers, R., Brašić, J. R., and Hernandez, M. E. (2023). A deep learning approach for automatic and objective grading of the motor impairment severity in parkinson’s disease for use in tele-assessments. Sensors, 23(21):9004.
Spearman, C. (1961). The proof and measurement of association between two things.
Steinmetz, J. D., Seeher, K. M., Schiess, N., Nichols, E., Cao, B., Servili, C., Cavallera, V., Cousin, E., Hagins, H., Moberg, M. E., et al. (2024). Global, regional, and national burden of disorders affecting the nervous system, 1990–2021: a systematic analysis for the global burden of disease study 2021. The Lancet Neurology, 23(4):344–381.
Tay, Y., Wei, J., Chung, H. W., Tran, V. Q., So, D. R., Shakeri, S., Garcia, X., Zheng, H. S., Rao, J., Chowdhery, A., et al. (2022). Transcending scaling laws with 0.1% extra compute. arXiv preprint arXiv:2210.11399.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. (2023). Llama: Open and efficient foundation language models.
Uday, S., Kongjonaj, A., Aguiar, M., Tulchinsky, T., and Högler, W. (2017). Variations in infant and childhood vitamin d supplementation programmes across europe and factors influencing adherence. Endocrine connections, 6(8):667–675.
Ünüsan, N. and Yalçın, H. (2019). Teachers' self-efficacy is related to their nutrition teaching methods.
Vanbelle, S. (2016). A new interpretation of the weighted kappa coefficients. Psychometrika, 81(2):399–410.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30:5999–6009.
Villani, C. and Villani, C. (2009). The wasserstein distances. Optimal transport: old and new, pages 93–111.
Volkmann, N., Stracke, J., and Kemper, N. (2019). Evaluation of a gait scoring system for cattle by using cluster analysis and krippendorff’s α reliability. Veterinary Record, 184(7): 220–220.
Warrens, M. J. (2014). Corrected zegers-ten berge coefficients are special cases of cohen's weighted kappa. Journal of Classification, 31:179–193.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
Williams, S., Wong, D., Alty, J. E., and Relton, S. D. (2023). Parkinsonian hand or clinician's eye? finger tap bradykinesia interrater reliability for 21 movement disorder experts. Journal of Parkinson’s Disease, 13(4):525–536.
Yang, N., Liu, D.-F., Liu, T., Han, T., Zhang, P., Xu, X., Lou, S., Liu, H.-G., Yang, A.-C., Dong, C., Vai, M. I., Pun, S. H., and Zhang, J.-G. (2022). Automatic detection pipeline for accessing the motor severity of parkinson's disease in finger tapping and postural stability. IEEE Access, 10:66961–66973.
Yu, T., Park, K. W., McKeown, M. J., and Wang, Z. J. (2023). Clinically informed automated assessment of finger tapping videos in parkinson’s disease. Sensors, 23(22):9149.
Zapf, A., Castell, S., Morawietz, L., and Karch, A. (2016). Measuring inter-rater reliability for nominal data–which coefficients and confidence intervals are appropriate? BMC medical research methodology, 16:1–10.
Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. (2023). A survey of large language models. arXiv preprint arXiv: 2303.18223.
Zhou, H., Liu, F., Gu, B., Zou, X., Huang, J., Wu, J., Li, Y., Chen, S. S., Zhou, P., Liu, J., et al. (2023). A survey of large language models in medicine: Progress, application, and challenge. arXiv preprint arXiv:2311.05112.
Zysno, P. V. (1997). The modification of the phi-coefficient reducing its dependence on the marginal distributions. Methods of Psychological Research Online, 2(1):41–52.