
Graduate Student: Chen, Jia-Yi (陳嘉儀)
Thesis Title: Multimodal Fusion for Dementia Detection using Voice and Facial Features (利用聲音和臉部特徵進行多模態資料融合的失智症檢測)
Advisor: Chu, Wei-Ta (朱威達)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2024
Graduation Academic Year: 112 (2023-2024)
Language: English
Number of Pages: 45
Keywords: Multimodal Fusion, Dementia Detection, Dementia, Mild Cognitive Impairment
Abstract: In an aging society, the ability to detect dementia in elderly family members early is extremely important. Traditional diagnostic methods often involve lengthy and complex procedures, taking three to four months to reach a diagnosis. Research has shown that early diagnosis and treatment can effectively slow cognitive decline in the elderly. Our goal is therefore to develop a quick and low-cost method for dementia detection. Our dementia dataset was collected from the dementia clinic of Kaohsiung Veterans General Hospital and consists of 136 elderly individuals aged 65 to 85: 30 healthy controls, 45 with mild cognitive impairment, and 61 with dementia. The dementia assessment videos in the dataset were recorded while patients took the Short Portable Mental Status Questionnaire (SPMSQ) test. We propose a multimodal fusion method that combines patients' voice and facial features to effectively predict their Clinical Dementia Rating (CDR), Mini-Mental State Examination (MMSE), and Cognitive Abilities Screening Instrument (CASI) scores. Our results demonstrate that this method is more effective than current state-of-the-art approaches.
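To make the method description concrete, the following is a minimal sketch of an audio-visual late-fusion predictor in PyTorch: pre-extracted voice and face embeddings are projected to a common size, concatenated, and passed through a small shared network with a classification head for the CDR-based group and a regression head for the MMSE and CASI scores. The embedding dimensions, the concatenation fusion, and the three-way grouping (healthy control / MCI / dementia) are illustrative assumptions, not the architecture reported in the thesis.

```python
# Minimal audio-visual late-fusion sketch (illustrative only; dimensions,
# fusion operator, and head layout are assumptions, not the thesis's model).
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, audio_dim=768, visual_dim=768, hidden_dim=256):
        super().__init__()
        # Project each modality to a common hidden size before fusion.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Fuse by concatenation, then share a small trunk.
        self.shared = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
        )
        self.cdr_head = nn.Linear(hidden_dim, 3)    # healthy control / MCI / dementia
        self.score_head = nn.Linear(hidden_dim, 2)  # MMSE and CASI regression

    def forward(self, audio_emb, visual_emb):
        a = self.audio_proj(audio_emb)
        v = self.visual_proj(visual_emb)
        fused = self.shared(torch.cat([a, v], dim=-1))
        return self.cdr_head(fused), self.score_head(fused)

# Usage with dummy per-video embeddings (batch of 4).
model = AudioVisualFusion()
cdr_logits, scores = model(torch.randn(4, 768), torch.randn(4, 768))
print(cdr_logits.shape, scores.shape)  # torch.Size([4, 3]) torch.Size([4, 2])
```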

Table of Contents:
摘要 (Chinese Abstract) i
Abstract ii
Table of Contents iii
List of Tables v
List of Figures vi
Chapter 1. Introduction 1
  1.1. Motivation 1
  1.2. Overview 2
  1.3. Contributions 3
  1.4. Thesis Organization 3
Chapter 2. Related Works 4
  2.1. Detecting Dementia based on Machine Learning 4
  2.2. Detecting Dementia based on Deep Learning 5
    2.2.1. Single-Modality Methods 5
    2.2.2. Multi-Modality Methods 6
  2.3. Multimodal Contrastive Learning Model 7
Chapter 3. Methodology 9
  3.1. Dataset 9
    3.1.1. Participants 9
    3.1.2. Neuropsychological Assessments 10
    3.1.3. Preprocessing 10
  3.2. Overview 11
  3.3. Aural Branch 11
  3.4. Visual Branch 14
  3.5. Audio-Visual Fusion 17
  3.6. Loss Function 19
Chapter 4. Experimental Results 22
  4.1. Experimental Settings 22
    4.1.1. Implementation Details 22
    4.1.2. Evaluation Metrics 23
  4.2. Data Statistics 24
  4.3. CDR Classification 25
  4.4. MMSE/CASI Prediction 26
  4.5. Ablation Study 27
    4.5.1. Concatenation or Element-wise Addition 27
    4.5.2. Mixup Parameters 28
    4.5.3. Adopting Different Components 29
    4.5.4. Impact of Contrastive Loss on Audio-Visual Embedding Alignment 29
Chapter 5. Conclusion 31
  5.1. Conclusion 31
  5.2. Future Works 31
    5.2.1. Open Datasets 31
    5.2.2. Automatic Pipeline 32
References 33


Full-text public release date: 2026-01-14