| Graduate student: | Chiu, Sin-Huei (邱歆惠) |
|---|---|
| Thesis title: | A Diagnostic Support and Reporting System Integrating Vocal Fold Imaging and Acoustic Features for Quantitative Assessment |
| Advisor: | Liang, Sheng-Fu (梁勝富) |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Publication year: | 2025 |
| Academic year: | 113 |
| Language: | English |
| Pages: | 81 |
| Keywords: | Deep learning, Videostroboscopy, Human-machine collaboration, Glottal area, Glottal area waveform |
Vocal fold examination is an essential part of the detailed diagnosis of many vocal fold diseases. The assessment of vocal fold vibration and voice production has long relied on manual observation and analysis, making it highly dependent on the clinician's subjective experience and lacking uniform quantitative standards. To address this issue, this study developed a diagnostic support and reporting system that integrates vocal fold imaging and acoustic features for quantitative assessment, aiming to enhance the objectivity and efficiency of diagnosing voice disorders. A significant feature of this system is that it is designed specifically for clinically mainstream videostroboscopy (VS) images, addressing the clinical reality that most hospitals rely on VS because of the prohibitive equipment cost of high-speed videoendoscopy (HSV). The system uses a deep learning model for glottal segmentation and extracts multi-dimensional quantitative features from the glottal area waveform (GAW). On the acoustic side, it employs a two-stage model (EfficientNet-B4 combined with feature-assisted judgment) for severity classification. Furthermore, an innovative acoustic radar chart transforms complex acoustic data into visual information that is easily interpretable even by otolaryngologists who are not acoustics specialists.
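The abstract summarizes, but does not show, how quantitative features are derived from the segmentation output. As a minimal illustrative sketch (not the authors' implementation), assuming per-frame binary glottis masks are already available from the segmentation model, the GAW can be computed by counting glottis pixels in each frame; the `open_quotient` helper is a hypothetical example of one feature derived from the waveform:

```python
import numpy as np

def glottal_area_waveform(masks, fps):
    """Compute a glottal area waveform (GAW) from per-frame binary
    segmentation masks: the area at each frame is the count of
    glottis pixels, and time is derived from the frame rate."""
    areas = np.array([float(m.sum()) for m in masks])
    t = np.arange(len(areas)) / fps
    return t, areas

def open_quotient(areas, threshold=0.0):
    """Illustrative feature: fraction of frames in which the glottis
    is open (area above a threshold)."""
    areas = np.asarray(areas, dtype=float)
    return float((areas > threshold).mean())
```

In practice the area would be converted from pixels to a calibrated unit and the waveform segmented into glottal cycles before feature extraction; those steps depend on details not given in the abstract.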
The core value of this system lies in its automated multi-modal analysis and report generation. It integrates glottal imaging and audio data to produce comprehensive reports containing quantitative feature parameters and preliminary analytical annotations. The system's graphical user interface (GUI) presents the analytical results, including the glottal area waveform, individual video frames of the vocal folds, the acoustic radar chart, and multiple quantitative features, and supports human-machine collaboration: clinicians can review and adjust the system's analysis, ensuring expert confirmation before the final report is generated, thereby enhancing diagnostic accuracy and preventing misjudgment.
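The abstract does not specify how the acoustic radar chart is constructed. One plausible sketch is to min-max scale each acoustic measure against a normal-range bound so every axis lands on [0, 1]; the feature names and bounds below are illustrative assumptions, not values taken from the thesis:

```python
# Hypothetical acoustic measures and normal-range bounds; the actual
# features and limits used by the system are not given in the abstract.
BOUNDS = {
    "jitter_pct":  (0.0, 2.0),
    "shimmer_pct": (0.0, 6.0),
    "hnr_db":      (0.0, 30.0),
}

def radar_axes(features):
    """Map raw acoustic measures onto [0, 1] radar-chart axes via
    min-max scaling against per-feature bounds, clipping outliers."""
    axes = {}
    for name, value in features.items():
        lo, hi = BOUNDS[name]
        axes[name] = min(max((value - lo) / (hi - lo), 0.0), 1.0)
    return axes
```

Scaling every measure onto a common unit interval is what lets a non-specialist read the chart's shape at a glance, which matches the stated goal of making the acoustic data interpretable for otolaryngologists.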
Experimental validation on clinical data from Chang Gung Memorial Hospital (CGMH) confirms the system's efficacy: the dysphonia severity prediction model achieved 71.4% overall accuracy across 476 phonation samples, while the automated audio and video comment generation modules showed high agreement with expert assessments, reaching 94.9% accuracy on 158 phonation segments and 90.1% on 32 videos, respectively. These results demonstrate that the system is not only highly effective across all analysis modules but also that its auto-generated comments are highly consistent with expert judgment. In summary, this research developed a clinically valuable auxiliary tool: it improves the efficiency and objectivity of voice disorder diagnosis and, through data quantification and standardization, lays a solid foundation for building large-scale vocal fold disease databases and for future data-driven research.
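The reported figures are plain agreement rates between the system and expert labels. As a quick arithmetic check (the correct-prediction counts below are inferred from the percentages and sample sizes, not stated in the abstract):

```python
def overall_accuracy(correct: int, total: int) -> float:
    """Fraction of samples on which the system agrees with the expert label."""
    return correct / total

# 340 correct out of 476 phonation samples reproduces the reported 71.4%.
severity = round(overall_accuracy(340, 476), 3)

# 150 agreeing comments out of 158 segments reproduces the reported 94.9%.
audio_comments = round(overall_accuracy(150, 158), 3)
```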
On-campus access: to be made publicly available on 2030-08-18.