
Graduate Student: Hu, Yu-Weng (胡毓翁)
Thesis Title: A Speech Feature Study of Taiwan Hakka Si-Yen Accent for Elderly Care (台灣客家四縣腔之高齡照護用語特徵研究)
Advisor: Chou, Jung-Hua (周榮華)
Degree: Master
Department: College of Engineering - Department of Engineering Science
Year of Publication: 2022
Graduation Academic Year: 110 (ROC calendar, 2021-2022)
Language: Chinese
Pages: 46
Chinese Keywords: speaker recognition (語者辨識), speech recognition (語音辨識), speech feature analysis (語音特性分析), Taiwan Hakka language (台灣客家話)
Foreign Keywords: Taiwan Hakka language, Speaker recognition, Speech recognition, Speech feature
    Taiwan Hakka is a minority language relative to Mandarin and Holo (Taiwanese Hokkien), the most widely used of Taiwan's local languages. In this thesis, recordings of the Taiwan Hakka Si-Yen accent were made for specific phrases commonly used by the elderly, covering four categories: daily expressions, physical and mental conditions, emergencies, and service commands. Ten speakers each recorded the same fifty-six phrases. The recorded audio was analyzed in both the time and frequency domains; the cross-correlation function was used to measure the similarity between the word signals in the short phrases, and the speaker of a recording was identified by the degree of signal similarity. Supervised machine learning was also applied to two classification tasks: recognizing the fifty-six short phrases and recognizing the ten speakers. MFCCs (Mel-Frequency Cepstral Coefficients) served as the input features, and an LSTM (Long Short-Term Memory) recurrent neural network was used as the training model. Even with a relatively small amount of training data, the recognition accuracy reached 93.75% for the fifty-six phrases and 99.17% for the ten speakers. Different speech features were also compared as machine-learning inputs, and 30-dimensional MFCCs were found to train the most accurate models. Finally, a system was built that uses the trained model to recognize input audio in real time. Because speech received from a microphone may contain non-speech segments and noise, the audio is first filtered down to valid voice segments by thresholding speech characteristics of the signal, preserving recognition accuracy.
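    The speaker-matching step described above can be illustrated with a minimal sketch, assuming NumPy; the peak of the normalized cross-correlation between a test recording and each speaker's reference recording serves as the similarity score. The function and variable names are hypothetical, not taken from the thesis.

        import numpy as np

        def xcorr_peak(a, b):
            # Peak of the normalized cross-correlation of two signals;
            # values near 1 indicate highly similar waveforms.
            a = (a - a.mean()) / (a.std() * len(a))
            b = (b - b.mean()) / b.std()
            return float(np.max(np.correlate(a, b, mode="full")))

        def identify_speaker(test, references):
            # Assign the test signal to the reference speaker whose
            # recording yields the highest cross-correlation peak.
            return max(references, key=lambda name: xcorr_peak(test, references[name]))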

    Taiwan is facing a declining birthrate and an aging population, so companion robots may become an important way to offset the manpower shortage in elderly care. Although voice assistants are popular and bring convenience to many services, few voice interfaces serve speakers of Taiwan Hakka. This contrasts with the fact that many elderly people still use Taiwan Hakka in their daily lives. It would therefore be helpful to extend voice-assistant services to the Hakka elderly.
    In this thesis, 56 spoken phrases of the Hakka Si-Yen accent for elderly care were recorded in WAV format by 10 different speakers (6 female, 4 male), giving 560 speech signals in total. The fundamental frequency of each signal was obtained via autocorrelation; the average fundamental frequencies of the male speakers were found to be lower and more concentrated than those of the female speakers. Speech feature analysis was also performed using the zero-crossing rate, root-mean-square energy, spectral centroid, and spectral flatness. Using the maximum of the cross-correlation function, the speaker-recognition accuracy is 70.62%.
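    As a rough sketch of the analysis above, assuming the librosa library, the fundamental frequency can be estimated from the strongest autocorrelation peak within a plausible pitch band, and the four spectral features are available as built-in functions; the 50-400 Hz search band and the file name are illustrative assumptions.

        import numpy as np
        import librosa

        def f0_autocorr(y, sr, fmin=50.0, fmax=400.0):
            # Estimate the fundamental frequency as the lag of the strongest
            # autocorrelation peak inside a plausible pitch range.
            ac = np.correlate(y, y, mode="full")[len(y) - 1:]  # non-negative lags
            lo, hi = int(sr / fmax), int(sr / fmin)            # lag search band
            return sr / (lo + int(np.argmax(ac[lo:hi])))

        y, sr = librosa.load("speaker01_phrase01.wav", sr=None)  # hypothetical file
        features = {
            "f0": f0_autocorr(y, sr),
            "zcr": librosa.feature.zero_crossing_rate(y).mean(),
            "rms": librosa.feature.rms(y=y).mean(),
            "centroid": librosa.feature.spectral_centroid(y=y, sr=sr).mean(),
            "flatness": librosa.feature.spectral_flatness(y=y).mean(),
        }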
    For speech recognition, supervised machine learning was applied. The signals were transformed into MFCCs (Mel-Frequency Cepstral Coefficients), and an LSTM (Long Short-Term Memory) network was used for training. The recognition accuracy is 93.75% for the 56 phrases and 99.18% for the 10 speakers. In addition, microphone input was added to form a real-time speech-recognition system: the raw signal is reduced to its speech segments by an energy threshold before being recognized by the pre-trained model.
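    The recognition pipeline (energy-based trimming, 30-dimensional MFCC input, and an LSTM with the Batch Normalization and Dropout layers listed in the table of contents) could look roughly like the sketch below, assuming librosa and TensorFlow/Keras; the sample rate, frame count, layer width, and dropout rate are illustrative assumptions rather than the thesis's exact configuration.

        import numpy as np
        import librosa
        import tensorflow as tf

        N_MFCC = 30       # 30-dimensional MFCCs gave the best accuracy
        N_CLASSES = 56    # 56 phrases (use 10 instead for the speaker-ID task)
        MAX_FRAMES = 100  # assumed fixed input length

        def wav_to_mfcc(path):
            # Load a recording, drop low-energy leading/trailing segments,
            # and return a fixed-size (frames, n_mfcc) MFCC matrix.
            y, sr = librosa.load(path, sr=16000)
            y, _ = librosa.effects.trim(y, top_db=30)  # energy threshold
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T
            pad = max(0, MAX_FRAMES - len(mfcc))
            return np.pad(mfcc, ((0, pad), (0, 0)))[:MAX_FRAMES]

        model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(MAX_FRAMES, N_MFCC)),
            tf.keras.layers.LSTM(128),
            tf.keras.layers.BatchNormalization(),
            tf.keras.layers.Dropout(0.3),
            tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])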

    Table of Contents
    Abstract (Chinese)
    Extended Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1  Introduction
      1-1  Research Motivation and Background
      1-2  Research Objectives
      1-3  Literature Review
        1-3-1  Companion Robots
        1-3-2  Speech Features
        1-3-3  Neural Network Models
      1-4  Thesis Organization
    Chapter 2  Background Techniques
      2-1  Speech Feature Extraction
        2-1-1  Fundamental Frequency
        2-1-2  Zero-Crossing Rate (ZCR)
        2-1-3  Root-Mean-Square Energy (RMSE)
        2-1-4  Spectral Centroid
        2-1-5  Spectral Flatness
      2-2  Mel-Frequency Cepstral Coefficients (MFCC)
      2-3  Long Short-Term Memory (LSTM)
    Chapter 3  Analysis Methods and Discussion
      3-1  Description of the Analyzed Data
      3-2  Audio Feature Analysis
        3-2-1  Fundamental Frequency Analysis
        3-2-2  Zero-Crossing Rate and Root-Mean-Square Energy
        3-2-3  Zero-Crossing Rate and Spectral Flatness
        3-2-4  Spectral Centroid
      3-3  Signal Correlation
        3-3-1  Cross-Correlation Function
        3-3-2  Speaker Signal Correlation Experiments
    Chapter 4  Speech Recognition and Speaker Recognition
      4-1  Feature Extraction
      4-2  Neural Network Architecture
        4-2-1  Batch Normalization
        4-2-2  Dropout
      4-3  Model Training
      4-4  Speech Recognition System
    Chapter 5  Conclusions and Suggestions
      5-1  Conclusions
      5-2  Suggestions
    References

