| Student: | 潘佑欣 Pan, Yu-Hsin |
|---|---|
| Thesis Title: | 台灣河洛話求救語音特性分析 (The Voice Features of Help-asking of Taiwan Heluo Language) |
| Advisor: | 周榮華 Chou, Jung-Hua |
| Co-advisor: | 侯廷偉 Hou, Ting-Wei |
| Degree: | Master (碩士) |
| Department: | Department of Engineering Science, College of Engineering (工學院 - 工程科學系) |
| Publication Year: | 2021 |
| Academic Year: | 109 |
| Language: | Chinese |
| Pages: | 46 |
| Keywords (Chinese): | 陪伴機器人, 語者辨識, 台灣河洛話, 語音特性分析 |
| Keywords (English): | companion robot, speaker recognition, Taiwan Heluo language, voice feature analysis |
This thesis studies the recognition of specific Taiwan Heluo (河洛話) vocabulary for use as an emergency communication tool. With the arrival of an aging society, companion robots are increasingly deployed to serve the elderly, and a large proportion of Taiwan's elderly habitually use Heluo as their main language of daily communication. Enabling a companion robot to recognize common Heluo vocabulary therefore allows it to provide timely help in an emergency.
In this study, common phrases that a companion robot may encounter when serving the elderly were recorded, covering daily service requests, emergency expressions, and information queries: five speakers and 52 common phrases in total. The recordings were analyzed with the FFT (Fast Fourier Transform) to obtain spectral information, and the spectra were overlaid to locate the fundamental frequency and other dominant frequencies for feature analysis, performed in both speaker-dependent and speaker-independent directions. Because the FFT provides no time information, the wavelet transform was additionally applied to obtain joint time-frequency information. The study also evaluated syllable-count detection: detection based on energy and zero-crossing rate reached 52% accuracy, while detection based on utterance duration reached 58%-75% depending on the speaker; the difficulties of detection are also discussed.
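As a hedged illustration of the FFT step described above (a minimal sketch, not the thesis's actual code), the snippet below estimates the fundamental frequency of a synthetic harmonic signal by picking the strongest peak of its magnitude spectrum; the 150 Hz pitch, sampling rate, and harmonic amplitudes are invented for the example:

```python
import numpy as np

fs = 16000                      # sampling rate in Hz, assumed for the example
t = np.arange(0, 1.0, 1 / fs)   # one second of signal
f0 = 150.0                      # synthetic fundamental frequency (Hz)

# A voiced-like signal: fundamental plus two weaker harmonics.
x = (1.00 * np.sin(2 * np.pi * f0 * t)
     + 0.50 * np.sin(2 * np.pi * 2 * f0 * t)
     + 0.25 * np.sin(2 * np.pi * 3 * f0 * t))

# Magnitude spectrum via the real FFT, with matching frequency axis.
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1 / fs)

# For this harmonic signal the strongest peak is the fundamental.
estimated_f0 = freqs[np.argmax(spectrum)]
print(round(estimated_f0, 1))  # → 150.0
```

With one second of signal the FFT bin width is 1 Hz, so the peak lands exactly on the fundamental; real recordings would need windowing and peak interpolation for comparable precision.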
Because a companion robot serving in homes or medical institutions may face multiple speakers, and for practicality and flexibility may need different interaction modes or permission levels for different speakers, speaker recognition was performed on the five recorded speakers: features were extracted with MFCC (Mel-Frequency Cepstral Coefficients) and fed to an LSTM (Long Short-Term Memory) neural network, achieving 98% accuracy.
Elderly care has recently become an important issue in Taiwan, and companion robots may play an important role in this respect. Surveys indicate that a large proportion of the elderly currently speak the Taiwan Heluo language in their daily life. Thus, for effective communication with the robots, it is necessary to know the features of the Taiwan Heluo language. In this thesis, 52 Taiwan Heluo help-asking sentences from five speakers were analyzed to help the elderly in need.
The voice signals of the five speakers were recorded in the WAV format. Their frequency content was obtained via the FFT (Fast Fourier Transform) for both speaker-dependent and speaker-independent analysis. The wavelet transform was also performed to deduce the relation between frequency and time. Two methods were used to detect the number of syllables in a sentence: one, based on energy and zero-crossing rate, reached 52% accuracy; the other, based on the length of speaking time, varied from 58% to 75% across speakers. The inaccuracy was mainly caused by fast speech, when the speaking time falls below 0.2 s per syllable.
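The energy and zero-crossing-rate criterion can be sketched as follows. This is an illustrative reconstruction under assumed thresholds (the thesis does not publish its frame size or threshold values): a frame counts as voiced when its short-time energy is high and its zero-crossing rate is low, and each contiguous run of voiced frames is counted as one syllable.

```python
import numpy as np

fs = 8000
t = np.arange(0, 1.0, 1 / fs)

# Synthetic utterance: two voiced "syllables" (200 Hz bursts) separated by silence.
x = np.zeros_like(t)
x[(t > 0.1) & (t < 0.3)] = np.sin(2 * np.pi * 200 * t[(t > 0.1) & (t < 0.3)])
x[(t > 0.5) & (t < 0.8)] = np.sin(2 * np.pi * 200 * t[(t > 0.5) & (t < 0.8)])
x += 0.01 * np.random.default_rng(0).standard_normal(len(t))  # light noise floor

frame_len = 200  # 25 ms frames at 8 kHz
frames = x[: len(x) // frame_len * frame_len].reshape(-1, frame_len)

# Short-time energy and zero-crossing rate per frame.
energy = (frames ** 2).sum(axis=1)
zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).sum(axis=1) / frame_len

# Voiced = high energy AND low zero-crossing rate (thresholds are assumptions).
voiced = (energy > 0.1 * energy.max()) & (zcr < 0.2)

# Count rising edges of the voiced mask = number of syllable-like bursts.
syllables = int(np.sum(np.diff(np.concatenate(([0], voiced.astype(int)))) == 1))
print(syllables)  # → 2
```

The 52% figure reported above suggests why this is hard on real speech: adjacent syllables without an intervening energy dip merge into one run, which a fixed threshold cannot separate.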
As the robot may face several speakers and needs to provide personalized service for each, speaker recognition was conducted: MFCC (Mel-Frequency Cepstral Coefficients) was first used to extract features, and an LSTM (Long Short-Term Memory) network was then trained as the recognition model. The recognition accuracy reaches 98% for the five speakers.
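A minimal numpy sketch of the MFCC front end is shown below. The frame length, hop, FFT size, and filter counts are common defaults, not the thesis's actual parameters, and the test tone is invented; the resulting (frames × coefficients) matrix is the kind of sequence a model such as an LSTM would consume.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs, n_fft=512, n_mels=26, n_ceps=13, frame_len=400, hop=160):
    """MFCC sketch: frame -> window -> power spectrum -> mel filterbank -> log -> DCT."""
    # Split into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # Power spectrum of each (zero-padded) frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular mel filterbank between 0 Hz and the Nyquist frequency.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log mel energies, then DCT-II keeping the first n_ceps coefficients.
    log_mel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T

fs = 16000
t = np.arange(0, 0.5, 1 / fs)
x = np.sin(2 * np.pi * 220 * t)   # synthetic test tone
feats = mfcc(x, fs)
print(feats.shape)                # → (48, 13), i.e. 48 frames x 13 coefficients
```

In practice a library such as librosa or python_speech_features would be used instead of hand-rolling the filterbank; the sketch only makes the pipeline stages concrete.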