| Graduate Student: | 黃裕軒 Huang, Yu-Hsuan |
|---|---|
| Thesis Title: | 深度學習應用於語音增強演算法 Deep Learning Applied to Speech Enhancement Algorithms |
| Advisor: | 雷曉方 Lei, Sheau-Fang |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of Publication: | 2018 |
| Academic Year of Graduation: | 106 |
| Language: | Chinese |
| Number of Pages: | 85 |
| Keywords (Chinese): | 深度學習、時頻遮罩、監督式學習、噪音特徵 |
| Keywords (English): | Deep Learning, Time-Frequency Mask, Supervised Learning, Noise Estimation Features |
In today's society, spoken communication between humans and machines is becoming increasingly important, but highly complex environmental noise degrades speech quality. For a single-channel speech enhancement algorithm, eliminating environmental noise from monaural information alone is difficult. This thesis applies deep learning to speech enhancement, adding noise features and modifying the cost function so that the model learns to separate speech from noisy speech.
The trained model produces estimates of the speech and the noise, and a time-frequency mask is then applied to the noisy speech to extract the estimated speech. At the training stage, this thesis changes the cost function: rather than using the mean squared error on spectral amplitudes, the deep learning model learns not the speech or noise spectral amplitudes directly but the proportion of speech and the proportion of noise within a segment of noisy speech; noise input features are also added so the model can adapt to noise variation. Experimental results on objective speech metrics, with noisy inputs at SNRs from -15 dB to 5 dB, show that the method suppresses noise to a considerable degree, removes most noise interference, and preserves most of the speech information.
Communication between people and machines is becoming increasingly important in today's world, but complex environmental noise seriously degrades speech quality in dialogue. Single-channel speech enhancement is especially difficult because only monaural information is available for eliminating environmental noise. This thesis proposes a deep-learning-based speech enhancement method that separates the speech from noisy speech. Conventional deep learning approaches use the speech and noise frequency spectra as the learning targets; after training, the model's output estimates are used to mask the noisy speech and extract the speech from it. By changing the learning target and the input features at the training stage, we improve the quality of the enhanced speech: the model learns not the speech and noise spectral amplitudes directly but the ratio between speech and noise, uses this ratio to separate the clean speech from the noisy speech, and uses noise estimation features to adapt to variations in the noise. Experimental results show improved performance on objective speech quality measures for noisy inputs at SNRs from -15 dB to 5 dB; the noise estimation features in particular greatly improve the speech enhancement.
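The time-frequency ratio masking described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the thesis's implementation: it uses oracle speech and noise magnitudes for a single analysis frame (in the thesis, these proportions are what the trained network estimates), and a pure tone plus broadband noise stands in for speech and environmental noise.

```python
import numpy as np

def ratio_mask(speech_mag, noise_mag, eps=1e-12):
    """Per-bin proportion of speech magnitude, bounded in [0, 1]."""
    return speech_mag / (speech_mag + noise_mag + eps)

# Hypothetical one-frame example: a tone stands in for speech.
rng = np.random.default_rng(0)
n = 512
t = np.arange(n)
speech = np.sin(2 * np.pi * 25 * t / n)   # tone exactly on FFT bin 25
noise = 0.5 * rng.standard_normal(n)      # additive broadband noise
noisy = speech + noise

# Magnitude spectra of the frame (oracle here; network outputs in the thesis).
S = np.abs(np.fft.rfft(speech))
N = np.abs(np.fft.rfft(noise))
Y = np.fft.rfft(noisy)

mask = ratio_mask(S, N)                   # speech proportion per frequency bin
enhanced = np.fft.irfft(mask * Y, n=n)    # apply mask, resynthesize the frame

# The masked frame should be closer to the clean speech than the noisy frame.
mse_noisy = np.mean((noisy - speech) ** 2)
mse_enhanced = np.mean((enhanced - speech) ** 2)
```

Because the mask is a proportion in [0, 1], bins dominated by speech pass through nearly unchanged while noise-dominated bins are attenuated, which matches the abstract's choice of learning speech/noise proportions rather than raw spectral amplitudes.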
On-campus access: available from 2023-03-01.