
Graduate Student: Huang, Yu-Hsuan (黃裕軒)
Thesis Title: Deep Learning Applied to Speech Enhancement Algorithm (深度學習應用於語音增強演算法)
Advisor: Lei, Sheau-Fang (雷曉方)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2018
Graduation Academic Year: 106
Language: Chinese
Number of Pages: 85
Chinese Keywords: 深度學習 (deep learning), 時頻遮罩 (time-frequency mask), 監督式學習 (supervised learning), 噪音特徵 (noise features)
English Keywords: deep learning, time-frequency mask, supervised learning, noise estimation features
  • Speech communication between humans and machines is becoming increasingly important in today's society, but environments contain highly complex noise interference that degrades speech quality, and for a single-channel speech enhancement algorithm it is difficult to remove environmental noise from single-channel information alone. This thesis applies deep learning to speech enhancement: by adding noise features and modifying the cost function, the model learns to separate speech from noisy speech.
    After training, the model produces speech and noise estimates, and a time-frequency mask is applied to the noisy speech to extract the estimated speech. During training, the cost function is changed: rather than using mean squared error on the spectra, the deep learning model learns not the speech or noise spectral amplitudes directly but the proportions of speech and of noise within a noisy utterance, and noise input features are added to adapt to noise variation. Experimental results on objective speech metrics, with noisy inputs ranging from -15 dB to 5 dB SNR, show that the method suppresses noise to a meaningful degree, removes most noise interference, and preserves most of the speech information.
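The masking step described above can be sketched in a few lines: the network's speech and noise magnitude estimates form a ratio-style time-frequency mask that scales the noisy spectrum while keeping the noisy phase. This is an illustrative sketch under stated assumptions, not the thesis's actual implementation; the function and variable names (`apply_ratio_mask`, `eps`, the toy bin values) are assumptions:

```python
import numpy as np

def apply_ratio_mask(noisy_stft, speech_mag_est, noise_mag_est, eps=1e-8):
    """Build a ratio-style time-frequency mask from estimated speech and
    noise magnitudes and apply it to the complex noisy STFT."""
    mask = speech_mag_est / (speech_mag_est + noise_mag_est + eps)
    return mask * noisy_stft  # scales magnitudes, keeps the noisy phase

# Toy example: one frame with three frequency bins.
noisy = np.array([1.0 + 0.0j, 2.0 + 0.0j, 0.5 + 0.0j])
s_est = np.array([0.9, 1.0, 0.1])   # estimated speech magnitude per bin
n_est = np.array([0.1, 1.0, 0.4])   # estimated noise magnitude per bin
enhanced = apply_ratio_mask(noisy, s_est, n_est)
# mask is [0.9, 0.5, 0.2], so |enhanced| is approximately [0.9, 1.0, 0.1]
```

A convenient property of this mask is that the speech mask and the noise mask sum to one in every bin, so the two masked signals reconstruct the noisy input.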

    Communication between people and machines is becoming more important in today's world, but complicated environmental noise seriously degrades the speech quality of a dialogue, and with only a single channel it is especially difficult to eliminate that noise. This thesis proposes a speech enhancement method based on deep learning that separates the speech from noisy speech. Conventional deep learning approaches use the speech and noise frequency spectra as the learning targets; after training, the model's output estimates are used to mask the noisy speech and extract the speech from it. By changing the input features and the cost function at the training stage, we improve the quality of the enhanced speech: the model learns not the speech and noise spectral amplitudes directly but the ratio between speech and noise, uses that ratio to separate the clean speech from the noisy speech, and uses noise estimation features to adapt to variation in the noise. Experimental results show improvement on objective speech evaluation indexes across noisy inputs at different SNRs (-15 dB to 5 dB), and the noise estimation features greatly improve the speech enhancement effect.
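The noise-input-feature idea mentioned above (noise-aware training) can be sketched as follows: a crude noise estimate, here the mean log-magnitude of the first few frames assumed to be speech-free, is appended to every frame's feature vector so the network can condition on the current noise. The names and the default of six noise frames are illustrative assumptions, not the thesis's exact feature design:

```python
import numpy as np

def nat_features(log_mag_frames, n_noise_frames=6):
    """Append a fixed noise estimate (mean log-magnitude of the first
    few, assumed speech-free, frames) to every frame's input vector."""
    noise_est = log_mag_frames[:n_noise_frames].mean(axis=0)
    noise_tiled = np.tile(noise_est, (log_mag_frames.shape[0], 1))
    return np.concatenate([log_mag_frames, noise_tiled], axis=1)

# 100 frames of a 257-bin log-magnitude spectrogram.
frames = np.log1p(np.abs(np.random.randn(100, 257)))
feats = nat_features(frames)
# feats.shape == (100, 514): 257 spectral dims + 257 noise-estimate dims
```

Because the appended noise estimate is identical for every frame of an utterance, the network sees each frame's spectrum together with a summary of the noise conditions it was recorded under.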

    Abstract (Chinese)
    Extended Abstract
    Acknowledgements
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1  Introduction
      1.1 Noise Signals
      1.2 Introduction to Neural Networks
      1.3 Speech Enhancement
      1.4 Motivation and Objectives
      1.5 Thesis Organization
    Chapter 2  Review of Related Literature
      2.1 Single-Channel Speech Enhancement Algorithms
        2.1.1 Spectral Subtractive Algorithms
        2.1.2 Minima Controlled Recursive Averaging (MCRA)
      2.2 Learning Algorithms
        2.2.1 Linear Regression
        2.2.2 Cost Function
        2.2.3 Gradient Descent
        2.2.4 Error, Overfitting, and Underfitting
        2.2.5 Deep Learning
      2.3 Deep Learning Speech Enhancement Algorithms
        2.3.1 Speech Enhancement Based on Deep Learning
        2.3.2 Deep Learning Based on Time-Frequency Masks
        2.3.3 Noise-Aware Training (NAT)
    Chapter 3  Deep Learning Speech Enhancement Algorithm
      3.1 Overview of the Speech Enhancement Algorithm
      3.2 Data Preprocessing and Noise Features
        3.2.1 Preprocessing and Fourier Transform
        3.2.2 Noise Features
      3.3 Deep Neural Network Architecture
        3.3.1 Deep Learning
        3.3.2 Deep Neural Network (DNN)
        3.3.3 Recurrent Neural Network (RNN)
        3.3.4 Cost Function
        3.3.5 Regularization
      3.4 Network Training
        3.4.1 Back-Propagation Algorithm
        3.4.2 Backpropagation Through Time (BPTT)
        3.4.3 RNN Temporal Weight Initialization
      3.5 Algorithm Summary
        3.5.1 Deep Network Training Pseudocode
        3.5.2 Time-Frequency Mask
        3.5.3 Summary of the Algorithm
    Chapter 4  Analysis, Comparison, and Results
      4.1 Objective Metrics for Speech Enhancement
        4.1.1 Signal-to-Noise Ratio (SNR)
        4.1.2 Coherence-Based Speech Intelligibility Index (CSII)
        4.1.3 Short-Time Objective Intelligibility (STOI)
      4.2 Model Training Results
        4.2.1 Experimental Setup
        4.2.2 Model Validation Experiments
      4.3 Model Test Results and Comparison
        4.3.1 Test-Set Signal-to-Noise Ratio Results
        4.3.2 Test-Set Speech Quality Results
        4.3.3 Comparison with the Literature and Summary
    Chapter 5  Conclusions and Future Work
    References

    [1] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2013.
    [2] 羅華強, 類神經網路: MATLAB 的應用 (Neural Networks: Applications in MATLAB). 高立, 2011.
    [3] E. J. Candès, "The restricted isometry property and its implications for compressed sensing," Comptes Rendus Mathematique, vol. 346, no. 9, pp. 589-592, 2008.
    [4] M. Minsky and S. Papert, Perceptrons. MIT Press, 1988.
    [5] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113-120, 1979.
    [6] A. D. Berstein and I. D. Shallom, "An hypothesized Wiener filtering approach to noisy speech recognition," in Proc. IEEE ICASSP, 1991, pp. 913-916.
    [7] E. A. Wan and A. T. Nelson, "Networks for speech enhancement," in Handbook of Neural Networks for Speech Processing. Artech House, Boston, USA, vol. 139, p. 1, 1999.
    [8] F. Xie and D. Van Compernolle, "A family of MLP based nonlinear spectral estimators for noise reduction," in Proc. IEEE ICASSP, 1994, vol. 2, pp. II/53-II/56.
    [9] I. Cohen and B. Berdugo, "Noise estimation by minima controlled recursive averaging for robust speech enhancement," IEEE Signal Processing Letters, vol. 9, no. 1, pp. 12-15, 2002.
    [10] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
    [11] H.-T. Lin, "Machine Learning," http://www.csie.ntu.edu.tw/~htlin/mooc/, 2016.
    [12] A. Ng, "Machine Learning," https://www.coursera.org/learn/machine-learning/home/welcome.
    [13] M. D. Zeiler, "ADADELTA: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
    [14] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, pp. 2121-2159, 2011.
    [15] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.
    [16] M. Nielsen, "Neural Networks and Deep Learning," http://neuralnetworksanddeeplearning.com/, Dec. 2017.
    [17] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Proc. INTERSPEECH, 2011.
    [18] Y. Bengio, "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009.
    [19] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7-19, 2015.
    [20] S. Srinivasan, N. Roman, and D. Wang, "Binary and ratio time-frequency masks for robust speech recognition," Speech Communication, vol. 48, no. 11, pp. 1486-1501, 2006.
    [21] Y. Wang and D. Wang, "Towards scaling up classification-based speech separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1381-1390, 2013.
    [22] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849-1858, 2014.
    [23] M. L. Seltzer, D. Yu, and Y. Wang, "An investigation of deep neural networks for noise robust speech recognition," in Proc. IEEE ICASSP, 2013, pp. 7398-7402.
    [24] A. Narayanan and D. Wang, "Joint noise adaptive training for robust automatic speech recognition," in Proc. IEEE ICASSP, 2014, pp. 2504-2508.
    [25] J. Ma, Y. Hu, and P. C. Loizou, "Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions," The Journal of the Acoustical Society of America, vol. 125, no. 5, pp. 3387-3405, 2009.
    [26] J. Du and Q. Huo, "A speech enhancement approach using piecewise linear approximation of an explicit model of environmental distortions," in Proc. INTERSPEECH, 2008, pp. 569-572.
    [27] S. Rangachari and P. C. Loizou, "A noise-estimation algorithm for highly non-stationary environments," Speech Communication, vol. 48, no. 2, pp. 220-231, 2006.
    [28] "CS231n: Convolutional Neural Networks for Visual Recognition," Stanford University course, http://cs231n.stanford.edu/.
    [29] M. Hermans and B. Schrauwen, "Training and analysing deep recurrent neural networks," in Advances in Neural Information Processing Systems, 2013, pp. 190-198.
    [30] R. Hecht-Nielsen, "Theory of the backpropagation neural network," Neural Networks, vol. 1, Supplement-1, pp. 445-448, 1988.
    [31] P. J. Werbos, "Backpropagation through time: what it does and how to do it," Proceedings of the IEEE, vol. 78, no. 10, pp. 1550-1560, 1990.
    [32] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in Proc. ICML, 2013, pp. 1310-1318.
    [33] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick, "Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks," in Proc. IEEE CVPR, 2016, pp. 2874-2883.
    [34] J. M. Kates and K. H. Arehart, "Coherence and the speech intelligibility index," The Journal of the Acoustical Society of America, vol. 117, no. 4, pp. 2224-2237, 2005.
    [35] B. C. Moore, "Speech processing for the hearing-impaired: successes, failures, and implications for speech mechanisms," Speech Communication, vol. 41, no. 1, pp. 81-91, 2003.
    [36] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125-2136, 2011.
    [37] E. Rothauser, "IEEE recommended practice for speech quality measurements," IEEE Transactions on Audio and Electroacoustics, vol. 17, pp. 225-246, 1969.
    [38] J. S. Garofolo, "Getting started with the DARPA TIMIT CD-ROM: An acoustic phonetic continuous speech database," National Institute of Standards and Technology (NIST), Gaithersburg, MD, vol. 107, 1988.
    [39] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Deep learning for monaural speech separation," in Proc. IEEE ICASSP, 2014, pp. 1562-1566.
    [40] Y. Xu et al., "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65-68, 2014.
    [41] L. Deng et al., "Recent advances in deep learning for speech research at Microsoft," in Proc. IEEE ICASSP, 2013.
    [42] P.-S. Huang et al., "Singing-voice separation from monaural recordings using deep recurrent neural networks," in Proc. ISMIR, 2014.
    [43] 王小川, 語音訊號處理 (Speech Signal Processing). 全華圖書, 2008.
    [44] W. Han et al., "Speech enhancement based on improved deep neural networks with MMSE pretreatment features," in Proc. IEEE ICSP, 2016.

    Full-text availability: on campus, open from 2023-03-01; off campus, not available.
    The electronic thesis has not been authorized for public release; for the print copy, please consult the library catalog.