
Author: Liu, Yi-Cheng (劉益呈)
Thesis Title: Speech Enhancement by Neural Network for Old Voice Recording (運用神經網路於老式錄音的語音增強研究)
Advisor: Kuo, Chih-Hung (郭致宏)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2020
Graduation Academic Year: 109 (2020–2021)
Language: Chinese
Pages: 59
Keywords (Chinese): speech denoising, old recordings, generative adversarial network, click noise
Keywords (English): speech enhancement, generative adversarial network, old voice recording
Access Count: 147 views, 11 downloads
Abstract (translated from the Chinese): Most networks applied to speech enhancement focus only on noise reduction while ignoring the distortion introduced when noise is removed, where distortion means that speech information is destroyed. This thesis therefore proposes a preprocessing step and a repair-denoise network, applied to old recordings. The preprocessing step removes click noise and the segments severely corrupted by noise. The repair-denoise network consists of two networks, a repair network and a denoise network: the repair network uses a time-domain generative adversarial network to restore the speech gaps caused by the preprocessing step, and the denoise network uses an advanced two-stage network together with a frequency-domain generative adversarial network to remove background noise, exploiting the strong generative capability of the frequency-domain GAN to repair the distortion that the advanced two-stage network introduces during denoising. Most denoising architectures remove noise well but introduce correspondingly more distortion, and for old recordings the distortion is often aggravated because no training dataset in the same language as the recording is available. Through the generative capability of the GAN, the distortion can be repaired and a result closer to the human voice can be obtained. Experimental results show that the algorithm proposed in this thesis achieves higher speech quality with less distortion, and outperforms other denoising methods in STOI and PESQ.
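As a rough illustration of the declick → repair → denoise flow described in the abstract, the following is a minimal Python/PyTorch sketch. Every name in it (declick, RepairNet, DenoiseNet) is a hypothetical stand-in rather than the thesis code, and the tiny convolutional stacks are placeholders, not the actual TGAN, aTSN, or FGAN architectures.

```python
import torch
import torch.nn as nn

def declick(wave: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    # Stand-in for the preprocessing step: mute samples whose amplitude
    # exceeds a threshold, silencing clicks and badly corrupted parts.
    return wave * (wave.abs() < threshold).float()

class RepairNet(nn.Module):
    # Stand-in for the TGAN generator: a small 1-D convolutional
    # encoder-decoder that fills the silenced gaps in the waveform.
    def __init__(self) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=31, padding=15), nn.PReLU(),
            nn.Conv1d(16, 1, kernel_size=31, padding=15),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class DenoiseNet(RepairNet):
    # Stand-in for aTSN + FGAN; it reuses the same tiny conv stack
    # purely to keep the sketch short.
    pass

def enhance(noisy: torch.Tensor) -> torch.Tensor:
    # End-to-end flow of the proposed method: declick -> repair -> denoise.
    x = declick(noisy).view(1, 1, -1)   # (batch, channel, samples)
    x = RepairNet()(x)
    x = DenoiseNet()(x)
    return x.view(-1)

wave = torch.randn(16000)                # one second of audio at 16 kHz
enhanced = enhance(wave)
print(enhanced.shape)                    # torch.Size([16000])
```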

Most of the neural networks used in speech enhancement focus only on noise reduction but ignore the mitigation of other sources of distortion, which crucially mark the difference between clean speech and denoised speech. This study proposes a preprocessing step and a repair-denoise network (RDN) model for speech enhancement of old speech recordings. The preprocessing step silences the click noise and the seriously corrupted parts. The RDN model consists of two networks: a repair network and a denoise network. We use a time-domain generative adversarial network (TGAN) to repair the silence caused by the declick step, and we use an advanced two-stage network (aTSN) together with a frequency-domain generative adversarial network (FGAN) as the denoise network, where the FGAN repairs the distorted output of the aTSN. Most denoise architectures provide good noise reduction, but the distortion they introduce remains an unresolved problem, and it gets worse when no training dataset in the same language is available, as is typical for old speech recordings. Using the powerful generation capability of GANs, we can repair the distortion and obtain a result closer to the human voice. Experimental results show that the proposed RDN model retains high speech quality with lower distortion, outperforming the baseline methods in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ).
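For context on the reported metrics, the sketch below shows how STOI and PESQ scores are commonly computed with the open-source pystoi and pesq Python packages. The use of these particular packages, the placeholder file names, and the 16 kHz wideband setting are assumptions of this illustration, not details taken from the thesis.

```python
import soundfile as sf            # pip install soundfile
from pystoi import stoi           # pip install pystoi
from pesq import pesq             # pip install pesq

# File names are placeholders: a clean reference and the model output,
# both mono and sampled at 16 kHz.
clean, fs = sf.read("clean.wav")
enhanced, _ = sf.read("enhanced.wav")

# STOI lies in [0, 1]: higher means more intelligible speech.
stoi_score = stoi(clean, enhanced, fs, extended=False)

# PESQ (wideband) spans roughly -0.5 to 4.5: higher means better quality.
pesq_score = pesq(fs, clean, enhanced, "wb")

print(f"STOI: {stoi_score:.3f}  PESQ: {pesq_score:.2f}")
```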

Table of Contents
  Chinese Abstract
  English Abstract
  Acknowledgements
  Table of Contents
  List of Tables
  List of Figures
  Chapter 1  Introduction
    1-1 Preface
    1-2 Research Motivation
    1-3 Research Contributions
    1-4 Thesis Organization
  Chapter 2  Background
    2-1 Speech Pre- and Post-processing
      2-1-1 Time-Domain Input
      2-1-2 Frequency-Domain Input
    2-2 Neural Network (NN)
      2-2-1 Back Propagation (BP)
      2-2-2 Activation Functions
      2-2-3 Convolutional Neural Networks (CNN)
    2-3 Generative Adversarial Network (GAN)
      2-3-1 Traditional GAN
      2-3-2 Conditional GAN (cGAN)
      2-3-3 Wasserstein GAN (WGAN)
  Chapter 3  Related Work on Denoising Algorithms
    3-1 Deep Neural Network (DNN)
    3-2 Multi-target Learning Network (MTL network)
    3-3 Speech Enhancement Generative Adversarial Network (SEGAN)
    3-4 Two-Stage Network (TSN)
    3-5 Comparison of Related Denoising Studies
  Chapter 4  Machine-Learning-Based Speech Denoising Algorithm
    4-1 Declick Module
    4-2 Time-Domain Generative Adversarial Network (TGAN)
    4-3 Denoise Network
      4-3-1 Noise-Aware Feature
      4-3-2 Advanced Two-Stage Network (aTSN)
      4-3-3 Frequency-Domain Generative Adversarial Network (FGAN)
      4-3-4 Loss Functions
  Chapter 5  Experimental Environment and Data Analysis
    5-1 Experimental Environment
      5-1-1 Experimental Data
      5-1-2 Experimental Setup
      5-1-3 Speech Quality Evaluation Methods
    5-2 Comparison of Experimental Results
      5-2-1 TGAN Test Results
      5-2-2 Noise-Aware Feature Analysis
      5-2-3 Analysis of Denoise Network Architecture Improvements
      5-2-4 Comparison of Denoising Performance with Related Work
      5-2-5 Denoising Results on Old Recordings
  Chapter 6  Conclusion and Future Work
    6-1 Conclusion
    6-2 Future Work
  References


Full text available on campus: 2022-11-06
Full text available off campus: 2022-11-06