
Author: 陳奕勳 (Chen, Yi-Shiung)
Title: 利用深度學習迴歸與麥克風陣列除噪系統應用於賣場機器人
(DNN Regression Model and Microphone Array for Noise Reduction System on Supermarket Robot Application)
Advisor: 王駿發 (Wang, Jhing-Fa)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2018
Academic year of graduation: 106 (2017-2018)
Language: English
Pages: 58
Keywords (Chinese): 深度學習迴歸模型, 卷積遞歸神經網絡, 語音增強, 麥克風陣列, 使用者介面
Keywords (English): deep learning regression model, convolutional-recurrent neural network, speech enhancement, microphone array, user interface
    This thesis proposes a noise reduction system that combines a deep-learning regression model with a microphone array, applied to a supermarket robot. The proposed speech enhancement algorithm consists of two phases: training and enhancement. In the training phase, a large amount of data is used to let the DNN model learn a mapping function that separates the estimated clean speech features from the noisy speech features; no assumption is made about the relationship between noisy and clean speech. In addition, a convolutional-recurrent neural network is added to improve the performance of the original DNN model: the convolutional network extracts features, while the recurrent network models temporal structure. In the enhancement phase, audio captured by the microphone array is preprocessed and fed into the trained convolutional-recurrent network to obtain clean speech features, from which the enhanced speech is finally reconstructed. In the speech enhancement experiments, our model raises the PESQ score by 0.83, showing that it maintains a consistent noise-suppression effect under different noise conditions. In the speech recognition experiments, our model effectively improves the recognition correct rate by 0.73%, showing that it also suppresses noise and improves speech quality in a real environment. The speech enhancement system overcomes noise interference in real life, increases speech recognition accuracy, and provides correct input sentences to the back-end dialogue system. The microphone array is used for audio capture, which not only supports front-end noise reduction but also extends the system's pickup range. The user interface helps users interact with the dialogue system more easily.

    In this thesis, we propose a DNN regression model and microphone array noise reduction system for a supermarket robot. The proposed speech enhancement algorithm is divided into two phases: a training phase and an enhancement phase. In the training phase, a large amount of data is used to let the DNN model learn a mapping function that separates the estimated clean speech features from the noisy speech features; no assumption is made about the relationship between noisy speech and clean speech. In addition, a convolutional-recurrent neural network is added to improve the performance of the original DNN model, taking advantage of convolutional networks for feature extraction and of recurrent networks for modeling temporal structure. In the enhancement phase, audio captured by the microphone array is fed into the trained network to estimate clean speech features, from which the enhanced speech is reconstructed. In the speech enhancement experiments, our model increases the PESQ score by 0.83, showing that it maintains a consistent noise-suppression effect under different noise tests. In the speech recognition experiments, our model also effectively improves the recognition correct rate by 0.73%, showing that it suppresses noise and improves speech quality in the real field as well. This speech enhancement system is used to overcome noise interference in real life, increase the accuracy of speech recognition, and provide correct input sentences for the back-end dialogue system. The microphone array is used for collecting audio, which not only helps front-end noise reduction but also extends the system's pickup distance. The conversational user interface helps users interact with the dialogue system more easily.
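The enhancement phase described above can be sketched as follows: the noisy waveform is framed and transformed to log-magnitude spectra, the trained regression network maps them to clean estimates, and the waveform is rebuilt by recombining the estimated magnitudes with the noisy phase and overlap-adding. This is a minimal illustrative sketch, not the thesis's exact configuration: the frame sizes, the Hamming window, the phase-reuse reconstruction, and the placeholder `model` callable are all assumptions.

```python
import numpy as np

def enhance(noisy, model, frame_len=512, hop=256, eps=1e-8):
    """Enhancement-phase sketch: a trained network maps noisy log-magnitude
    spectra to clean estimates; the waveform is rebuilt with the noisy phase.
    Frame sizes and the reconstruction scheme are illustrative assumptions."""
    window = np.hamming(frame_len)
    # Frame the signal into overlapping windowed segments
    n_frames = 1 + (len(noisy) - frame_len) // hop
    frames = np.stack([noisy[i * hop:i * hop + frame_len]
                       for i in range(n_frames)]) * window
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # The trained regression model estimates clean log-magnitude features
    clean_log_mag = model(np.log(mag + eps))
    # Recombine the estimated magnitude with the noisy phase
    clean_spec = np.exp(clean_log_mag) * np.exp(1j * phase)
    # Weighted overlap-add resynthesis back to a waveform
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for i, frame in enumerate(np.fft.irfft(clean_spec, n=frame_len, axis=1)):
        out[i * hop:i * hop + frame_len] += frame * window
        norm[i * hop:i * hop + frame_len] += window ** 2
    return out / np.maximum(norm, eps)

# Identity "model" as a stand-in for the trained network
noisy = np.random.randn(16000)
denoised = enhance(noisy, model=lambda x: x)
print(denoised.shape)  # (16000,)
```

With the identity stand-in, the overlap-add resynthesis recovers the input signal almost exactly in the fully covered region, which is a useful sanity check before plugging in a real trained network.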

    中文摘要 (Chinese Abstract)
    Abstract
    致謝 (Acknowledgements)
    Content
    Figure List
    Table List
    Chapter 1. Introduction
        1.1 Background
        1.2 Motivation
        1.3 Thesis Objective
        1.4 Thesis Organization
    Chapter 2. Related Works
        2.1 Overview of Service Robots
        2.2 Overview of Single-Channel Speech Enhancement
            2.2.1 Speech Enhancement Algorithms based on Spectral Subtraction
            2.2.2 Speech Enhancement Algorithms based on Deep Neural Network
            2.2.3 Speech Enhancement Algorithms based on Generative Adversarial Network (GAN)
            2.2.4 Noise Aware Training (NAT)
        2.3 Overview of Multi-Channel Speech Enhancement
            2.3.1 Microphone Array
        2.4 User Interface
    Chapter 3. Regression Model for Speech Enhancement Systems based on Deep Neural Networks
        3.1 System Overview
        3.2 Data Preprocessing and Log Spectral Feature Extraction
            3.2.1 Preprocessing and Fourier Transform
                3.2.1.1 Noisy Sample Generation and Normalization
                3.2.1.2 Framing
                3.2.1.3 Hamming Window
                3.2.1.4 Fast Fourier Transform (FFT)
            3.2.2 Log Spectral Feature Extraction
                3.2.2.1 Take Logarithm
                3.2.2.2 Feature Scaling
        3.3 Deep Neural Networks Architecture
            3.3.1 Deep Learning
            3.3.2 Deep Neural Network (DNN)
                3.3.2.1 Deep Neural Network Architecture Design
                3.3.2.2 Forward-Propagation
                3.3.2.3 Loss Function
                3.3.2.4 Regularization
                3.3.2.5 Back-Propagation Algorithm
        3.4 Convolutional-Recurrent Neural Networks Architecture
            3.4.1 Convolutional-Recurrent Neural Network Design
            3.4.2 Convolutional Component
            3.4.3 Long Short-Term Memory Networks
            3.4.4 Bi-directional Recurrent Component
            3.4.5 Fully-connected Component and Optimization
        3.5 Enhancement Phase of the System
        3.6 Human Interface of the Dialogue System
    Chapter 4. Experimental Results
        4.1 Experimental Environment
        4.2 Objective Evaluation Method for Speech Enhancement Algorithm
        4.3 Experiment for Speech Enhancement System
            4.3.1 Experimental Database Collection
            4.3.2 Dataset and Experiment Setup
            4.3.3 Experimental Results
        4.4 Evaluation for the Proposed System
            4.4.1 Speech Recognition Correct Rate Evaluation
    Chapter 5. Conclusion and Future Work
        5.1 Conclusion
        5.2 Future Work
    Reference
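Section 3.2 of the outline above walks through a feature-extraction chain: framing, Hamming windowing, FFT, taking the logarithm, and feature scaling. A minimal sketch of that chain, assuming illustrative frame sizes and per-bin standardization for the scaling step (the thesis's exact parameters are not given here):

```python
import numpy as np

def log_spectral_features(signal, frame_len=512, hop=256, eps=1e-8):
    """Log-magnitude spectral features: framing -> Hamming window -> FFT
    -> logarithm -> feature scaling. Frame length, hop, and the scaling
    scheme are illustrative assumptions."""
    # Framing: slice the signal into overlapping frames (3.2.1.2)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    # Hamming window reduces spectral leakage at frame edges (3.2.1.3)
    frames = frames * np.hamming(frame_len)
    # FFT: keep the non-redundant half of the magnitude spectrum (3.2.1.4)
    mag = np.abs(np.fft.rfft(frames, axis=1))
    # Logarithm compresses the dynamic range (3.2.2.1)
    log_mag = np.log(mag + eps)
    # Feature scaling: zero mean, unit variance per frequency bin (3.2.2.2)
    mean, std = log_mag.mean(axis=0), log_mag.std(axis=0) + eps
    return (log_mag - mean) / std

# Example: features for one second of white noise at 16 kHz
feats = log_spectral_features(np.random.randn(16000))
print(feats.shape)  # (n_frames, frame_len // 2 + 1)
```

Frames of this form are what the regression network consumes: each row is one time frame, each column one frequency bin, standardized so the network trains on comparable scales.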


    Full-text availability: on campus, public from 2023-08-31; off campus, not available.
    The electronic thesis has not been authorized for public release; for the print copy, consult the library catalog.