| | |
|---|---|
| Author: | 江承峻 Chiang, Cheng-Chun |
| Thesis title: | 群組稀疏隱藏式馬可夫模型應用於語音辨識 (Group Sparse Hidden Markov Models for Speech Recognition) |
| Advisor: | 簡仁宗 Chien, Jen-Tzung |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of publication: | 2011 |
| Academic year of graduation: | 99 (ROC academic year) |
| Language: | Chinese |
| Number of pages: | 110 |
| Keywords (Chinese): | 語音辨識、噪音環境、隱藏式馬可夫模型、貝氏法則、稀疏表示法、群組稀疏表示法、拉普拉斯比率混合分佈 |
| Keywords (English): | Speech Recognition, Noise Environment, Hidden Markov Model, Bayesian Rule, Sparse Representation, Group Sparse Representation, Laplacian Scale Mixture Distribution |
Acoustic models based on hidden Markov models (HMMs) have been successfully and widely applied to speech recognition and to various multimedia classification and recognition systems. However, when the state observations are modeled by Gaussian mixture models and the parameters are trained under the maximum likelihood criterion, using too many Gaussian components causes over-fitting and lowers the recognition accuracy on test speech. Moreover, real-world speech recognition systems operate in heterogeneous environments, so overcoming noise interference and compensating for the mismatch between test speech and the trained models, in order to achieve model regularization with good prediction performance, has become a pressing issue in current speech recognition research.

To resolve over-fitting and build a speech recognition system for heterogeneous environments, this thesis develops a group sparse representation based on Bayesian theory and uses it to train a novel family of group sparse hidden Markov models (GS-HMMs) for speech recognition. In the group sparse representation, the statistics of each speech frame feature vector under the different HMM states are generated from two groups of bases: a common basis and an individual basis, which reflect the characteristics shared among HMM states (inter-state) and the characteristics specific to each state (intra-state), respectively. The combination weights associated with each group of bases are assumed to follow a Laplacian scale mixture distribution, which is a Laplacian distribution multiplied by an inverse scale mixture variable. The Laplacian distribution is itself a sparse or spiky distribution, and the inverse scale mixture makes it even spikier. By using the inverse scale mixture to control the degree of sparsity of the weights for each group of bases, the method realizes automatic relevance determination (ARD), automatically selecting the relevant basis vectors from the two groups to represent the speech feature vectors along the Markov chain.

In model inference, we apply Bayesian learning and solve for the combination weights, the common basis, and the individual basis under the maximum a posteriori criterion. The combination weights are estimated separately for each speech frame, while the HMM-level common basis and the state-level individual basis are estimated from all training data via the expectation-maximization (EM) algorithm. In the experiments, we use the Aurora2 speech database to evaluate the recognition performance of GS-HMM-based acoustic models under various environmental noises, to assess the degree of sparsity when representing the feature vectors of continuous speech, and to verify the prediction capability of the proposed method on test speech.
Hidden Markov models (HMMs) have been successfully developed for acoustic modeling and are widely applied in state-of-the-art speech recognition systems as well as other multimedia classification and recognition systems. Conventionally, each Markov state in a continuous-density HMM is characterized by a set of Gaussian mixture components estimated according to the maximum likelihood (ML) criterion. However, ML-based HMMs tend to grow into overly large models with too many Gaussian parameters, so the over-fitting problem becomes serious and the recognition performance on test speech is substantially degraded. Moreover, in real-world applications, speech data are inevitably collected in heterogeneous environments involving noise interference, mislabeling effects, sparse data, and related problems. The mismatch between training and test environments further deteriorates recognition performance. Accordingly, how to build a robust speech recognition system with good model regularization has become a crucial issue in real-world applications.
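For reference, the conventional continuous-density HMM mentioned above models each state's emission probability as a Gaussian mixture trained by ML; the formulation below is the standard textbook one, with assumed symbol names, rather than the thesis's own notation.

```latex
% Standard GMM-HMM state emission density (textbook notation, assumed symbols):
% state j emits feature vector o_t through K Gaussian mixture components.
\begin{align}
  b_j(\mathbf{o}_t) &= \sum_{k=1}^{K} c_{jk}\,
    \mathcal{N}\!\left(\mathbf{o}_t \mid \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk}\right),
  \qquad \sum_{k=1}^{K} c_{jk} = 1 \\
% ML training maximizes the likelihood of the training data O over all HMM
% parameters; as K grows, the number of free parameters grows quickly,
% which is the source of the over-fitting risk discussed above.
  \Lambda_{\mathrm{ML}} &= \arg\max_{\Lambda}\; p(\mathbf{O} \mid \Lambda)
\end{align}
```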
To tackle the over-fitting problem and construct a classifier for heterogeneous environments, we follow Bayesian theory and develop a group sparse representation for speech features. In particular, we establish the framework of group sparse hidden Markov models (GS-HMMs) and apply it to robust speech recognition. In this framework, the sequence of speech feature vectors is driven by a Markov chain, and each feature vector is represented by two groups of basis vectors: the common basis vectors and the individual basis vectors. The common basis vectors represent the feature vectors shared across the different states of an HMM, which is associated with a word. The individual basis vectors are state-dependent and are introduced to compensate for the residual information that the common basis vectors cannot model. Importantly, we adopt the Laplacian scale mixture distribution as the prior distribution characterizing the randomness of the weight parameters. This distribution is obtained by multiplying a Laplacian distribution by an inverse scale mixture parameter. The Laplacian distribution is already sparse or spiky, and the inverse scale mixture makes it even more so. This parameter controls the degree of sparsity and is applied to realize automatic relevance determination (ARD). Guided by the Markov chain, the speech features automatically adopt the relevant basis vectors from the two groups for feature representation, and model regularization is thereby assured in GS-HMM-based acoustic models.
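A compact way to write the representation described above is as a two-group linear model with a Laplacian scale mixture prior on the weights; the symbols below (Φ, w, λ) are illustrative choices rather than the thesis's own notation.

```latex
% Two-group representation of a frame o_t occupying state s_t (illustrative):
% Phi^c = common basis shared by all states of the word HMM,
% Phi^{s_t} = individual basis of the occupied state, eps_t = residual.
\begin{align}
  \mathbf{o}_t &\approx \boldsymbol{\Phi}^{c}\,\mathbf{w}^{c}_t
      + \boldsymbol{\Phi}^{s_t}\,\mathbf{w}^{s_t}_t + \boldsymbol{\epsilon}_t \\
% Laplacian scale mixture prior on each weight: a Laplacian whose inverse
% scale lambda_i is itself random, yielding a spikier marginal and the ARD
% effect of switching irrelevant basis vectors off.
  p(w_i \mid \lambda_i) &= \frac{\lambda_i}{2}\,\exp\!\left(-\lambda_i |w_i|\right),
  \qquad
  p(w_i) = \int p(w_i \mid \lambda_i)\, p(\lambda_i)\, d\lambda_i
\end{align}
```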
We conduct Bayesian learning in the model inference procedure of GS-HMMs. The GS-HMM parameters are estimated according to the maximum a posteriori (MAP) criterion. The word-dependent common basis vectors and the state-dependent individual basis vectors are calculated by the expectation-maximization (EM) algorithm, and during this optimization the frame-based weight parameters are also estimated under the MAP principle. In the experiments, we adopt the Aurora2 speech database and investigate the acoustic modeling performance of GS-HMMs. The effectiveness of GS-HMMs is demonstrated by speech recognition evaluations under different noisy environments, and the prediction of test speech and the group sparsity of the feature representation are also examined.
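To make the frame-level MAP step concrete, the sketch below solves the L1-penalized (Laplacian-prior) least-squares problem for the weights of a single frame with a simple iterative soft-thresholding loop. This is a minimal illustration of the generic sparse-coding step under assumed array shapes and function names, not the thesis's actual EM/MAP implementation.

```python
import numpy as np

def soft_threshold(x, tau):
    """Element-wise soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def map_frame_weights(o_t, Phi, lam, n_iter=200):
    """MAP estimate of the combination weights for one frame o_t.

    Solves  min_w  0.5 * ||o_t - Phi @ w||^2 + lam * ||w||_1
    (Gaussian residual + Laplacian prior) by iterative soft-thresholding (ISTA).
    Phi stacks the common and individual basis vectors column-wise.
    """
    L = np.linalg.norm(Phi, ord=2) ** 2      # Lipschitz constant of the gradient
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ w - o_t)       # gradient of the quadratic term
        w = soft_threshold(w - grad / L, lam / L)
    return w

# Toy usage: a 39-dim frame with 20 common + 10 individual bases (hypothetical sizes).
rng = np.random.default_rng(0)
Phi = np.hstack([rng.standard_normal((39, 20)), rng.standard_normal((39, 10))])
o_t = rng.standard_normal(39)
w = map_frame_weights(o_t, Phi, lam=0.5)
print("non-zero weights:", np.count_nonzero(w), "of", w.size)
```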