| Graduate student: | Lin, Yuan-Ning (林苑寧) |
|---|---|
| Thesis title: | On-Line Multi-Speaker Adaptation Based on SVM and MLLR for Ubiquitous Speech Recognition System (應用SVM與MLLR於多人線上語者調適之泛在語音辨識系統) |
| Advisor: | Wang, Jhing-Fa (王駿發) |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Year of publication: | 2008 |
| Academic year of graduation: | 96 |
| Language: | English |
| Number of pages: | 70 |
| Keywords (Chinese): | 泛在 (ubiquitous), 語者調適 (speaker adaptation), MLLR, SVM |
| Keywords (English): | SVM, speaker adaptation, MLLR, ubiquitous |
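Technology always comes from human nature. Speech recognition is becoming increasingly common in daily life, yet it still leaves much room for research and improvement. Improving the technology so that it brings people more convenience has always been the goal of our efforts.
Most speech recognition technology used in daily life today either relies on a speaker-dependent acoustic model recognizer that the user must train in advance, as in voice-controlled toys and voice dialing on mobile phones, or uses a speaker-independent acoustic model. We believe that in a home environment with a fixed set of family members, recognition with only a speaker-independent acoustic model still falls short in convenience. We therefore propose an on-line multi-speaker adaptation architecture for ubiquitous speech recognition systems. It recognizes speech with a model suited to each individual family member, which raises the recognition rate, and it also performs speaker adaptation for each family member on-line, in real time, while they use the system, so that the speech model moves closer to the user's speech characteristics.
This thesis proposes an on-line multi-speaker adaptation architecture based on SVM and MLLR for ubiquitous speech recognition systems. The architecture is divided into three phases: training, recognition, and adaptation. In the training phase, the system uses limited training data to train an SVM speaker recognizer and to estimate the MLLR regression matrices that build the eigenspace. In the recognition phase, the system uses the SVM result to retrieve the test speaker's MLLR regression matrix from the MLLR regression matrix library and combines it with the speaker-independent acoustic model to perform speech recognition. In the adaptation phase, the system applies a confidence measure to decide whether a recognition result may serve as data for speaker adaptation, which lowers the chance that noise or recognition errors mistrain the acoustic model. It improves on conventional MLLR speaker adaptation by combining MLLR regression matrices from three sources with weights, and it updates the MLLR regression matrix library to achieve real-time on-line speaker adaptation. Experimental results show that the architecture raises the speech recognition rate by 3% to 8% on average.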
Technology always comes from human nature. Speech recognition is increasingly common in everyday applications, yet it still leaves considerable room for improvement. Improving the technology so that it brings people more convenience has been our ongoing goal.
Currently, most everyday speech recognition applications use either a speaker-dependent model that the user must train beforehand, as in voice-activated toys and voice dialing on mobile phones, or a speaker-independent model. We argue that in a home environment with a fixed set of family members, recognition with a speaker-independent model alone is not convenient enough. This thesis therefore proposes an on-line multi-speaker adaptation architecture based on SVM and MLLR for ubiquitous speech recognition systems. The system recognizes speech with a model suited to each family member, which improves accuracy, and it also adapts the acoustic model on-line toward each speaker's characteristics while the system is in use.
The proposed architecture improves on the adaptation modeling accuracy of the conventional maximum likelihood linear regression (MLLR) technique. The system has three phases: training, recognition, and adaptation. In the training phase, we estimate MLLR regression matrix sets for the Gaussian mixture model (GMM) parameters to construct the eigenspace, and we build an SVM speaker classification model from a small amount of training data. In the recognition phase, the speaker class returned by the SVM classifier selects an MLLR transformation matrix set from the MLLR transformation matrix set database; we combine that set with the speaker-independent model and recognize the test speech with the resulting adapted model. In the adaptation phase, we replace the present MLLR transformation matrix set with an adapted set formed as a weighted merge of three transformation matrix sets: 1) the present MLLR transformation matrix set; 2) the MLLR transformation matrix set estimated by maximum likelihood from the eigenspace and the recognition result; and 3) the MLLR transformation matrix set adapted from speech recognition results that pass a confidence measure, which reduces erroneous training caused by noise or recognition errors.
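The recognition and adaptation phases described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the merge weights, the confidence threshold, and the choice to gate only the third source are all assumptions made for illustration, and the speaker label is taken as given (it stands in for the output of the SVM classifier). The only standard piece is the MLLR mean update, in which a transform W = [b | A] maps a Gaussian mean μ to Aμ + b.

```python
# Illustrative sketch only: names, weights, and threshold are hypothetical.

def apply_mllr(mean, transform):
    """Adapt one Gaussian mean with an MLLR transform W = [b | A].

    Each row of `transform` is [b_i, a_i1, ..., a_in], so the adapted
    mean is W applied to the extended vector [1, mean], i.e. A*mean + b.
    """
    extended = [1.0] + list(mean)
    return [sum(w * x for w, x in zip(row, extended)) for row in transform]


def merge_transforms(transforms, weights):
    """Weighted element-wise average of same-shaped transforms."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    rows, cols = len(transforms[0]), len(transforms[0][0])
    return [
        [sum(w * t[r][c] for w, t in zip(weights, transforms)) / total
         for c in range(cols)]
        for r in range(rows)
    ]


def adapt_speaker(library, speaker, eigen_w, result_w, confidence,
                  threshold=0.7, weights=(0.5, 0.3, 0.2)):
    """Update the stored transform for `speaker` after one utterance.

    The three sources mirror the abstract: the present transform, the
    eigenspace estimate, and the estimate from the recognition result.
    Assumption: the third source is dropped when `confidence` is below
    `threshold`; the abstract does not specify the exact gating rule.
    """
    w_cur, w_eig, w_res = weights
    if confidence < threshold:
        w_res = 0.0  # ignore the unreliable recognition-based estimate
    library[speaker] = merge_transforms(
        [library[speaker], eigen_w, result_w], (w_cur, w_eig, w_res))
    return library[speaker]


if __name__ == "__main__":
    identity = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # b = 0, A = I
    shifted = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]   # b = [1, 1], A = I
    library = {"speaker_a": identity}              # per-speaker transforms
    # Recognition phase: look up the transform for the classified speaker.
    print(apply_mllr([2.0, 3.0], library["speaker_a"]))  # [2.0, 3.0]
    print(apply_mllr([2.0, 3.0], shifted))               # [3.0, 4.0]
    # Adaptation phase: merge the three sources and store the result.
    adapt_speaker(library, "speaker_a", shifted, shifted, confidence=0.9)
```

Normalizing by the sum of the weights keeps the merged transform a convex combination of the sources that remain after gating, so a rejected utterance shifts weight to the other two sources rather than shrinking the transform.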
The experimental results show that, with speaker adaptation, the proposed method improves speech recognition accuracy by about 3% to 8% on average.