| Graduate Student | 闕壯華 Chueh, Chuang-Hua |
|---|---|
| Thesis Title | 強健性語言模型於語音辨識之研究 Flexible Language Models for Speech Recognition |
| Advisor | 簡仁宗 Chien, Jen-Tzung |
| Degree | Doctoral (博士) |
| Department | 電機資訊學院 資訊工程學系 (College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering) |
| Year of Publication | 2010 |
| Academic Year of Graduation | 98 |
| Language | English |
| Pages | 115 |
| Chinese Keywords | 語言模型 (language modeling), 語音辨識 (speech recognition) |
| Foreign-Language Keywords | Bayesian learning, language modeling, natural language processing, speech recognition |
Language models have been widely used in speech recognition, information retrieval, and other related information systems, and among them the statistical n-gram language model is the most common. However, this model still suffers from several problems that need to be addressed, including the lack of long-distance information, mismatch with the test environment, the estimation of discriminative model parameters, and data sparseness. To improve the robustness and flexibility of language models, this dissertation investigates these four problems and proposes solutions to each of them.
First, for long-distance information extraction, this dissertation proposes two methods. We construct a latent semantic space, project the training documents into that space, and cluster them so that semantically similar documents are grouped to reflect particular topics; the latent topic information in the documents is thereby extracted, and the maximum entropy (ME) principle is then used to combine topic and n-gram information into a long-distance, topic-based language model. In addition, to capture finer-grained topic information, we model the variation of word distributions within a document by embedding a hidden Markov model into the latent Dirichlet allocation (LDA) topic model. The resulting segmented topic model extracts more delicate topic regularities and further mitigates the lack of long-distance information.

Second, for the mismatch between training and test environments, the minimum discrimination information (MDI) criterion has been successfully applied to language model adaptation using unigram constraints. However, unigram features can hardly capture the finer characteristics of the adaptation corpus, while higher-order n-gram features suffer from unreliable estimates when the adaptation data are insufficient. This dissertation proposes using hypothesis testing to derive confidence intervals of the feature statistics from the adaptation data, building inequality constraints from these intervals, and adapting the n-gram parameters under the MDI criterion; in this way reliable features are selected automatically for model adaptation.

Third, for discriminative training, the candidate word strings produced during speech recognition and the input speech signal are mutually dependent, and the acoustic and language models should take this relationship into account. We therefore propose an integrated maximum entropy model as the core framework of the speech recognizer and show how linguistic and acoustic features are combined and trained under a unified model. Unlike the conventional plug-in posterior decision rule, which requires a combination coefficient between the acoustic and language models, the ME principle considers acoustic and linguistic features directly, effectively integrating the two information sources and building an optimal posterior distribution that minimizes the error probability.

Finally, for insufficient training data, maximum likelihood estimation assigns zero probability to n-grams unseen in the training corpus. This dissertation proposes a novel class-based language model in which the history word sequence is projected onto a latent class space to achieve parameter sharing, which effectively alleviates the sparse-data problem. A variational Bayesian algorithm is used to maximize the marginal likelihood of the training data and estimate the parameters of this Bayesian class-based language model. The study further incorporates, through Bayesian theory, long-distance class information into the generation of the class mixture weights to build a cache class language model, which additionally improves the capture of long-distance information.
For system evaluation, this dissertation evaluates each of the proposed language modeling methods on news corpora and applies them in a large-vocabulary continuous speech recognition system. For the different language modeling problems, we assess the improvements in model perplexity and recognition accuracy achieved by the proposed methods along four directions: long-distance language modeling, model adaptation, discriminative language modeling, and language model smoothing.
Statistical n-gram language models play important roles in many human-machine interaction systems such as automatic speech recognition, machine translation, and information retrieval. However, n-gram models suffer from the problems of insufficient long-distance information, domain mismatch, weak model discrimination, and data sparseness. This dissertation presents solutions that improve the robustness and flexibility of language models for large-vocabulary continuous speech recognition.
For the first issue, insufficient long-distance information, we present a new information source extracted by latent semantic analysis (LSA) and adopt the maximum entropy (ME) principle to integrate it into an n-gram language model. Under the ME approach, each information source contributes a set of constraints that must be satisfied, and the hybrid statistical language model is estimated with maximum randomness subject to these constraints. For comparison, we also carry out knowledge integration via linear interpolation. Moreover, we build a segmented topic model (STM) that embeds a Markov chain into the latent Dirichlet allocation (LDA) topic model to extract more sophisticated topic regularities; the varying usage of words across paragraphs is explicitly characterized. The long-distance topic information obtained by the STM is applied in a language model adaptation scheme for speech recognition.
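To make the constrained estimation concrete, a minimal sketch of an ME hybrid model is given below in illustrative notation; the feature functions $f_i$, weights $\lambda_i$, and history $h$ are generic placeholders rather than the thesis's exact formulation. Each n-gram or LSA-topic event defines a binary feature, and the model with maximal entropy that matches the empirical feature expectations is selected.

```latex
% Sketch of an ME hybrid language model: f_i are binary features triggered by
% n-gram or topic events in the history h, lambda_i their weights, and
% \tilde{P} the empirical distribution of the training corpus.
P_{\Lambda}(w \mid h) =
  \frac{\exp\big(\sum_i \lambda_i f_i(h, w)\big)}
       {\sum_{w'} \exp\big(\sum_i \lambda_i f_i(h, w')\big)},
\qquad
E_{P_{\Lambda}}[f_i] = E_{\tilde{P}}[f_i] \quad \forall i .
```

The weights $\lambda_i$ can be found with generalized iterative scaling or a gradient-based optimizer; linear interpolation, by contrast, simply mixes the component models with fixed weights.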
Next, we focus on language model adaptation to alleviate the domain mismatch problem. The minimum discrimination information (MDI) criterion is discussed. MDI adaptation with unigram constraints has been successfully applied to speech recognition owing to its computational efficiency. However, unigram features capture only low-level information from the adaptation articles and are too coarse to attain precise adaptation. It is therefore desirable to induce high-order features and exploit more delicate information when sufficient adaptation data are available. In this study, we adaptively select reliable features by re-sampling the adaptation data and computing statistical confidence intervals. The reliable regions define inequality constraints for MDI adaptation, so that interval estimation is performed instead of point estimation and the features are selected automatically within the whole procedure.
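A hedged sketch of what MDI adaptation with interval constraints could look like is given below; the background model $p_B$, feature statistics $f_i$, and interval bounds $[L_i, U_i]$ are illustrative names, not the thesis's notation.

```latex
% MDI adaptation sketch: find the model closest to the background model p_B in
% Kullback-Leibler divergence while keeping every feature expectation inside the
% confidence interval [L_i, U_i] estimated from the re-sampled adaptation data.
\hat{p} = \arg\min_{p} \; D(p \,\|\, p_B)
        = \arg\min_{p} \sum_{x} p(x) \log \frac{p(x)}{p_B(x)},
\qquad
\text{s.t.} \quad L_i \le E_{p}[f_i] \le U_i \quad \forall i .
```

Features whose intervals are wide impose only loose constraints and are effectively deselected, whereas narrow intervals behave almost like the equality (point) constraints of standard MDI adaptation.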
Regarding the issue of model discrimination, this dissertation presents a new discriminative model that performs joint acoustic and linguistic modeling for speech recognition, merging acoustic evidence into linguistic parameters and linguistic evidence into acoustic parameters under the ME principle. The resulting mutual ME (MME) model calculates the sentence posterior probability, where the mutual dependence is embedded not only in the acoustic parameters but also in the linguistic parameters, and the model is trained discriminatively.
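An illustrative form of such a joint posterior is sketched below; the acoustic features $f_a$, linguistic features $f_l$, and their weights are generic placeholders, not the exact MME parameterization.

```latex
% Sketch of a joint ME posterior over word sequences W given acoustics X:
% acoustic and linguistic features enter one log-linear model, so no separate
% language-model scaling factor has to be tuned at decoding time.
P_{\Lambda}(W \mid X) =
  \frac{\exp\big(\sum_a \lambda_a f_a(X, W) + \sum_l \lambda_l f_l(W)\big)}
       {\sum_{W'} \exp\big(\sum_a \lambda_a f_a(X, W') + \sum_l \lambda_l f_l(W')\big)} .
```

Decoding then selects $\hat{W} = \arg\max_W P_{\Lambda}(W \mid X)$, in contrast to the conventional plug-in rule that multiplies an acoustic likelihood by a separately weighted language model score.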
Finally, regarding the issue of data sparseness, we present a new Dirichlet class language model (DCLM), which projects the sequence of history words onto a latent class space and calculates a marginal likelihood over the class uncertainties, expressed by Dirichlet priors. A Bayesian class-based language model is thus established, and a variational Bayesian procedure is presented for estimating the DCLM parameters. Furthermore, the long-distance class information is continuously updated from large-span history words and dynamically incorporated into the class mixtures of a cache DCLM. This long-distance information is embedded in the n-gram model to improve speech recognition performance.
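As a rough illustration of the Bayesian class construction, the predictive probability of a class-based model with a Dirichlet prior over class proportions can be written as below; the mapping from the history $h$ to the Dirichlet parameters $\alpha(h)$ and the class-conditional probabilities $p(w \mid c)$ are placeholders for whatever parameterization the DCLM actually uses.

```latex
% Class uncertainty is integrated out against a Dirichlet prior whose parameters
% depend on the history h; the integral has the usual closed form via the
% Dirichlet mean E[theta_c] = alpha_c / sum_{c'} alpha_{c'}.
p(w_t \mid h) = \int \sum_{c=1}^{C} p(w_t \mid c)\, \theta_c \,
                \mathrm{Dir}\big(\theta \mid \alpha(h)\big)\, d\theta
              = \sum_{c=1}^{C} p(w_t \mid c)\,
                \frac{\alpha_c(h)}{\sum_{c'=1}^{C} \alpha_{c'}(h)} .
```

Variational Bayes then maximizes a lower bound on the marginal likelihood of the training corpus to learn these quantities; in the cache variant, the class statistics accumulated from the large-span history keep updating the mixture on the fly.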
To evaluate the proposed methods, we conducted several recognition tasks on the MATBN, WSJ, and TDT2 speech corpora. Related methods from the literature were implemented for comparison. The proposed solutions achieved significant improvements over the other methods in terms of both model perplexity and speech recognition accuracy, confirming the benefits and effectiveness of this study for speech recognition applications.
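For reference, the perplexity reported in the evaluation is the standard test-set perplexity; a lower value means the model assigns higher probability to the held-out text.

```latex
% Standard test-set perplexity over N words w_1,...,w_N with histories h_1,...,h_N.
\mathrm{PP} = \exp\Big(-\frac{1}{N} \sum_{t=1}^{N} \ln p(w_t \mid h_t)\Big) .
```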