| Graduate student: | 趙子賢 Chao, Tzu-Hsien |
|---|---|
| Thesis title: | 具信賴性斷點偵測之語音搜尋演算法 LVCSR Search Algorithm Using Reliable Change Point Detection |
| Advisor: | 簡仁宗 Chien, Jen-Tzung |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2006 |
| Academic year of graduation: | 94 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 95 |
| Keywords (Chinese): | 連串檢定、震盪現象、無母數統計、斷點偵測、大詞彙連續語音辨識 |
| Keywords (English): | Vibration, LVCSR, Change Point Detection, Non-parametric Statistics, Run Test |
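Automatic speech recognition based on dynamic programming and hidden Markov models is the core technology behind current large vocabulary continuous speech recognition systems. However, dynamic programming search over continuous speech signals faces many difficulties, including unstable change point detection caused by environmental noise and by co-articulation in continuous speech. Providing fast and reliable change point detection while raising the recognition rate has therefore become a key research topic. Traditionally, using the likelihood ratio as a confidence measure can improve the reliability of change point detection. But because co-articulation is severe in spontaneous speech, the likelihood ratio becomes ambiguous and vibrates at the boundaries between different acoustic units. This distorts the alignment between speech data and acoustic models, makes the training of speech parameters inaccurate, and degrades recognition results. Although change point detection methods for speech exist in the literature, they often require an empirical threshold for the detection decision and do not directly address the vibration problem at the boundaries.
In this thesis, we propose using the run test, a statistical test of whether a sequence of values is random, as the basis for change point detection. The run test can effectively examine the randomness of the vibrating states and find the best change point. Combined with the theory of likelihood ratio testing between acoustic models, this technique yields a novel search algorithm for speech recognition. In the experiments, we use the TDT2 broadcast news corpus to verify the performance of this method on Mandarin large vocabulary continuous speech recognition.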
State-of-the-art automatic speech recognition (ASR) systems are built on dynamic programming and hidden Markov models (HMMs). Several issues stand in the way of good ASR performance. Among them, reliably detecting change points in continuous speech under strong co-articulation and in distorting environments plays a critical role. In the literature, likelihood-ratio (LR) based confidence measures have been developed to improve detection performance. The LR criterion can be used to accept or reject the alignment between speech frames and acoustic models or units. However, for spontaneous speech, the probabilistic scores in some intervals vibrate and become ambiguous. This causes unreliable alignment during search in large vocabulary continuous speech recognition (LVCSR). Earlier methods detect change points at the HMM state level, but they require an empirically specified detection threshold and do not directly address the vibration problem at the boundaries of speech units.
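The accept-or-reject use of the LR criterion can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes scalar frames and single-Gaussian models given as `(mean, variance)` pairs, whereas real acoustic models are HMM states over multi-dimensional features. The names `gaussian_logpdf`, `frame_llr`, `accept_alignment`, and the `threshold` parameter are hypothetical.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a scalar Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def frame_llr(x, target, competing):
    """Log-likelihood ratio of one frame: target model versus competing model.

    Both models are (mean, variance) tuples; this is an illustrative assumption.
    """
    return gaussian_logpdf(x, *target) - gaussian_logpdf(x, *competing)


def accept_alignment(frames, target, competing, threshold=0.0):
    """Accept the alignment of frames to the target model when the average
    log-likelihood ratio reaches the (empirically chosen) threshold."""
    avg = sum(frame_llr(x, target, competing) for x in frames) / len(frames)
    return avg >= threshold, avg


if __name__ == "__main__":
    target, competing = (0.0, 1.0), (3.0, 1.0)
    print(accept_alignment([0.1, -0.2, 0.3], target, competing))
    print(accept_alignment([2.9, 3.1, 3.0], target, competing))
```

The `threshold` argument is exactly the kind of empirically set value that the abstract identifies as a weakness of threshold-based detection.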
In this thesis, we present a run test approach that tests the randomness of the states of the probabilistic decision scores over the observed speech sequence. A non-parametric statistic is calculated and used to determine the optimal change point, namely the one with the best randomness for the states before and after it. By combining this principle with the LR criterion, we sequentially detect change points and build an LVCSR search algorithm. In the experiments, we implement and evaluate this approach on the TDT2 Mandarin broadcast news corpus.
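For readers unfamiliar with the run test, the sketch below computes the standard Wald-Wolfowitz runs statistic on a two-valued sequence. It illustrates only the statistic: the abstract does not say how the thesis turns it into a change point decision, so the idea of taking the sign of per-frame scores as the two-valued "state" is an assumption made here, and `runs_test` is a hypothetical helper name.

```python
import math


def runs_test(states):
    """Wald-Wolfowitz runs test on a sequence of booleans.

    Returns (number_of_runs, z), where z is the normal approximation of the
    deviation of the observed run count from its expectation under randomness.
    """
    n1 = sum(1 for s in states if s)
    n2 = len(states) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("both state values must occur at least once")
    n = n1 + n2
    # A new run starts wherever two neighbouring states differ.
    runs = 1 + sum(1 for a, b in zip(states, states[1:]) if a != b)
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n * n * (n - 1.0))
    if var <= 0.0:
        raise ValueError("sequence too short for the normal approximation")
    return runs, (runs - mean) / math.sqrt(var)


if __name__ == "__main__":
    # Assumed input: signs of per-frame decision scores (hypothetical values).
    scores = [1.2, 0.8, 1.5, 0.9, 1.1, -0.7, -1.3, -0.9, -1.0, -1.4]
    states = [s > 0 for s in scores]
    print(runs_test(states))
```

A sequence with one clean switch produces few runs and a negative z, while a sequence that flips on every frame produces many runs and a positive z; the latter is the kind of oscillation the abstract calls vibration.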