| Graduate student: | 林俊宇 Lin, Chun-Yu |
|---|---|
| Thesis title: | 應用隱含式語意索引與語言模型於中英夾雜語音之語言鑑別 Language Identification of Language-Mixed Speech Using Latent Semantic Indexing and Language Model |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of publication: | 2002 |
| Academic year of graduation: | 90 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 64 |
| Chinese keywords: | 語言鑑別 (language identification), 隱含式語意索引 (latent semantic indexing), 語言模型 (language model) |
| Foreign-language keywords: | Language Identification, Language Model, Latent Semantic Indexing |
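With the global exchange of information and the convenience of modern communication, human-machine interfaces capable of handling multiple languages are increasingly important. Faced with different languages, or even with language mixing, a dialogue system must determine which language the user is speaking before it can proceed to speech recognition. Current research on language identification mostly focuses on utterances spoken in a single language, and the architectures fall roughly into Gaussian mixture models, single-language phone recognition, and parallel phone recognition followed by language models.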
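This thesis proposes a flexible and efficient front-end detection mechanism to handle language mixing within a single utterance. The study focuses on the following: 1) using the Bayesian information criterion to split the speech into segments according to variation in its acoustic characteristics; 2) for segments of different types, applying the latent semantic indexing concept to the discriminative parameters and training a separate Gaussian mixture model for each; 3) integrating a vector-quantization-based bigram language model to strengthen segment-level language identification; and 4) finally, applying a linear filter and dynamic programming, respectively, to smooth the decisions over the whole utterance and to detect the language boundary points.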
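For the experiments, the collected data comprised 5,304 Mandarin-English mixed utterances (3 speakers), 500 Mandarin and 500 English utterances of 3 to 5 seconds (5 speakers, Database 1), and 250 Mandarin and 250 English utterances of about 15 seconds (5 speakers, Database 2); 80% served as training data and the remaining 20% as test data. The results show that under language mixing the segment-level language identification rate reaches 74% and the F value for language boundary detection reaches 0.62. For single-language identification, Database 1 and Database 2 reach identification rates of 0.79 and 0.90, respectively, a clear improvement in identification rate over the other methods compared.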
As information exchange and communication become global, human-machine interfaces that can process multiple languages, distinguish between them, and provide interconnected services are increasingly important. In multilingual spoken-language and dialogue applications, handling multi-language or mixed-language input is crucial for speech recognition. Recent research on automatic language identification (LID) has tried to keep pace with this growing demand from applications. These approaches mostly address the task of determining the language of an utterance spoken in a single language. From a framework viewpoint, they can be categorized by how they build language-dependent or language-independent recognizers: Gaussian mixture modeling, single-language phone recognition, or parallel phone recognition followed by language modeling.
This thesis proposes a flexible and efficient front-end architecture for language identification that segments speech and detects mixed languages within a single utterance. More specifically, the study focuses on: 1) adopting the Bayesian information criterion (BIC) with language-dependent acoustic features to divide the input utterance into acoustically homogeneous segments; 2) proposing a feature-discriminative, language-dependent GMM based on the Latent Semantic Indexing approach to measure the strength of each language in every segment; 3) integrating a VQ-based bigram language model into a MAP-based decision mechanism for language identification; and 4) applying linear filtering and dynamic programming for smoothing and precise language boundary estimation.
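Step 1, BIC-based segmentation, can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes a one-dimensional feature per frame, a single Gaussian per segment, and a search for only one change point. The names `delta_bic`, `best_change_point`, `penalty_weight`, and `margin` are hypothetical and chosen only for this example. The idea is to compare modeling a window with one Gaussian against modeling it with two Gaussians split at a candidate frame, and to accept the split when the likelihood gain exceeds the BIC penalty.

```python
import math


def log_variance(values):
    """Log of the maximum-likelihood variance, floored to avoid log(0)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return math.log(max(var, 1e-12))


def delta_bic(frames, split, penalty_weight=1.0):
    """Delta-BIC for splitting a 1-D feature sequence at index `split`.

    A positive value favours two separate Gaussians over a single one,
    which suggests an acoustic change at `split`.
    """
    n = len(frames)
    left, right = frames[:split], frames[split:]
    gain = 0.5 * (n * log_variance(frames)
                  - len(left) * log_variance(left)
                  - len(right) * log_variance(right))
    # A 1-D Gaussian has two free parameters (mean and variance), and the
    # two-segment hypothesis adds one extra Gaussian.
    penalty = penalty_weight * 0.5 * 2 * math.log(n)
    return gain - penalty


def best_change_point(frames, margin=10, penalty_weight=1.0):
    """Return (index, delta_bic) of the strongest change point, or None."""
    candidates = [(delta_bic(frames, t, penalty_weight), t)
                  for t in range(margin, len(frames) - margin)]
    if not candidates:
        return None
    score, index = max(candidates)
    return (index, score) if score > 0 else None


# Toy sequence whose mean jumps at frame 100 (illustrative data only).
frames = [0.0, 1.0] * 50 + [10.0, 11.0] * 50
print(best_change_point(frames)[0])  # → 100
```

With multidimensional features the variance terms would become log-determinants of covariance matrices and the penalty would grow with the feature dimension. Repeating the search inside each resulting segment is one common way to obtain several boundaries.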
To evaluate the proposed approach, we collected 5,304 Mandarin-English mixed utterances (3 male speakers), 500 single-language utterances of 3 to 5 seconds for each of Mandarin and English (Database 1), and 250 single-language utterances of about 15 seconds for each language (Database 2). Of the corpus, 80% was used for training and the remaining 20% for testing. Experimental results show that the proposed mixed-language decision mechanism achieved 74% accuracy and that the F value for language boundary detection was 0.62. The LID rates for Database 1 and Database 2 were 0.79 and 0.90, respectively. The proposed architecture outperforms other well-established approaches. This study is aimed at multilingual speech recognition.
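The F value reported for boundary detection is the harmonic mean of precision and recall. The abstract does not say how a detected boundary is matched to a reference boundary, so the sketch below assumes a simple rule: a detection counts as correct if it lies within a fixed tolerance of a reference boundary that has not already been matched. The function name `boundary_f_measure`, the 0.1-second tolerance, and the sample times are hypothetical.

```python
def boundary_f_measure(detected, reference, tolerance=0.1):
    """Precision, recall and F value for boundary times given in seconds.

    Each reference boundary can be matched by at most one detected
    boundary lying within `tolerance` seconds of it (greedy matching).
    """
    unmatched = sorted(reference)
    hits = 0
    for boundary in sorted(detected):
        match = next((r for r in unmatched
                      if abs(r - boundary) <= tolerance), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Illustrative times only: 2 of 3 detections match 2 of 4 references.
p, r, f = boundary_f_measure([1.02, 2.50, 3.90], [1.00, 2.00, 3.95, 5.00])
print(round(p, 3), round(r, 3), round(f, 3))  # → 0.667 0.5 0.571
```

In this toy case precision is 2/3 and recall is 2/4, giving F = 4/7. A looser tolerance would raise both figures, so an F value is only comparable across systems scored with the same matching rule.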