| Graduate Student: | 沈涵平 Shen, Han-Ping |
|---|---|
| Thesis Title: | 多語與腔調語音辨識之研究 A Study on Multilingual and Accented Speech Recognition |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | 博士 Doctor |
| Department: | 電機資訊學院 - 資訊工程學系 Department of Computer Science and Information Engineering |
| Year of Publication: | 2014 |
| Academic Year of Graduation: | 102 |
| Language: | English |
| Number of Pages: | 118 |
| Chinese Keywords: | 語言轉換、以狀態為單位之模型、隱含式語言空間模型、差分貝氏資訊準則、語音辨識、發音特徵、模型集建立、腔調語音、雙語語音辨識、多語語音辨識、世界英文、語者發音聚類、發音結構分析、支援向量迴歸 |
| English Keywords: | code-switching, latent language space model, delta-BIC, speech recognition, articulatory features, phone set construction, accented speech, bilingual speech recognition, multilingual speech recognition, world Englishes, speaker-based pronunciation clustering, pronunciation structure analysis, support vector regression |
Driven by globalization, using multiple languages within a single conversation has become an inevitable trend. As the demand for multilingual applications keeps growing, multilingual speech recognizers are becoming increasingly important. In multilingual speech recognition, accents carried in the utterances severely degrade recognition accuracy. In addition, code-switching inevitably occurs in multilingual communication and likewise substantially reduces recognizer performance, so a corpus containing code-switching utterances becomes essential. This dissertation therefore first constructs a Chinese-English code-switching speech corpus.
To address the recognition problems caused by code-switching, code-switching detection is applied to improve recognition performance on code-switched utterances. This dissertation adopts an approach based on the latent language space model (LLSM) and the delta Bayesian information criterion (delta-BIC) for code-switching detection. An LLSM is built to characterize each language, and the similarity between the LLSM constructed from the test speech and each language-specific LLSM is computed. Euclidean distance-based and cosine angle-based similarity measures are combined to obtain the overall similarity.
This dissertation also proposes a data-driven approach to constructing a Chinese-English phone set for speech recognition. Acoustic features and cross-lingual articulatory features are integrated to compute the distances between triphone acoustic models. The articulatory features are extracted by a deep neural network and help alleviate the data sparseness problem commonly faced by context-dependent triphone models. The triphones appearing in the corpus are then hierarchically clustered according to the acoustic and articulatory features.
Multilingual recognition also frequently faces accent-related problems, and how to model accents is a challenge. This dissertation generates Mandarin-accented English models to recognize utterances produced by speakers whose mother tongue is Mandarin. A verification procedure for highly accented speech segments is used to extract such segments from the corpus automatically. A transformation function and a decision tree are then used to generate the Gaussian components of the highly accented models from the native acoustic models, thereby addressing the data sparseness problem. Finally, a discrimination function evaluates the discriminability of the generated accented models; models with low discriminability are discarded, and the retained accented models are used together with the native pronunciation models to recognize Chinese-English utterances.
Furthermore, English is the language most often used in multilingual communication; however, because of the influence of their mother tongues, speakers from different regions speak English with different accents. This dissertation clusters speakers according to their pronunciation accents, and based on the clustering result a suitable accent-specific recognizer can be selected to recognize accented speech. Invariant pronunciation structure analysis and support vector regression are combined to predict the pronunciation distance between speakers.
All of the above methods were implemented in this dissertation, and the experimental results show that the proposed approaches improve the recognition performance for multilingual, code-switching, and accented speech.
Due to globalization, multilingual communication is becoming increasingly common. With the growth of multilingual speech data and the demand for a wide range of applications, the development of multilingual automatic speech recognition (ASR) systems has become increasingly important. For an ASR system, accents produced by non-native speakers dramatically deteriorate recognition performance. Moreover, code-switching, the phenomenon of switching languages within a conversation, is easily found in multilingual communities and also seriously degrades ASR accuracy. Thus, the design and development of a code-switching speech database for ASR training is highly desirable. This dissertation first presents the procedure for the design and development of a Chinese-English code-switching speech database.
To overcome the recognition problems caused by code-switching, code-switching event detection can be used to improve the recognition accuracy of ASR. This dissertation presents a new paradigm for code-switching event detection based on latent language space models (LLSMs) and the delta Bayesian information criterion (delta-BIC). An LLSM is proposed to characterize a language by modeling the spatial relationships of the senones/articulatory features in the eigenspace spanned by the PCA-transformed features. The language likelihood between the input-speech LLSM and each language-dependent LLSM is estimated based on Euclidean distance-based and cosine angle distance-based similarities.
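As a rough illustration of the similarity computation described above, the following sketch projects toy frame vectors into a PCA eigenspace and fuses a Euclidean distance-based similarity with a cosine angle-based similarity. The function names, the projection dimensionality, and the fusion weight `alpha` are illustrative assumptions; the dissertation's actual LLSM construction and likelihood estimation may differ.

```python
import numpy as np

def pca_project(X, n_components=8):
    """Project feature vectors onto the top principal components (illustrative)."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data yields the principal directions.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

def combined_similarity(u, v, alpha=0.5):
    """Fuse a Euclidean-distance-based and a cosine-angle-based similarity.

    alpha is an assumed fusion weight; the dissertation's weighting and
    overall likelihood computation may differ.
    """
    euclid_sim = 1.0 / (1.0 + np.linalg.norm(u - v))
    cosine_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return alpha * euclid_sim + (1.0 - alpha) * cosine_sim

# Toy usage: compare a test representation against two language-dependent
# representations in the projected eigenspace.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 39))          # MFCC-like toy frames
space = pca_project(frames)
test_vec, lang_a, lang_b = space[0], space[1], space[2]
print(combined_similarity(test_vec, lang_a), combined_similarity(test_vec, lang_b))
```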
This dissertation also proposes a data-driven approach to phone set construction for code-switching ASR. Acoustic features and context-dependent cross-lingual articulatory features (AFs) are incorporated into the estimation of the distance between triphone units. The AFs, extracted using a deep neural network, are used for code-switching articulation modeling to alleviate the data sparseness problem faced by context-dependent triphone models. The triphones are finally clustered hierarchically to obtain a Chinese-English phone set.
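The sketch below shows one way such a data-driven construction could be organized: a combined acoustic/AF distance between triphones feeds a hierarchical clustering that merges them into shared units. The equal weighting `w_af`, the plain Euclidean distances, the average linkage, and the target cluster count are assumptions for illustration only, not the dissertation's actual distance measure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def triphone_distance(acoustic_a, acoustic_b, af_a, af_b, w_af=0.5):
    """Combine an acoustic distance and an articulatory-feature (AF) distance
    between two triphone representations (weights are assumed)."""
    d_ac = np.linalg.norm(acoustic_a - acoustic_b)
    d_af = np.linalg.norm(af_a - af_b)
    return (1.0 - w_af) * d_ac + w_af * d_af

# Toy data: one acoustic vector and one AF posterior vector per triphone.
rng = np.random.default_rng(1)
n_triphones = 40
acoustic = rng.normal(size=(n_triphones, 39))
af = rng.random(size=(n_triphones, 12))

# Build a symmetric distance matrix and cluster hierarchically.
dist = np.zeros((n_triphones, n_triphones))
for i in range(n_triphones):
    for j in range(i + 1, n_triphones):
        dist[i, j] = dist[j, i] = triphone_distance(acoustic[i], acoustic[j], af[i], af[j])

Z = linkage(squareform(dist), method="average")
phone_set = fcluster(Z, t=10, criterion="maxclust")  # merge triphones into 10 shared units
print(phone_set)
```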
Multilingual speech recognition is also confronted with accent-related problems caused by non-native speech, whose acoustic properties are highly divergent. This dissertation generates highly Mandarin-accented English models for speakers whose mother tongue is Mandarin. A verification method is proposed to extract highly accented speech segments automatically. The Gaussian components of the highly accented models are then generated from the corresponding Gaussian components of the native models using a linear transformation function and a decision tree to deal with the data sparseness problem. A discrimination function is further applied to verify the generated accented acoustic models; models with low discriminability are discarded, and the retained accented models are used together with the native models for recognition.
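To make the transformation idea concrete, here is a minimal sketch that fits a single affine transform from native Gaussian means to observed accented means by least squares and then applies it to native means lacking accented data. The single global transform, the synthetic data, and the function names are assumptions; the dissertation additionally uses a decision tree to select transforms and a discrimination function to verify the generated models.

```python
import numpy as np

def estimate_affine_transform(native_means, accented_means):
    """Least-squares estimate of an affine map y ≈ W x + b from native to
    accented Gaussian means (illustrative only)."""
    X = np.hstack([native_means, np.ones((native_means.shape[0], 1))])  # append bias column
    W_aug, *_ = np.linalg.lstsq(X, accented_means, rcond=None)
    return W_aug[:-1].T, W_aug[-1]  # W of shape (d, d), b of shape (d,)

def generate_accented_means(W, b, unseen_native_means):
    """Generate accented Gaussian means for units with no accented training data."""
    return unseen_native_means @ W.T + b

# Toy usage with synthetic means.
rng = np.random.default_rng(2)
d = 13
seen_native = rng.normal(size=(50, d))
true_W, true_b = np.eye(d) * 0.9, rng.normal(size=d) * 0.1
seen_accented = seen_native @ true_W.T + true_b + rng.normal(scale=0.01, size=(50, d))

W, b = estimate_affine_transform(seen_native, seen_accented)
unseen_native = rng.normal(size=(5, d))
print(generate_accented_means(W, b, unseen_native).shape)  # (5, 13)
```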
Furthermore, English is the language most commonly used by multilingual speakers, yet, influenced by their mother tongues, speakers from different regions pronounce English with different accents. This dissertation creates a global pronunciation map of World Englishes. Successful clustering of accented English benefits speech recognition, since a suitable accent-specific model can be selected for recognition. This dissertation combines invariant pronunciation structure analysis and support vector regression (SVR) to predict inter-speaker pronunciation distances for clustering.
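The following sketch outlines the SVR-then-cluster pipeline in a schematic way: an SVR is trained to predict a pairwise pronunciation distance, a predicted distance matrix is assembled for a small set of speakers, and accents are grouped with agglomerative clustering on the precomputed distances. The pair features, the surrogate training targets, and the number of clusters are placeholders; the dissertation derives its inputs from invariant pronunciation structure analysis.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.cluster import AgglomerativeClustering  # requires scikit-learn >= 1.2 for metric=

rng = np.random.default_rng(3)
n_pairs, feat_dim = 300, 20
pair_features = rng.normal(size=(n_pairs, feat_dim))   # stand-in features per speaker pair
pair_distances = np.abs(pair_features).mean(axis=1)    # surrogate distance targets

# Train the regressor that predicts inter-speaker pronunciation distance.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.01)
svr.fit(pair_features, pair_distances)

# Predict a full inter-speaker distance matrix and cluster the speakers.
n_speakers = 8
dist = np.zeros((n_speakers, n_speakers))
for i in range(n_speakers):
    for j in range(i + 1, n_speakers):
        pair = rng.normal(size=(1, feat_dim))           # stand-in pair features
        dist[i, j] = dist[j, i] = max(svr.predict(pair)[0], 0.0)

clusters = AgglomerativeClustering(
    n_clusters=3, metric="precomputed", linkage="average"
).fit_predict(dist)
print(clusters)
```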
All of these methods were implemented, and the experimental results show that the proposed approaches improve the performance of multilingual, code-switching, and accented speech recognition.
On-campus access: released 2017-06-23.