
Graduate Student: Shih, Po-Yi (施伯宜)
Thesis Title: 強健的多語者與具雜訊語音辨識之研究與應用 (Research and Application of Robust Several-Speaker and Noisy Speech Recognition)
Advisor: Wang, Jhing-Fa (王駿發)
Degree: Doctor
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2016
Graduation Academic Year: 105
Language: English
Number of Pages: 106
Keywords: harmonic-based robust voice activity detection, enhanced lengthening cancellation, several-speaker adaptation speech recognition network, prosodic-contextual post-processing, real-time speech-driven talking face system, improved human-machine interactive flow design
Automatic speech recognition (ASR) systems have gradually matured; in particular, large-vocabulary continuous speech recognition has shown its convenience in daily life, for example the Google voice search and Apple Siri voice assistant commonly used on mobile phones. A practical speech recognition system is not just its recognition network core: it also includes capturing the original speech signal and comparing and verifying the recognition results. In addition, the influence of the external environment and the differences between users must be handled; for a practical, highly robust recognition system, every stage of processing is important. This dissertation therefore studies the complete speech recognition system and architecture and proposes corresponding improvements.
In speech signal pre-processing, the challenge is to distinguish speech from non-speech segments in a continuous, noise-contaminated audio stream and to obtain accurate speech segments and speech parameters. This dissertation proposes a harmonic-based robust voice activity detection (H-RVAD) algorithm that raises the accuracy of the recognition results, reduces potential mis-recognition errors, and lowers the computational complexity in various noise scenarios. In addition, lengthening is a common durational characteristic of spontaneous speech; observations indicate that the duration of a phonetic segment depends on whether the speaker is hesitating over the words to express. This dissertation therefore also proposes an enhanced lengthening cancellation method based on bidirectional pitch-similarity alignment to improve the recognition results.
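As a rough illustration of the harmonic cue behind H-RVAD, the sketch below scores each frame by how much of its spectral local-peak energy lies on a harmonic grid and labels frames above a threshold as speech. The F0 search range, the 3% harmonic tolerance, and the 0.5 threshold are assumptions for illustration, not the dissertation's actual feature definition or settings.

```python
import numpy as np

def harmonic_peak_score(frame, sr, f0_min=80.0, f0_max=400.0):
    """Score how strongly a frame's spectral local peaks line up on a harmonic grid."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # local peaks: bins larger than both neighbours
    peaks = np.where((spec[1:-1] > spec[:-2]) & (spec[1:-1] > spec[2:]))[0] + 1
    if len(peaks) == 0:
        return 0.0
    best = 0.0
    for f0 in np.arange(f0_min, f0_max, 10.0):        # coarse F0 sweep
        harmonics = np.arange(f0, sr / 2, f0)
        # energy of peaks lying within ~3% of a harmonic of this F0
        hits = [spec[p] for p in peaks
                if np.min(np.abs(harmonics - freqs[p])) < 0.03 * freqs[p] + 1.0]
        best = max(best, sum(hits) / (np.sum(spec[peaks]) + 1e-9))
    return best                                        # near 1.0 for voiced speech, near 0 for noise

def simple_harmonic_vad(signal, sr, frame_len=400, hop=160, threshold=0.5):
    """Label each frame speech (True) or non-speech (False) by its harmonic peak score."""
    return [harmonic_peak_score(signal[start:start + frame_len], sr) > threshold
            for start in range(0, len(signal) - frame_len, hop)]
```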
For the problem of a single speech recognition system serving several speakers, the common practice today is to adapt a single speaker-independent acoustic model (speaker adaptation). However, as the number of speakers grows, adapting a single acoustic model while switching between speakers leads to negative adaptation, and the problem worsens the longer the system is used (an unreliable speaker adaptation mechanism). This dissertation therefore proposes an adaptation mechanism that combines speaker identification with multiple speaker-specific acoustic models (multi-speaker adaptation) to solve the unreliable-adaptation problem.
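The idea of routing each utterance to its own speaker-specific model can be sketched as follows. This is a simplified stand-in that uses scikit-learn GMM likelihoods for speaker identification rather than the dissertation's MCSVM classifier, and it leaves the MLLR adaptation itself to a placeholder; `recognize_fn`, `adapt_fn`, and `base_model` are hypothetical names.

```python
import copy
from sklearn.mixture import GaussianMixture

class MultiSpeakerRecognizer:
    """Keep one adapted acoustic model per enrolled speaker and route each
    utterance to the best-matching speaker first, so users never overwrite
    each other's adaptation statistics (avoiding negative adaptation)."""

    def __init__(self):
        self.speaker_gmms = {}      # speaker id -> GMM over acoustic feature frames
        self.acoustic_models = {}   # speaker id -> speaker-adapted recognizer state

    def enroll(self, speaker_id, feature_frames, base_model):
        gmm = GaussianMixture(n_components=8, covariance_type='diag')
        gmm.fit(feature_frames)
        self.speaker_gmms[speaker_id] = gmm
        # each speaker starts from a private copy of the speaker-independent model
        self.acoustic_models[speaker_id] = copy.deepcopy(base_model)

    def identify(self, feature_frames):
        # pick the enrolled speaker whose GMM gives the highest average log-likelihood
        return max(self.speaker_gmms,
                   key=lambda spk: self.speaker_gmms[spk].score(feature_frames))

    def recognize_and_adapt(self, feature_frames, recognize_fn, adapt_fn):
        spk = self.identify(feature_frames)
        model = self.acoustic_models[spk]
        hypothesis = recognize_fn(model, feature_frames)
        # adapt only this speaker's copy (e.g. via MLLR), never the shared baseline
        self.acoustic_models[spk] = adapt_fn(model, feature_frames, hypothesis)
        return spk, hypothesis
```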
To raise recognition accuracy and sift the candidate sentences from the recognition results, this dissertation also proposes a prosodic-contextual post-processing algorithm: the pitch of the current character and the pitch variation of its syllable are compared with the expected lexical tone, and the decision is made according to the contextual similarity.
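A minimal sketch of the prosodic-analysis step follows, assuming a per-syllable F0 contour (for example from an autocorrelation pitch tracker) is already available; the four-way tone heuristic below is hypothetical and much cruder than the dissertation's analysis.

```python
import numpy as np

def classify_tone(contour):
    """Very rough Mandarin tone guess from an F0 contour (Hz values, one per frame)."""
    contour = np.asarray(contour, dtype=float)
    span = contour.max() - contour.min()
    if span < 0.05 * contour.mean():
        return 1                                   # tone 1: roughly level
    dip = int(np.argmin(contour))
    if 0 < dip < len(contour) - 1 and contour[-1] > contour[dip]:
        return 3                                   # tone 3: falls then rises
    half = len(contour) // 2
    if contour[half:].mean() > contour[:half].mean():
        return 2                                   # tone 2: rising
    return 4                                       # tone 4: falling

# example: an F0 contour that falls then rises is classed as tone 3
print(classify_tone([220.0, 210.0, 200.0, 205.0, 230.0]))   # -> 3
```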
This dissertation also presents two applications of the improved speech recognition system that enhance the user experience in human-machine interaction. One is a real-time speech-driven talking face system, which offers low computational complexity and visualizes the speech interface smoothly. A novel embedded confusable system is proposed to generate phoneme groupings and construct an efficient phoneme-viseme mapping table; the generated table simplifies the mapping problem and improves viseme classification accuracy. The system includes: 1) SNR-aware speech enhancement for noise reduction and ICA-based extraction of robust acoustic feature vectors; 2) recognition network processing using HMM and MCSVM; 3) visual processing, which arranges viseme lip-shape images in time order and uses dynamic alpha blending with varying alpha values for greater realism. The other application is an improved human-machine interaction flow design that adds a wake-up and confirmation mechanism: speech other than the predefined voice commands is ignored, which prevents voiced noise from misleading the recognition module and reduces the chance of entering wrong input or executing a misrecognized result.
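The visual-processing step can be pictured with the toy sketch below: a hand-written phoneme-viseme table (purely illustrative; the dissertation derives its table from the embedded confusable system) plus alpha blending between consecutive lip-shape images, with an assumed energy-driven bend of the alpha schedule standing in for the dynamic alpha settings.

```python
import numpy as np

# illustrative grouping only, not the table produced by the confusable system
PHONEME_TO_VISEME = {'a': 'open', 'i': 'spread', 'u': 'round',
                     'b': 'closed', 'p': 'closed', 'm': 'closed', 'f': 'teeth'}

def blend(prev_img, next_img, alpha):
    """Alpha-blend two lip-shape images (uint8 arrays of the same size)."""
    return ((1.0 - alpha) * prev_img + alpha * next_img).astype(np.uint8)

def render_transition(prev_img, next_img, n_frames, energy=1.0):
    """Generate in-between frames; the alpha curve is bent by a speech-energy
    term (an assumption here) so louder, faster transitions reach the target
    viseme sooner."""
    return [blend(prev_img, next_img,
                  min(1.0, (k / n_frames) ** (1.0 / max(energy, 1e-3))))
            for k in range(1, n_frames + 1)]
```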

Today, automatic speech recognition (ASR) systems are maturing; in particular, large-vocabulary continuous speech recognition (LVCSR) has been applied in daily life and has shown its convenience, for example in Google voice search and the Apple Siri voice assistant. Many studies have proposed ever more complex and accurate methods for recognition network processing; however, a robust and useful ASR system focuses not only on the recognition network but also on obtaining and sampling the original speech signal in the pre-processing stage and on comparing and verifying the recognized results in the post-processing stage. Moreover, changing noise environments and different users must be considered in a robust ASR application. Therefore, this dissertation studies a speech recognition system that is robust to noise and to multiple speakers and proposes the corresponding improvements.
In speech signal pre-processing, the challenge is to obtain accurate speech segments from a continuous audio sequence, to parameterize the input audio stream, and to discriminate whether each part of the stream belongs to a speech or a non-speech segment. A Harmonic-based Robust Voice Activity Detection (H-RVAD) algorithm is therefore proposed, which improves the accuracy of the recognized results while reducing potential mis-recognition errors and computational complexity in various noise scenarios. Lengthening is a durational characteristic of phonetic segments in spontaneous speech; observations indicate that the duration of a phonetic segment depends on whether the speaker is hesitating about the words to express. Thus, an enhanced lengthening cancellation using bidirectional pitch similarity alignment is proposed in this study. For robustness to several speakers, current adaptive mechanisms adapt a single acoustic model for each speaker in a speaker-independent speech recognition system. However, as more users share the same recognizer, single-model adaptation leads to negative adaptation upon switching between users; such a situation is problematic (undependable adaptation). Considering the situation of a smart home or an office with several staff members, this dissertation presents speaker-specific acoustic model adaptation based on a multi-model mechanism to solve the problem of undependable adaptation.
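A compressed sketch of the lengthening idea follows: frame pitch from a plain autocorrelation tracker, then marking regions where adjacent pitch windows are nearly identical by cosine similarity. The window length and the 0.999 threshold are illustrative assumptions, and the dissertation's bidirectional contour alignment is not reproduced here.

```python
import numpy as np

def frame_pitch(frame, sr, f_lo=80, f_hi=400):
    """Plain autocorrelation pitch estimate (Hz) for one speech frame."""
    frame = frame - np.mean(frame)
    corr = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(sr / f_hi), int(sr / f_lo)
    return sr / (lo + np.argmax(corr[lo:hi]))

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def lengthening_mask(pitch_contour, win=10, threshold=0.999):
    """Mark frame regions where adjacent pitch windows stay almost identical,
    the signature of a drawn-out hesitation vowel; marked frames can then be
    shortened (cancelled) before recognition."""
    mask = np.zeros(len(pitch_contour), dtype=bool)
    for i in range(len(pitch_contour) - 2 * win):
        if cosine(pitch_contour[i:i + win], pitch_contour[i + win:i + 2 * win]) > threshold:
            mask[i:i + 2 * win] = True
    return mask
```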
To increase the recognition rate and sift the candidate sentences from the recognition results in the post-processing part, a prosodic-contextual post-processing method is proposed. The pitch variation of each syllable is compared with the lexical tone of the corresponding text, and the candidates are scored based on the similarity.
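A minimal re-scoring sketch under these assumptions: the prosodic score is the fraction of syllables whose observed tone matches the candidate's lexical tones, the contextual term is simplified to the recognizer's own normalized score, and the 0.6/0.4 weighting is arbitrary.

```python
def rescore_candidates(candidates, observed_tones, tone_weight=0.6):
    """Re-rank N-best hypotheses (text, expected_tones, asr_score) by mixing a
    tone-match score with the recognizer's own score."""
    best = None
    for text, expected_tones, asr_score in candidates:
        matches = sum(1 for e, o in zip(expected_tones, observed_tones) if e == o)
        prosodic = matches / max(len(expected_tones), 1)
        total = tone_weight * prosodic + (1.0 - tone_weight) * asr_score
        if best is None or total > best[1]:
            best = (text, total)
    return best

# example: two hypotheses with their expected Mandarin tones and ASR scores
nbest = [("吃藥", [1, 4], 0.55), ("遲到", [2, 4], 0.60)]
print(rescore_candidates(nbest, observed_tones=[1, 4]))   # -> ('吃藥', 0.82)
```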
This dissertation also discusses two robust speech recognition system applications that improve the human-machine interactive experience. One is a real-time speech-driven talking face system, which provides low computational complexity and a smooth visual result. A novel embedded confusable system is proposed to generate an efficient phoneme-viseme mapping table constructed by phoneme grouping; the generated mapping table simplifies the mapping problem and improves viseme classification accuracy. The system includes: 1) speech signal pre-processing with SNR-aware speech enhancement for noise reduction and ICA-based extraction of robust acoustic feature vectors; 2) recognition network processing using HMM and MCSVM; 3) visual processing, which arranges the lip-shape images of visemes in time sequence and achieves more authenticity using dynamic alpha blending with varying alpha values. The other application is an improved human-machine interactive flow design that integrates a wake-up and confirmation mechanism. The intended behavior of this mechanism is to ignore any voice except the predefined voice commands, which prevents voiced noise from misleading the speech recognition module.
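A minimal sketch of the wake-up and confirmation flow as a three-state machine; the wake word, command list, and confirmation word here are hypothetical examples.

```python
from enum import Enum

class State(Enum):
    SLEEP = 0       # ignore everything except the wake-up word
    LISTENING = 1   # accept only a command from the predefined list
    CONFIRMING = 2  # require an explicit confirmation before executing

class WakeConfirmFlow:
    """Minimal wake-up + confirmation flow: voiced noise and out-of-list
    utterances can never reach the execution stage."""

    def __init__(self, wake_word, commands):
        self.wake_word = wake_word
        self.commands = set(commands)
        self.state = State.SLEEP
        self.pending = None

    def on_recognized(self, text):
        if self.state is State.SLEEP:
            if text == self.wake_word:
                self.state = State.LISTENING
            return None                        # everything else is ignored
        if self.state is State.LISTENING:
            if text in self.commands:
                self.pending = text
                self.state = State.CONFIRMING
                return f"Did you say '{text}'?"
            return None                        # not a predefined command: ignore
        # CONFIRMING: execute only on an explicit "yes"
        cmd, self.pending = self.pending, None
        self.state = State.SLEEP
        return cmd if text == "yes" else None

flow = WakeConfirmFlow("hello system", ["turn on the light", "play music"])
flow.on_recognized("hello system")
flow.on_recognized("turn on the light")
print(flow.on_recognized("yes"))   # -> turn on the light
```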

Abstract (Chinese) ...... i
Abstract ...... iii
Acknowledgements ...... vi
Contents ...... viii
List of Tables ...... x
List of Figures ...... xi
Chapter 1 Introduction ...... 1
1.1 Motivation ...... 1
1.2 The Contribution of the Dissertation ...... 2
1.3 The Organization of the Dissertation ...... 4
Chapter 2 Robust Speech Signal Pre-Processing ...... 6
2.1 Harmonic-based Robust Voice Activity Detection ...... 6
2.1.1 Background ...... 6
2.1.2 The Proposed Method ...... 9
2.1.3 Robust Harmonic Spectral Local Peak Feature ...... 11
2.1.4 Experiments and Analysis ...... 13
2.2 Enhanced Lengthening Cancellation ...... 18
2.2.1 Background ...... 18
2.2.2 The Proposed Method ...... 19
2.2.3 Pitch Prediction Using Autocorrelation Function ...... 21
2.2.4 Cosine Similarity-based Lengthening Detection ...... 22
2.2.5 Bidirectional Pitch Contour Alignment ...... 23
2.2.6 Experimental Result ...... 24
Chapter 3 Robust Several-Speaker Speech Recognition ...... 27
3.1 Background ...... 27
3.2 Proposed Speech Recognition System with SID and SA ...... 30
3.3 Speaker Training ...... 32
3.3.1 MLLR Full Regression Matrix Estimation ...... 32
3.3.2 Multi-Class Support Vector Machine ...... 33
3.4 Speaker Identification and Speech Recognition ...... 34
3.4.1 Speaker Identification ...... 34
3.4.2 Speech Recognition ...... 35
3.5 Several-Speaker Adaptation ...... 37
3.5.1 ML Coordinate Estimation ...... 38
3.5.2 Adaptation Smoothing ...... 39
3.6 Utterance Verification by Measuring Confidence Score ...... 40
3.7 Experiment ...... 43
3.7.1 Experimental Setup ...... 43
3.7.2 Experimental Results ...... 46
Chapter 4 Robust Speech Recognition Post-Processing ...... 50
4.1 Background ...... 50
4.2 Prosodic-Contextual Post-Processing ...... 51
4.2.1 Prosodic Analysis ...... 51
4.2.2 Contextual Analysis ...... 52
4.3 Experiments ...... 54
4.3.1 Customizable Cloud Healthcare Dialogue System Overview ...... 54
4.3.2 Experimental Result ...... 55
Chapter 5 Robust Speech Recognition System Application ...... 58
5.1 Speech-Driven Talking Face ...... 58
5.1.1 Background ...... 59
5.1.2 SNR-Aware Speech Enhancement ...... 63
5.1.3 ICA-Transformed MFCCs Feature Extraction ...... 65
5.1.4 Confusable Phoneme-Viseme Mapping ...... 67
5.1.5 Experiments ...... 74
5.2 Enhanced Low SNR Speech Recognition System ...... 81
5.2.1 Background ...... 81
5.2.2 Environment Noise Measurement ...... 84
5.2.3 Voice Command Verification ...... 84
5.2.4 Speech Recognition Module Building ...... 86
5.2.5 Evaluation ...... 87
Chapter 6 Conclusion and Future Work ...... 91
6.1 Conclusion ...... 91
6.2 Future Work ...... 95
References ...... 96
Publication List ...... 102


Full text publicly available on campus: 2022-01-12
Full text publicly available off campus: 2022-01-12