| Author: | 顏國郎 Yan, Gwo-Lang | 
|---|---|
| Thesis title: | 口語對話系統中不流暢語音之語音動作型態模型化與驗證之研究 A study on speech act modeling and verification of spontaneous speech with disfluency in a spoken dialogue system | 
| Advisor: | 吳宗憲 Wu, Chung-Hsien | 
| Degree: | Doctor | 
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering | 
| Year of publication: | 2004 | 
| Graduation academic year: | 92 (ROC calendar) | 
| Language: | English | 
| Number of pages: | 109 | 
| Keywords (Chinese): | 填充式停頓 、對話系統 、驗證 、口語語音 、潛在式語意分析 、語音動作型態 、分段式拜氏模組 、不流暢 | 
| Keywords (English): | Filled Pause, Dialog System, Segmental Bayesian Model, Disfluency, Verification, Spontaneous Speech, Latent Semantic Analysis, Speech Act | 
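  Natural dialogue systems play an important role as human-machine interfaces in the current era of information explosion. Through this advanced information technology, people can conveniently use computers to access and use data in daily life. Among technologies in practical use, many dialogue systems, such as air travel information systems, weather reporting systems, automatic switchboard systems and ticket reservation systems, have already demonstrated their effectiveness. In spoken dialogue, however, the problem of disfluent speech and the question of how to model users' intentions must still be solved before dialogue systems can reach truly practical performance.
  The purpose of this study is to improve the recognizer's recognition rate on disfluent speech and the accuracy with which communicative intentions are extracted from spoken dialogue, in order to support the performance of spoken dialogue systems. To this end, the study concentrates on two main topics: feature analysis of disfluent speech and segmentation of the disfluent portions of an utterance; and speech act modeling and verification. The theoretical foundations of the study include pattern recognition, natural language processing, artificial intelligence and multivariate analysis. The specific objectives are to: 1) analyze the features that characterize filled pauses and build a discriminative filled pause model; 2) apply a disfluency detection algorithm based on a segmental Bayesian model and integrate it with the speech recognizer to improve recognition of disfluent speech; 3) develop a speech act model for spoken dialogue to support human-machine interaction; 4) develop a speech act verification model to reduce the risk that arises when a spoken dialogue system accepts erroneous user input.
  The experimental evaluation examines recognition performance on disfluent speech, the accuracy of speech act identification, and the capability of speech act verification. The experiments were carried out on an air travel information dialogue system built with the methods proposed in this study. For disfluency analysis, features were chosen according to the characteristics of filled pauses, and principal component analysis (PCA) and linear discriminant analysis (LDA) were applied to select a smaller number of more expressive features. Gaussian mixture models (GMMs) with discriminative training were used to improve filled pause detection. Finally, the segmental Bayesian model, together with the GMMs, segments the speech into fluent and disfluent portions, and this information is combined with the recognizer, which improves recognition of disfluent speech. For speech act modeling and verification, the study proposes a statistical speech act hidden Markov model (SAHMM) that uses semantic information, syntactic information and fragment classes to identify the speech act of the input speech, and uses an interpolation mechanism to re-estimate the transition probabilities, addressing the scarcity of filled pauses in the corpus. Finally, a Bayesian belief model built on latent semantic analysis verifies the speech act of the input speech. The experimental results show encouraging performance in both speech act identification rate and semantic accuracy, and the proposed strategy also effectively alleviates the disfluency problem in spontaneous speech.
  Future work can extend the analysis of disfluency phenomena and of the structural differences in how sentences express intention. The analyses and results of this study can provide important basic research information for linguists and phoneticians, and can help computer scientists analyze human-machine interaction behavior and develop related human-machine interface technology.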
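
The idea of cutting an utterance into fluent and filled-pause segments from per-frame model scores can be illustrated with a small dynamic-programming sketch. This is not the segmental Bayesian model of the thesis, whose details are not given in this abstract: a single one-dimensional Gaussian per class stands in for each GMM, and the names `segment`, `gaussian_logpdf` and `switch_penalty`, as well as all parameter values, are assumptions made for illustration only.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian (a stand-in for a per-class GMM score)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def segment(frames, fluent=(0.0, 1.0), filled=(3.0, 1.0), switch_penalty=4.0):
    """Split a feature sequence into fluent / filled-pause segments.

    A Viterbi-style search picks the label sequence that maximizes the summed
    per-frame log-likelihood minus a fixed penalty for every class change.
    All parameter values are hypothetical.
    Returns a list of (label, start, end) tuples with `end` exclusive.
    """
    if not frames:
        return []
    labels = ("fluent", "filled_pause")
    params = (fluent, filled)
    # score[k]: best log score of any labeling of the frames so far ending in class k
    score = [gaussian_logpdf(frames[0], *p) for p in params]
    back = []  # back[t][k]: best predecessor class for class k at frame t + 1
    for x in frames[1:]:
        new_score, pointers = [], []
        for k, p in enumerate(params):
            stay = score[k]
            switch = score[1 - k] - switch_penalty
            if stay >= switch:
                best, prev = stay, k
            else:
                best, prev = switch, 1 - k
            new_score.append(best + gaussian_logpdf(x, *p))
            pointers.append(prev)
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the better final class.
    k = 0 if score[0] >= score[1] else 1
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    # Merge runs of identical labels into segments.
    segments, start = [], 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[start]:
            segments.append((labels[path[start]], start, i))
            start = i
    return segments


if __name__ == "__main__":
    print(segment([0.1, -0.2, 0.0, 3.1, 2.9, 3.2, 0.1, -0.1]))
```

The switch penalty plays the role of a segment-level prior: a single outlying frame does not gain enough likelihood to pay for two class changes, so it does not open a new segment, while a sustained run of filled-pause-like frames does.
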
  Spoken dialogue systems play a crucial role in human-computer interfaces. Through this technology, people can conveniently interact with a computer to access data in daily life. Several spoken dialogue systems have been demonstrated in real-world applications, such as air travel information services (ATIS), weather forecast systems, automatic call managers and ticket reservation services. However, disfluency and the modeling of users’ intentions in spontaneous speech remain to be solved before such dialogue systems are truly robust.
The purpose of this study is to improve the recognition rate for disfluent speech and the accuracy with which communicative intentions are identified in spontaneous speech in a spoken dialogue system. To achieve this goal, the dissertation focuses on two issues: disfluency analysis and segmentation, and speech act modeling and verification.
Theories in pattern recognition, language modeling, artificial intelligence and multivariate analysis provide the essential principles for this research. More specifically, the study aims to: 1) analyze the properties of filled pauses and use the resulting features for discriminative modeling of filled pauses; 2) apply a disfluency detection algorithm based on a segmental Bayesian model and integrate the segmentation results with the speech recognizer to improve recognition accuracy on disfluent speech; 3) model the speech act (SA) of a sentence in spoken language so that users can interact conveniently with the computer agent; 4) verify descriptive information under the wide utterance variation of real-world environments, where spontaneous speech usually includes extraneous words, hesitations, disfluencies and other unexpected expressions, so as to reduce the penalty incurred by false identification.
Experiments were conducted to evaluate the proposed approach using a spoken dialogue system for an air travel information service (ATIS). Features of disfluencies were selected according to the properties of filled pauses and transformed by the Karhunen-Loève transform (KLT) and linear discriminant analysis (LDA) to obtain discriminant features for filled pause detection. Gaussian mixture models (GMMs), trained using a gradient descent algorithm, were then used to improve filled pause detection performance. Finally, a segmental Bayesian model is proposed to segment the input sequence into fluent speech and filled pauses using these GMMs. Integrating the segmental Bayesian model with the speech recognizer further improved the recognition rate. For speech act modeling and verification, the dissertation presents an approach to model speech acts and verify spontaneous speech with disfluency in a spoken dialogue system. Semantic information, syntactic structure and the fragment class of an input utterance are statistically encapsulated in a proposed speech act hidden Markov model (SAHMM) to characterize the speech act. An interpolation mechanism is used to re-estimate the state transition probabilities in the SAHMM, to deal with the sparsity of disfluencies in the training corpus. Finally, a Bayesian belief model (BBM), based on latent semantic analysis (LSA), is adopted to verify the candidate speech acts and output the final speech act. Experimental results show that the proposed approach gives an encouraging improvement in both speech act identification rate and semantic accuracy rate. The proposed strategy also effectively alleviates the disfluency problem in spontaneous speech.
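
To make the interpolation idea concrete, the following minimal sketch re-estimates state transition probabilities by linearly mixing maximum-likelihood bigram estimates with add-one-smoothed unigram estimates, so that transitions involving rarely observed states (such as a filled pause) keep non-zero probability. The abstract does not specify the exact interpolation scheme used in the SAHMM, so the function name `interpolated_transitions`, the weight `lam`, the state names and the choice of a unigram back-off distribution are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def interpolated_transitions(state_sequences, states, lam=0.7):
    """Estimate P(next | prev) as lam * ML bigram + (1 - lam) * smoothed unigram.

    `state_sequences` is a list of state-label sequences; every label is
    assumed to appear in `states`. `lam` is a hypothetical interpolation weight.
    """
    bigram = defaultdict(Counter)
    unigram = Counter()
    for seq in state_sequences:
        unigram.update(seq)
        for prev, nxt in zip(seq, seq[1:]):
            bigram[prev][nxt] += 1
    total = sum(unigram.values())

    probs = {}
    for prev in states:
        row_total = sum(bigram[prev].values())
        probs[prev] = {}
        for nxt in states:
            # Add-one smoothed unigram estimate, never zero.
            p_uni = (unigram[nxt] + 1) / (total + len(states))
            # Maximum-likelihood bigram estimate; fall back to the unigram
            # estimate when `prev` was never seen with a successor.
            p_ml = bigram[prev][nxt] / row_total if row_total else p_uni
            probs[prev][nxt] = lam * p_ml + (1 - lam) * p_uni
    return probs


if __name__ == "__main__":
    corpus = [["QUERY", "FP", "CITY"], ["QUERY", "CITY"], ["QUERY", "CITY"]]
    table = interpolated_transitions(corpus, ["QUERY", "FP", "CITY"])
    for prev, row in table.items():
        print(prev, {nxt: round(p, 3) for nxt, p in row.items()})
```

Because both component distributions sum to one over `states`, each interpolated row also sums to one, and a transition never seen in training still receives a share of the unigram mass.
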
  Future work should investigate more disfluency phenomena and how intention is represented in sentence structure, in order to improve the performance of dialogue systems in real applications. The outcomes are expected to provide helpful information for linguists, phoneticians and computer scientists to analyze human-machine behavior and develop the relevant human-machine technology.