
Graduate student: Liang, Wei-Bin (梁維彬)
Thesis title: A Study on Semantic Representation and Emotion Recognition in Spoken Dialogue Systems (應用語意表示和情緒辨識於口述對話系統之研究)
Advisor: Wu, Chung-Hsien (吳宗憲)
Degree: Doctor
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2011
Academic year of graduation: 99 (ROC calendar)
Language: English
Number of pages: 97
Keywords (Chinese): 對話系統, 發音變異, 不流暢語流, 語意特徵, 對話行為偵測, 情緒辨識
Keywords (English): spoken dialogue system, pronunciation variation, disfluent speech, semantic features, dialogue act detection, emotion recognition
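  • Over the past decade, considerable effort has gone into research on spoken dialogue systems. However, the variability of spontaneous speech severely degrades the performance of automatic speech recognition, leaving the recognition output without a usable semantic representation and thereby hindering the development of spoken dialogue systems. This dissertation focuses on the representation of speaker intention and on speaker emotion recognition in goal-oriented spoken dialogue systems. To develop such a system, it covers several topics: recognition of syllable-contracted words and detection of interruption points in disfluent speech for spontaneous speech recognition; semantic feature extraction and representation for natural language understanding; and speaker intention detection and speaker emotion recognition for dialogue management.
    In a spoken dialogue system, an error-prone automatic speech recognizer often causes semantic misunderstanding and task failure. The two main causes of recognition errors are pronunciation variation and disfluent speech. For pronunciation variation, this dissertation examines syllable-contracted words and proposes a data-driven rescoring method. The baseline recognizer produces the top $N$ recognition results in the first pass; in the second pass, meta-information is used to rescore possible syllable-contracted words in those $N$-best results. The meta-information comprises two scores: the transformation score of data-driven alternative pronunciation strings and the duration score obtained from frame alignment.
    Spontaneous speech is often accompanied by disfluency, and the interruption point is an important feature of disfluent speech that occurs at syllable boundaries. To detect interruption points using prosodic features at syllable boundaries, this dissertation proposes a probabilistic method that integrates the confidence score estimated by error-prone spontaneous speech recognition, the similarity of the prosodic features at the current potential interruption point, and an interruption-point likelihood computed by a conditional random field model using clustered prosodic features. In addition, pitch reset and lengthening are applied to improve interruption point detection.
    The recognizer output is passed to the natural language understanding component for semantic feature extraction. However, imperfect speech recognition often degrades speaker intention detection. In this dissertation, a partial sentence tree is used as a robust representation of the sentence output by the recognizer. Moreover, developing a natural language understanding module is very labor-intensive. We therefore use the derivation rules of the language to represent the input utterance in a vector space.
    In a goal-oriented spoken dialogue system, a major function of dialogue management is to detect the speaker's intention sequentially from the current utterance and the dialogue history. We therefore propose data-driven speaker intention types to describe the dialogue behavior between user and system, and we model the relationship between speaker intentions and the derivation rules. This model is used to produce the semantic score for speaker intention detection.
    Current applications of spoken dialogue systems are still limited to simple information-query dialogue systems; however, a spoken dialogue system should sometimes be able to respond differently according to the speaker's emotion. This study therefore presents emotion recognition for emotional speech using acoustic-prosodic information and semantic labels. For acoustic-prosodic information, features are extracted from emotionally salient segments and modeled with Gaussian mixture models, support vector machines, and multilayer perceptrons as base-level classifiers, which are then fused with a meta decision tree. For semantic-label-based recognition, semantic labels selected from a Chinese knowledge base are used to generate emotion association rules automatically from the recognizer output, and a maximum entropy model then models the relationship between those rules and emotional states. Finally, weighted product fusion integrates the acoustic-prosodic and semantic-label recognition results.
    Finally, the experimental results in each chapter show that these methods improve the dialogue system.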

    Over the past decades, significant effort has been devoted to research on spoken dialogue systems (SDSs). However, the variability of spontaneous speech severely degrades the performance of automatic speech recognition (ASR). Furthermore, the lack of an applicable semantic interpretation for spontaneous speech hinders the development of spoken dialogue systems. This dissertation focuses on semantic representation of the speaker's dialogue act (DA) and on emotion recognition for dialogue control in goal-oriented spoken dialogue systems. To develop such a system, the dissertation addresses several topics: recognition of syllable-contracted words for pronunciation variation (PV) and interruption point (IP) detection for disfluent speech in ASR; semantic feature extraction and representation in natural language understanding (NLU); and DA detection and emotion recognition in dialogue control.
    In an SDS, error-prone ASR output may lead to misunderstanding and task failure. PV and disfluent speech are key factors contributing to the high error rate of current ASR. In this dissertation, we investigate the problem of syllable-contracted (SC) words and propose a data-driven approach that rescores the coarse speech recognition results to handle PV. In the first pass, the baseline ASR outputs the $N$-best recognized word strings. In the second pass, meta-information is employed to rescore the recognized words that are possible SC-words. The meta-information comprises the transformation score of data-driven alternative pronunciations and the duration score obtained from frame alignment.
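    The two-pass rescoring idea can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the abstract does not state how the scores are combined, so the log-linear combination, the weights `w_trans` and `w_dur`, and the names `Hypothesis` and `rescore_nbest` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One first-pass N-best entry (hypothetical structure)."""
    words: str
    asr_log_score: float             # first-pass recognizer score
    transformation_log_score: float  # alternative-pronunciation score
    duration_log_score: float        # frame-alignment duration score


def rescore_nbest(nbest, w_trans=0.5, w_dur=0.5):
    """Return the N-best list re-ranked by a weighted sum of log scores.

    The log-linear combination is an assumption; the source only says
    that both kinds of meta-information are used in the second pass.
    """
    def combined(h):
        return (h.asr_log_score
                + w_trans * h.transformation_log_score
                + w_dur * h.duration_log_score)

    return sorted(nbest, key=combined, reverse=True)


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    nbest = [
        Hypothesis("hypothesis A", -10.0, -5.0, -4.0),
        Hypothesis("hypothesis B", -11.0, -1.0, -1.0),
    ]
    for h in rescore_nbest(nbest):
        print(h.words)
```

    In this sketch, setting both weights to zero reproduces the first-pass ranking, which makes it easy to check how much the meta-information changes the result.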
    Spontaneous speech is often accompanied by disfluencies, and the IP is an important characteristic of disfluent speech that occurs at inter-syllable boundaries. To detect the IP from inter-syllable boundary-based prosodic features, this dissertation presents a probabilistic scheme that integrates the confidence score of the recognized speech estimated by error-prone spontaneous speech recognition, the similarity measure of the prosodic features at the current potential IP, and the IP likelihood measure estimated by a Conditional Random Field (CRF) model using clustered prosodic features. In addition, pitch reset and lengthening are applied to improve the performance of IP detection.
    Then, the ASR output is passed to the NLU component to extract semantic features. However, imperfect ASR output often degrades the performance of DA detection. In this dissertation, our system uses a partial sentence tree as a robust representation of an ASR output sentence. This representation is further processed to extract the semantic information, which is indicative of the DA.
    In a goal-oriented spoken dialogue system, one major function of the spoken language understanding unit is to sequentially detect the speaker's DA given the current utterance and the dialogue history. Thus, we use data-driven DA types to describe the dialogue behaviors, and we model the relationship between the data-driven DA types and the derivation rules. The model is used to generate a semantic score for DA detection.
    Current applications of SDS are still limited to simple information dialogue systems. In order to endow an SDS with affective interaction, this study presents an approach to emotion recognition of affective speech using acoustic-prosodic (AP) information and semantic labels (SLs). For AP-based recognition, acoustic and prosodic features are extracted from the detected emotionally salient segments of the input speech. Three types of models, Gaussian mixture models (GMMs), support vector machines (SVMs), and multilayer perceptrons (MLPs), are adopted as the base-level classifiers. A Meta Decision Tree (MDT) is then employed for classifier fusion to obtain the AP-based emotion recognition confidence. For SL-based recognition, semantic labels derived from an existing Chinese knowledge base called HowNet are used to automatically extract Emotion Association Rules (EARs) from the recognized word sequence of the affective speech. The maximum entropy model (MaxEnt) is then used to characterize the relationship between emotional states and EARs for emotion recognition. Finally, a weighted product fusion method is used to integrate the AP-based and SL-based recognition results for the final emotion decision.
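    As an illustration of the final fusion step, the sketch below shows one common form of weighted product fusion, in which each emotion's fused score is proportional to the AP-based score raised to a weight w times the SL-based score raised to 1 - w. The abstract does not give the exact formula, so this form, the function name `weighted_product_fusion`, and the example scores are assumptions.

```python
def weighted_product_fusion(ap_scores, sl_scores, weight):
    """Fuse two per-emotion score dictionaries with a weighted product.

    fused(e) is proportional to ap_scores[e] ** weight * sl_scores[e] ** (1 - weight).
    This particular form is an assumption, not taken from the source.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if set(ap_scores) != set(sl_scores):
        raise ValueError("both classifiers must score the same emotions")

    fused = {
        emotion: (ap_scores[emotion] ** weight)
        * (sl_scores[emotion] ** (1.0 - weight))
        for emotion in ap_scores
    }
    total = sum(fused.values())
    if total == 0.0:
        raise ValueError("all fused scores are zero")
    # Normalize so the fused scores sum to one.
    return {emotion: score / total for emotion, score in fused.items()}


if __name__ == "__main__":
    # Made-up confidences for two emotion classes, for illustration only.
    ap = {"happy": 0.6, "angry": 0.4}
    sl = {"happy": 0.2, "angry": 0.8}
    fused = weighted_product_fusion(ap, sl, weight=0.5)
    print(max(fused, key=fused.get))
```

    With the weight set to 1 the decision depends only on the AP-based scores, and with the weight set to 0 only on the SL-based scores, so sweeping the weight shows how much each information source contributes.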
    Finally, the experimental results show that the proposed approaches improve semantic representation and dialogue control for SDS.

    Contents
    1 Introduction
      1.1 Motivation
      1.2 The Approach of this Dissertation
      1.3 Structure of the Dissertation
    2 Automatic Speech Recognition
      2.1 Speech Recognition Fundamental
      2.2 Pronunciation Variation
        2.2.1 Related Work of Pronunciation Variation
        2.2.2 Framework of SC-Word Recognition
        2.2.3 Meta-information for Acoustic Level Rescoring
      2.3 Disfluent Speech
        2.3.1 Related Work of IP Detection
        2.3.2 Framework of IP Detection
        2.3.3 Probabilistic Model for Detecting IP
    3 Natural Language Processing
      3.1 Semantic Feature Extraction
      3.2 Extraction of Lexical Feature for Dialogue Act Detection
        3.2.1 Robust Representation of ASR result
        3.2.2 Extraction of Derivation Rules with Stochastic Parsing
        3.2.3 Sentence Representation in a Vector Space
      3.3 Emotionally Acoustic Cue
        3.3.1 Emotionally Salient Segment
      3.4 Emotionally Textual Cue
        3.4.1 Emotion Generation Rules
        3.4.2 Semantic Labels
        3.4.3 Emotion Association Rules
    4 Dialogue Management
      4.1 Dialogue Act Detection
        4.1.1 Framework of Dialogue Act Detection
        4.1.2 Models for Dialogue Act Detection
      4.2 Emotion Recognition for Affective Interaction
        4.2.1 Acoustic-Prosodic Information-based Classifiers
        4.2.2 Semantic Label-based Classifier Using MaxEnt
        4.2.3 Integration of AP- and SL-based Approaches
        4.2.4 Consideration of Personality Trait
    5 Evaluations
      5.1 Evaluations and Discussion of ASR
        5.1.1 Evaluation Corpus
        5.1.2 Implementation of Speech Recognizer
        5.1.3 Evaluations of SC-word Recognition
      5.2 Evaluations of Interruption Point Detection
        5.2.1 Feature Extraction for IP Detection
        5.2.2 Analysis of Prosodic Features at Interruption Points
        5.2.3 Experimental Setup and Measurement Methods
        5.2.4 Results of Interruption Point Detection
        5.2.5 Comparison of CRF-model with assumptions
      5.3 Evaluations of Dialogue Act Detection
        5.3.1 Data Collection
        5.3.2 ASR Module
        5.3.3 The z-Score Threshold
        5.3.4 The Number of DA Types
        5.3.5 Evaluation of Feature Sets
        5.3.6 Evaluation of the History Score
      5.4 Evaluations of Emotion Recognition
        5.4.1 Corpora for Emotion Recognition
        5.4.2 Experimental Setup of Emotion Recognition
        5.4.3 Evaluation Results of Emotion Recognition
    6 Conclusion and Future Work
      6.1 Conclusion
      6.2 Future Work
    Bibliography
    Appendix
      List of Dialogue Act Types

    List of Figures
      1.1 Typical chain of an SDS
      2.1 Model of an automatic speech recognizer
      2.2 Illustration of factors of syllable contraction.
      2.3 Framework of the proposed approach for syllable contrasted-word recognition
      2.4 Illustration of alternative pronunciation selection
      2.5 Illustration of duration score and modeling.
      2.6 Framework of the probabilistic model for IP detection
      2.7 Prosodic Feature Extraction
      2.8 An example illustrating the IP detection process based on tagging strategy.
      2.9 Illustration of feature clustering
      2.10 Illustration of feature clustering employed in the observation function of CRF
      3.1 Diagram of Derivation Rule Extraction
      3.2 Construction of the partial sentence tree for the sentence Where is the Anping-Fort.
      3.3 Examples of the parse result (left) and the extracted derivation rules (right) corresponding to the four partial sentences in Fig. 3.2
      3.4 An illustration of the definition and extraction of emotionally salient segments
      3.5 Tree structure of the Semantic Labels [WCL06]
      4.1 Details of the SLU and DM modules.
      4.2 An overview of training and testing flowchart of the acoustic-prosodic information-based recognition, the semantic label-based recognition and the personality trait
      4.3 Example of a meta decision tree
      4.4 Some example questions in EPQ [Sim]
      4.5 Personality trait chart [Dag]
      5.1 The effect of pronunciation lexicon size.
      5.2 Precision and recall rate
      5.3 Illustration of common speech events such as pitch reset and lengthening
      5.4 Illustration of HMM-based approach to IP detection.
      5.5 Comparison results of the CRF model using pitch reset and lengthening
      5.6 Comparison between the proposed approach and Liu’s work
      5.7 The environmental setting (bottom) and the beginning part of a collected dialogue (top). The operator acts like an SDS, and the user acts like s/he is interacting with an SDS.
      5.8 Evaluation results using weighted product fusion as a function of the weight value

    List of Tables
      2.1 An example bridging the SC and three error types in speech recognition
      2.2 Examples of transformation score of different alternative pronunciations (AP) for dictionary entry “這樣”
      2.3 Definition of edit disfluency
      2.4 Examples of disfluent speech
      3.1 Some examples of EGRs
      3.2 Some example words belonging to Specific SLs [WCL06]
      3.3 Some example words belonging to Negative SLs
      3.4 Some example words belonging to Disjunctive SLs
      4.1 Meta-data in MDT
      4.2 The relationship between emotion state and two dimensions defined in EPQ
      5.1 Ratio of canonical pronunciation and pronunciation variation
      5.2 Ratio of various length of contracted syllable
      5.3 CP of 47 most frequent SC-words and their most frequent AP.
      5.4 Average Recognition Results at Syllable Level
      5.5 Ratio of mis-recognition to SC-word
      5.6 Recognition performance using different weighted scores
      5.7 Classification performance of Duration Score
      5.8 Classification accuracy (%) using the probability distribution of the prosodic features based on the GMM with different mixture numbers (M: mixture number)
      5.9 Percentage accuracy (%) for CRF-based IP likelihood measure under various context lengths against single CRF
      5.10 Evaluation performance for different conditions without additional information
      5.11 Evaluation performance of FINAL part lengthening
      5.12 Evaluation performance with pitch reset event
      5.13 Syllable accuracy (%) of speech recognition with or without the proposed probabilistic IP detection model
      5.14 Examples of named entity classes (NEC) and semantic classes
      5.15 Word-level ASR recognition accuracy with Clean and three artificial speech - Foot for footfall, Human for human speech, and Both for both noise
      5.16 Detection accuracies with varying numbers of DA types.
      5.17 Detection Accuracies with Varying Feature Sets under varying ASR conditions.

    [AIS93] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In ACM SIGMOD international conference on Management of data, 1993.
    [Ass] The Association for Computational Linguistics and Chinese Language Processing (ACLCLP). Brief Introduction to TCC-300 Corpus.
    [AZC03] N. Amir, S. Ziv, and R. Cohen. Characteristics of authentic anger in hebrew speech. In European Conference on Speech Communication and Technology, pages 713–716, 2003.
    [Ban09] S. Banerjee. Nist conducts rich transcription evaluation. IEEE Speech and Language Processing Technical Committee Newsletter (SLTC), 2009.
    [BDS92] J. Bear, J. Dowding, and E. Shriberg. Integrating multiple knowledge sources for detection and correction of repairs in human-computer dialog. In Annual Meeting on Association for Computational Linguistics (ACL), pages 56–63, 1992.
    [BPP96] A. Berger, S. Della Pietra, and V. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.
    [Bur76] R. R. Burton. Semantic Grammar: A Technique for Efficient Language Understanding in limited Domains. PhD thesis, University of California, Irvine, 1976.
    [BW09] P. Boersma and D. Weenink. Praat: Doing phonetics by computer. Computer Program, May 1 2009. Version 5.1.05.
    [Chu97] R.-F. Chung. Syllable contraction in chinese. In F.-F. Tsao & S. H. Wang, editor, Chinese Languages and Linguistics III. Morphology and Lexicon. Symposium Series of the Institute of History and Philology, pages 199–235. Academia Sinica, Taiwan, 1997.
    [CJ04] E. Charniak and M. Johnson. A tag-based noisy channel model of speech repairs. In 42nd Annual Meeting on Association for Computational Linguistics (ACL), 2004.
    [CL] C.-C. Chang and C.-J. Lin. Libsvm – a library for support vector machines. Computer Program.
    [CR99] S. F. Chen and R. Rosenfeld. A gaussian prior for smoothing maximum entropy models. Technical report, CMU, 1999.
    [CS99] M. G. Core and L. K. Schubert. A syntactic framework for speech repairs and other disruptions. In Annual Meeting on Association for Computational Linguistics (ACL), pages 413–420, 1999.
    [CW69] S. C. Choi and R. Wette. Maximum likelihood estimation of the parameters of the gamma distribution and their bias. Technometrics, 11(4):683–690, 1969.
    [Dag] R. Dagan. Temperament: a brief survey, with modern applications. Online.
    [DD98] Z. Dong and Q. Dong. Hownet. Online, 1998.
    [DHS01] R. O. Duda, P. E. Hart, and D. G. Stork, editors. Pattern Recognition. Wiley Interscience Publication, 2 edition, 2001.
    [DLV03] L. Devillers, L. Lamel, and I. Vasilescu. Emotion detection in task-oriented spoken dialogues. In IEEE International Conference on Multimedia and Expo (ICME), pages 549–552, 2003.
    [dSN00] C. de Silva and P. C. Ng. Bimodal emotion recognition. In IEEE International Conference on Automatic Face and Gesture Recognition (ICAFGR), pages 332–335, 2000.
    [EE75] H. J. Eysenck and S. B. G. Eysenck. Manual of the Eysenck Personality Questionnaire. London: Hodder and Stoughton, 1975.
    [Ent00] Exploiting Latent Semantic Information in Statistical Language Modeling, volume 88, 2000.
    [FWW+96] E. Fosler, M. Weintraub, S. Wegmann, Y.-H. Kao, S. Khudanpur, C. Galles, and M. Saraclar. Automatic learning of word pronunciation from data. In International Conference on Spoken Language Processing (ICSLP), 1996.
    [GHO+01] A. Ganapathiraju, J. Hamaker, M. Ordowski, G. Doddington, and J. Picone. Syllable-based large vocabulary continuous speech recognition. IEEE Trans. on Speech and Audio Processing, 9(4):358–366, 2001.
    [GRW97] A. L. Gorin, G. Riccardi, and J. H. Wright. How may I help you? Speech Communication, 23:113–127, 1997.
    [HA99] P. A. Heeman and J. F. Allen. Speech repairs, intonational phrases, and discourse markers: Modeling speakers’ utterances in spoken dialogue. Computational Linguistics, 25(4):527–571, 1999.
    [HAH05] X. Huang, A. Acero, and H.-W. Hon. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, chapter 15, pages 753–755. Prentice Hall PTR, 1 edition, 2005.
    [Hai05] T. Hain. Implicit modelling of pronunciation variation in automatic speech recognition. Speech Communication, 46(2):171–188, 2005.
    [HBB06] A. Hämäläinen, L. Bosch, and L. Boves. Pronunciation variant-based multi-path hmms for syllables. In International Conference on Spoken Language Processing (INTERSPEECH), pages 1579–1582, 2006.
    [HBB07] A. Hämäläinen, L. Bosch, and L. Boves. Modeling pronunciation variation using multi-path hmms for syllables. In International Conference on Acoustic, Speech, and Signal Processing (ICASSP), pages 781–784, 2007.
    [HCH06] Z. Huang, L. Chen, and M. Harper. An open source prosodic feature extraction tool. In Language Resources and Evaluation Conference (LREC’06), 2006.
    [HHSL02] T. J. Hazen, I. Lee Hetherington, H. Shu, and K. Livescu. Pronunciation modeling using a finite-state transducer representation. In ISCA Workshop on Pronunciation Modeling and Lexicon Adaptation, 2002.
    [HOM+08] Chiori Hori, Kiyonori Ohtake, Teruhisa Misu, Hideki Kashioka, and Satoshi Nakamura. Dialog management using weighted finite-state transducers. In Proc. INTERSPEECH-2008, pages 211–214, 2008.
    [HOM+09a] C. Hori, K. Ohtake, T. Misu, H. Kashioka, and S. Nakamura. Recent advances in wfst-based dialog system. In INTERSPEECH’2009, pages 268–271, 2009.
    [HOM+09b] C. Hori, K. Ohtake, T. Misu, H. Kashioka, and S. Nakamura. Statistical dialog management applied to wfst-based dialog systems. In IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), pages 4793–4796, 2009.
    [HOM+09c] C. Hori, K. Ohtake, T. Misu, H. Kashioka, and S. Nakamura. Weighted finite state transducer based statistical dialog management. In ASRU, 2009.
    [HWL+08] C.-L. Huang, C.-H. Wu, H.-Z. Li, C.-H. Hsieh, and Bin Ma. Unsupervised pronunciation grammar growing using knowledge-based and data-driven approaches. In IEEE International Conference of Multimedia & Expo (ICME), pages 1097–1100, 2008.
    [JBMR01] D. Jurafsky, A. Bell, M. Gregory, and W. D. Raymond. The effect of language model probability on pronunciation reduction. In International Conference on Acoustic, Speech, and Signal Processing (ICASSP’2001), pages 801–804, 2001.
    [JM09] D. Jurafsky and J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Education, 2 edition, 2009.
    [KSO04] J. Kim, S. F. Schwarm, and M. Ostendorf. Detecting structural metadata with decision tree and transformation based learning. In North American Chapter of the Association for Computational Linguistics on Human Language Technology (HLT/NAACL’04), pages 137–144, 2004.
    [Kud09] T. Kudo. CRF++: Yet another CRF toolkit. Computer Program, 2009.
    [LCCK+08] R. López-Cózar, Z. Callejas, M. Kroul, J. Nouza, and J. Silovský. Two-level fusion to improve emotion classification in spoken dialogue systems. In TSD’08, pages 617–624, 2008.
    [Le] Z. Le. Maximum entropy modeling toolkit for python and c++. Computer program.
    [LF04] Yi Liu and P. Fung. Pronunciation modeling for spontaneous mandarin speech recognition. International Journal of Speech Technology, 7(2-3):155–172, 2004.
    [LL96] R.S. Lazarus and B.N. Lazarus. Passion and Reason: Making Sense of Our Emotions. Oxford University Press, New York, 1996.
    [LL09] C.-K. Lin and L.-S. Lee. Improved features and models for detecting edit disfluencies in transcribing spontaneous mandarin speech. IEEE Trans. on Acoustic, Speech, and Language Processing, 17(7):1263–1278, 2009.
    [LM00] R. J. Larsen and M. L. Marx. An Introduction to Mathematical Statistics and Its Applications. Prentice Hall, 3 edition, 2000.
    [LM03] R. Levy and C. Manning. Is it harder to parse chinese, or the chinese treebank? In 41st Annual Meeting on Association for Computational Linguistics, volume 1, 2003.
    [LMP01] J. Lafferty, A. Mccallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML’01), pages 282–289, 2001.
    [LN89] D. C. Liu and J. Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical Programming, 45(3):503–528, 1989.
    [LN05] C.-M. Lee and S. Narayanan. Toward detecting emotions in spoken dialogs. IEEE Trans. on Speech and Audio Processing, 13(2):293–303, March 2005.
    [LNH09] I. Luengo, E. Navas, and I. Hernáez. Combining spectral and prosodic information for emotion recognition in the interspeech 2009 emotion challenge. In INTERSPEECH, pages 332–335, 2009.
    [LSS+06] Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper. Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Trans. on Acoustic, Speech, and Language Processing, 14(5):1526–1540, 2006.
    [LW05] Y.-S. Lo and Y.-R. Wang. An implementation of spontaneous mandarin speech recognition baseline system. Master’s thesis, Dept. of Communication Engineering, NCTU, Taiwan, 2005.
    [LW10] C.-H. Liu and C.-H. Wu. Semantic role labeling with discriminative feature selection for spoken language understanding. In INTERSPEECH’2010, 2010.
    [LWK08] W.-B. Liang, C.-H. Wu, and Y.-K. Kang. Recognition of syllable-contracted words in spontaneous speech using word expansion and duration information. In International Symposium on Chinese Spoken Language Processing (ISCSLP’2008), pages 225–228, 2008.
    [LYWL08] W.-B. Liang, J.-F. Yeh, C.-H. Wu, and C.-C Liou. Interruption point detection of spontaneous speech using prior knowledge and multiple features. In IEEE Conference on Multimedia and Expo (ICME’2008), pages 1457–1460, 2008.
    [MK10] T. Misu and T. Kawahara. Bayes risk-based dialogue management for document retrieval system with speech interface. Speech Communication, 52(1):61–71, 2010.
    [NFS03] T. Nwe, S. Foo, and L. De Silva. Speech emotion recognition using hidden markov models. Speech Communication, 41(4):603–623, 2003.
    [NH94] C. Nakatani and J. Hirschberg. A corpus-based study of repair cues in spontaneous speech. Journal of the Acoustical Society of America, 95(3):1603–1616, 1994.
    [NIS04] NIST. Rich transcription (RT-04F) evaluation plan, 2004.
    [NK04] H. Nanjo and T. Kawahara. Language model and speaking rate adaptation for spontaneous presentation speech recognition. IEEE Trans. on Speech and Audio Processing, 12(4):391–400, 2004.
    [Pri90] P. J. Price. Evaluation of spoken language systems: the atis domain. In Proc. the workshop on Speech and Natural Language, 1990.
    [PS00] A. Paeschke and W. Sendlmeier. Prosodic characteristics of emotional speech: Measurements of fundamental frequency movements. In International Speech Communication Association Tutorial and Research Workshop (ITRW) on Speech and Emotion, pages 75–80, 2000.
    [Qui93] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
    [RG00] Venkata Ramana and Rao Gadde. Modeling word duration for better speech recognition. In International Conference on Spoken Language Processing (ICSLP), volume 1, pages 601–604, 2000.
    [Rij79] C. J. Van Rijsbergen, editor. Information Retrieval. Butterworths, London, 2 edition, 1979.
    [SDS04] M. Snover, B. Dorr, and R. Schwartz. A lexically-driven algorithm for disfluency detection. In North American Chapter of the Association for Computational Linguistics on Human Language Technology (HLT/NAACL’04), pages 157–160, 2004.
    [Sim] Similarminds.com. Personality test. Online.
    [SLN04] F.K. Song, W.-K. Lo, and S. Nakamura. Generalized word posterior probability for measuring reliability of recognized word. In SWIM2004, 2004.
    [SM03] M. Slaney and G. McRoberts. A recognition system for affective vocalization. Speech Communication, 39:367–384, 2003.
    [SP03] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In North American Chapter of the Association for Computational Linguistics on Human Language Technology (HLT/NAACL’03), pages 134–141, 2003.
    [SRL04] B. Schuller, G. Rigoll, and M. Lang. Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 17–21, 2004.
    [SSB09] B. Schuller, S. Steidl, and A. Batliner. The interspeech 2009 emotion challenge. In INTERSPEECH, pages 312–315, 2009.
    [SSHTT00] E. Shriberg, A. Stolcke, D. Hakkani-Tür, and G. Tür. Prosody-based automatic segmentation of speech into sentences and topics. Speech Communication, 32(1):127–154, 2000.
    [Sto02] A. Stolcke. SRILM - an extensible language modeling toolkit. In International Conference on Spoken Language Processing (ICSLP), pages 901–904, 2002.
    [Str] S. Strassel. Simple Metadata Annotation Specification Version 6.2. Linguistic Data Consortium.
    [Sun02] X. Sun. The Determination Analysis and Synthesis of Fundamental Frequency. PhD thesis, Northwestern University, 2002.
    [SW00] H. Soltau and A. Waibel. Acoustic models for hyperarticulated speech. In International Conference on Spoken Language Processing (ICSLP), 2000.
    [TCL07] M.-Y. Tsai, F.-C. Chou, and L.-S. Lee. Pronunciation modeling with reduced confusion for mandarin chinese using three-stage framework. IEEE Trans. Audio, Speech and Language Processing, 15(2):661–675, 2007.
    [TL02] S.-C. Tseng and Y.-F. Liu. Annotation manual of mandarin conversational dialogue corpus. Technical Report 02-01, Chinese Knowledge Information Processing Group, Academia Sinica. Taiwan, 2002.
    [TL04] C.-Y. Tseng and Y.-L. Lee. Speech rate and prosody units: Evidence of interaction from mandarin chinese. In International Conference on Speech Prosody (SP’04), pages 215–254, 2004.
    [TRS98] D. T. Toledano, M. A. C. Rodriguez, and J. G. E. Sardina. Try to mimic human segmentation of speech using hmm and fuzzy logic post-correction rules. In 3rd ESCA/COCOSDA Workshop on Speech Synthesis, pages 1263–1266, 1998.
    [Tse05a] S.-C Tseng. Contracted syllables in mandarin: Evidence from spontaneous conversation. Journal of Language and Linguistics, pages 153–180, 2005.
    [Tse05b] S.-C Tseng. Syllable contraction in mandarin conversation dialogue corpus. International Journal of Corpus Linguistics, 10(1):63–83, 2005.
    [VA09] T. Vogt and E. Andre. Exploring the benefits of discretization of acoustic features for speech emotion recognition. In INTERSPEECH, pages 328–331, 2009.
    [vL07] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4), 2007.
    [Wal] Richard Wallace. The Artificial Linguistic Internet Computer Entity (A.L.I.C.E.).
    [WC01] C.-H. Wu and J.-H. Chen. Automatic generation of synthesis units and prosodic information for chinese concatenative synthesis. Speech Communication, 35:219–237, 2001.
    [WC07] C.-H. Wu and Z.-J. Chuang. Emotion recognition from speech using ig-based feature compensation. International Journal of Computational Linguistics and Chinese Language Processing, 12(1):65–78, 2007.
    [WCL06] C.-H. Wu, Z.-J. Chuang, and Y.-C. Lin. Emotion recognition from text using semantic label and separable mixture model. ACM Trans. on Asian Language Information Processing, (2):165–182, June 2006.
    [Wes03] M. Wester. Pronunciation modeling for asr-knowledge-based and data-driven methods. Journal of Computer Speech and Language, 17:69–85, 2003.
    [WLY11] C.-H. Wu, W.-B. Liang, and J.-F. Yeh. Interruption point detection of spontaneous speech using inter-syllable boundary based prosodic features. ACM Trans. on Asian Language Information Processing, 10(1), March 2011.
    [YCXS01] F. Yu, E. Chang, Y.-Q. Xu, and H.-Y. Shum. Emotion detection from speech to enrich multimedia content. In IEEE Pacific-Rim Conference on Multimedia (PCM), 2001.
    [YKO+06] S. J. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland. The HTK Book. Cambridge University Press, 3.4 edition, 2006.
    [YW06] J.-F. Yeh and C.-H. Wu. Edit disfluency detection and correction using a cleanup language model and an alignment model. IEEE Trans. on Acoustic, Speech, and Language Processing, 14(5):1574–1583, 2006.
    [YWW07] J.-F. Yeh, C.-H. Wu, and W.-Y. Wu. Disfluency correction of spontaneous speech using conditional random fields with variable-length features. In INTERSPEECH’07, pages 2157–2160, 2007.
    [ZML06] Z.-Y. Zhou, Helen Meng, and W.-K. Lo. A multi-pass error detection and correction framework for mandarin lvcsr. In International Conference of Speech Language Processing (ICSLP), 2006.
