| Graduate student: | 劉啟權 Liou, Chi-Chiuan |
|---|---|
| Thesis title: | 應用條件隨機域於口語對話中不流暢語流之偵測 Disfluency Detection in Spontaneous Speech using Conditional Random Field |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | 碩士 Master |
| Department: | 電機資訊學院 - 資訊工程學系 Department of Computer Science and Information Engineering |
| Year of publication: | 2007 |
| Academic year of graduation: | 95 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 59 |
| Chinese keywords: | 模糊分群 (fuzzy clustering), 語速 (speech rate), 能量 (energy), 音高 (pitch), 不流暢語流 (disfluency), 口語 (spontaneous speech), 條件隨機域 (conditional random field) |
| English keywords: | pitch, disfluency, conditional random field, spontaneous speech, energy, fuzzy c-means |
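In everyday conversation, speakers must compose their sentences as they speak, so they need more time than when reading a text aloud. In spontaneous speech, where a response is expected within a short time, we found in the corpus that speakers use lengthening and pauses to buy time for composing an utterance. Because the time available for planning is short, errors are also more likely, so repetitions, restarts, and repairs frequently occur as well. If disfluencies can be detected effectively and the relevant information passed to a speech recognizer, the performance of spontaneous speech recognition can be improved.
This thesis uses a fuzzy conditional random field to detect disfluencies and to locate interruption points (IP), and analyzes these properties using the Mandarin Conversational Dialogue Corpus (現代漢語口語對話語料庫), collected and annotated by Academia Sinica. We attempt to detect where disfluent segments occur from their acoustic and prosodic properties, in the hope that this step can supply information to assist the adaptation of acoustic and language models. The thesis first applies hidden Markov models to build a baseline recognizer, which is used to locate syllable boundaries. It then examines the resets of energy and pitch in the corresponding segments, as well as changes in duration. Conditional random fields are used to combine the various speech parameters in an attempt to find the positions of interruption points in manually segmented test segments. To avoid the effect of errors introduced by hard clustering, an unsupervised fuzzy clustering algorithm is used to cluster and quantize the observed parameters, which then provide the state feature functions. The experimental results show that different quantization methods do affect IP detection.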
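The fuzzy quantization step described above can be illustrated with a minimal sketch of one-dimensional fuzzy c-means in plain Python. It is not the thesis's implementation: the duration values, the number of clusters, the fuzzifier `m = 2`, and the helper names `hard_label` and `soft_labels` (including the 0.2 threshold) are all assumptions made for illustration. The sketch only shows how a soft assignment keeps information that a hard cluster label discards.

```python
from typing import List, Tuple


def memberships(x: float, centers: List[float], m: float = 2.0) -> List[float]:
    """Fuzzy c-means membership of one value x in each cluster (requires m > 1)."""
    dists = [abs(x - c) for c in centers]
    zeros = [d == 0.0 for d in dists]
    if any(zeros):
        # x coincides with one or more centers: share membership among them.
        n = sum(zeros)
        return [1.0 / n if z else 0.0 for z in zeros]
    power = 2.0 / (m - 1.0)
    inv = [d ** -power for d in dists]
    total = sum(inv)
    return [v / total for v in inv]


def fuzzy_c_means(values: List[float], n_clusters: int = 3, m: float = 2.0,
                  tol: float = 1e-6, max_iter: int = 200
                  ) -> Tuple[List[float], List[List[float]]]:
    """Return (sorted cluster centers, membership row for each value)."""
    lo, hi = min(values), max(values)
    # Deterministic start: centers evenly spaced over the observed range.
    centers = [lo + (hi - lo) * (k + 0.5) / n_clusters for k in range(n_clusters)]
    for _ in range(max_iter):
        u = [memberships(x, centers, m) for x in values]
        new_centers = []
        for k in range(n_clusters):
            # Each center is the mean of the values weighted by membership ** m.
            weights = [row[k] ** m for row in u]
            new_centers.append(
                sum(w * x for w, x in zip(weights, values)) / sum(weights))
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    centers.sort()
    return centers, [memberships(x, centers, m) for x in values]


def hard_label(row: List[float]) -> int:
    """Hard quantization: index of the cluster with the largest membership."""
    return max(range(len(row)), key=lambda k: row[k])


def soft_labels(row: List[float], threshold: float = 0.2) -> List[int]:
    """Hypothetical soft quantization: every cluster whose membership passes the threshold."""
    return [k for k, v in enumerate(row) if v >= threshold]


# Hypothetical syllable durations in seconds (illustrative values, not corpus data).
durations = [0.12, 0.15, 0.14, 0.30, 0.33, 0.31, 0.70, 0.75]
centers, u = fuzzy_c_means(durations, n_clusters=3)

for x, row in zip(durations, u):
    print(f"{x:.2f}s hard={hard_label(row)} soft={soft_labels(row)} "
          f"memberships={[round(v, 3) for v in row]}")
```

A value lying between two cluster centers receives noticeable membership in both, so a soft encoding can activate two state features where a hard label would force a single, possibly wrong, choice.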
More time is needed to produce utterances in everyday conversation than in read speech. In our corpus, we observed that speakers lengthen words or insert pauses to gain time to organize an utterance under real-time response conditions such as spontaneous speech. Because speakers have less time to compose their words, they are more likely to make mistakes. Utterances therefore often contain repetitions, restarts, and repairs.
In this thesis, we analyzed the prosodic features of these disfluencies in the “Mandarin Conversational Dialogue Corpus”, which was collected and annotated by Academia Sinica in Taiwan. We try to use acoustic and prosodic features to detect disfluent segments, in the hope of providing information for adapting acoustic and language models.
HTK was used to build sub-syllable-level HMMs as a baseline speech recognizer for extracting syllable and sub-syllable boundaries from the speech signal. With this boundary information, we observed changes in energy, pitch, and duration in the respective segments. We use a Conditional Random Field (CRF) to integrate different acoustic and prosodic features in order to detect interruption points (IP) in human-labeled test segments. To reduce the effects of misclassification in quantization, we applied fuzzy c-means to provide the state feature functions used in the CRF.
The experimental results show that different quantization models affect IP detection differently.
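For orientation, the standard linear-chain CRF of Lafferty, McCallum, and Pereira (2001) can be written as below. This is the textbook form rather than a formula quoted from the thesis, and the notation is an assumption: $\mathbf{x}$ is the observation sequence, $y_t$ is the label at position $t$ (in this setting, presumably IP or non-IP at a syllable boundary), $f_k$ are transition feature functions, and $s_j$ are state feature functions of the kind the abstract says are derived from the quantized prosodic parameters.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Bigg(
      \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t)
      + \sum_{t=1}^{T} \sum_{j} \mu_j \, s_j(y_t, \mathbf{x}, t)
    \Bigg),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\!\Bigg(
      \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t)
      + \sum_{t=1}^{T} \sum_{j} \mu_j \, s_j(y'_t, \mathbf{x}, t)
    \Bigg)
```

The weights $\lambda_k$ and $\mu_j$ are learned from labeled data, and $Z(\mathbf{x})$ normalizes over all candidate label sequences $\mathbf{y}'$. Under this reading, changing the quantization changes which $s_j$ fire at each position, which is one plausible way the quantization choice could influence IP detection.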