| Graduate student: | 沈涵平 Shen, Han-Ping | 
|---|---|
| Thesis title: | 應用多空間機率模型及語者相關音素群組模型於語者聚類之研究 Speaker Clustering Using Speaker-Dependent Phone Cluster Models and MSD-HMM | 
| Advisor: | 吳宗憲 Wu, Chung-Hsien | 
| Degree: | Master | 
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering | 
| Year of publication: | 2007 | 
| Academic year of graduation: | 95 | 
| Language: | Chinese | 
| Number of pages: | 51 | 
| Chinese keywords: | speaker clustering, phone cluster, multi-space probability | 
| Foreign-language keywords: | speaker, clustering | 
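In recent years, the volume of spoken documents (such as broadcast news and meeting recordings) has grown rapidly, making their retrieval and management increasingly important. Audio clustering groups audio of the same nature or class so that a listener can easily tell which category a given segment belongs to. Examples include separating audio from different speakers into different classes (speaker clustering), separating audio with background music from audio without it, and classifying audio by male or female voice.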
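This thesis proposes a speaker clustering method that integrates acoustic, phonetic, and prosodic information. In the training phase, we build background and individual-speaker phone cluster models from the training corpus to model the pronunciation confusion of different speakers. To model MFCC and pitch at the same time, this study also uses multi-space probability distributions (MSD) to jointly model the MFCC and pitch of different speakers. In the testing phase, preprocessing first segments the utterances with a multiple change-point, sliding-window minimum description length (MDL) method. The segmented audio is then classified acoustically, and batch adaptation is performed according to the acoustic classification results. A recognizer then performs speech recognition on each audio segment, and the recognition results, together with maximum likelihood linear regression (MLLR) adaptation, are used to build, for each segment, phone cluster models based on speaker-dependent multi-space probability distribution hidden Markov models (MSD-HMM). In this way, acoustic, phonetic, and prosodic information are all integrated for clustering.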
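As a reminder of what a multi-space probability distribution expresses, the observation probability in the general MSD formulation from the literature can be written as below. The symbols ($G$ spaces with weights $w_g$, the index set $S(o)$, the value $V(o)$, and the voiced/unvoiced weights $w_{\mathrm{v}}$, $w_{\mathrm{u}}$) are notation assumed here for illustration and are not taken from the thesis.

```latex
% Observation o = (X, x): X is the set of space indices, x the observed vector.
% S(o) = X and V(o) = x.
b(o) = \sum_{g \in S(o)} w_g \, \mathcal{N}_g\bigl(V(o)\bigr),
\qquad \sum_{g=1}^{G} w_g = 1 .

% Typical use for pitch: a one-dimensional space for voiced frames and a
% zero-dimensional space for unvoiced frames, where the density is defined as 1.
b(o) =
\begin{cases}
  w_{\mathrm{v}} \, \mathcal{N}\bigl(f_0 ;\, \mu, \sigma^2\bigr) & \text{voiced frame with pitch } f_0, \\
  w_{\mathrm{u}} & \text{unvoiced frame.}
\end{cases}
```

This is what lets a single model cover a continuous stream such as MFCC together with a pitch stream that is undefined in unvoiced regions.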
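For evaluation, we use the Public Television Service broadcast news corpus (MATBN) as both training and test data. The experimental results confirm that building a speaker clustering system with speaker-dependent phone cluster models to model pronunciation confusion and multi-space probability distributions to model pitch is feasible, and that it performs better than a speaker clustering system that uses only low-level acoustic features.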
The amount of spoken documents, such as broadcast news and meeting recordings, has increased drastically in recent years, making their retrieval and management more and more significant. Audio clustering groups similar fragments of an input audio stream, for example by speaker or by foreground or background audio type. Speaker clustering can improve the performance of speech recognition and speaker identification.
This thesis presents an approach to speaker clustering. In the training phase, we build phone cluster models to extract phonetic features, namely the phone confusion information of different speakers, and we use speaker-dependent MSD-HMMs to model speaker prosody. In the testing phase, audio segmentation using an MDL-based method is performed first. Speaker grouping based on acoustic features is then applied to the segmented speech fragments, followed by a speech recognition system with unsupervised adaptation. Finally, bottom-up agglomerative clustering is performed based on acoustic, phonetic and prosodic features.
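The final bottom-up agglomerative clustering step can be illustrated with a minimal sketch. It is not the thesis implementation: each segment is reduced to a single feature vector, and the Euclidean distance between cluster centroids stands in for the combined acoustic, phonetic and prosodic score. The function names and the `threshold` stopping rule are assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def centroid(vectors):
    """Return the component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def agglomerative_cluster(segments, threshold):
    """Merge segments bottom-up until the closest clusters exceed `threshold`.

    segments: list of feature vectors, one per audio segment (hypothetical
    stand-in for per-segment model scores).
    Returns a list of clusters, each a list of segment indices.
    """
    # Start with every segment in its own cluster.
    clusters = [[i] for i in range(len(segments))]
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest centroid distance.
        for a, b in combinations(range(len(clusters)), 2):
            ca = centroid([segments[i] for i in clusters[a]])
            cb = centroid([segments[i] for i in clusters[b]])
            d = dist(ca, cb)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break  # remaining clusters are treated as different speakers
        # Merge cluster b into cluster a (a < b, so index a stays valid).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


if __name__ == "__main__":
    toy_segments = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
    print(agglomerative_cluster(toy_segments, threshold=1.0))
```

In a real system the distance would be replaced by a model-based score, and the stopping rule would usually be a model selection criterion rather than a fixed threshold.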
For the evaluation of the proposed method, the Mandarin Chinese Broadcast News Corpus (MATBN) is used as the spontaneous speech corpus. Experimental results show that the phone cluster model is useful for modeling the pronunciation confusion between different speakers, and that MSD is useful for modeling MFCC and pitch simultaneously. Combining these two kinds of information improves the performance of a speaker clustering system.