
Author: Lin, Po-chuan (林博川)
Title: Speaker Change Detection and Spoken Sentence Retrieval for Automatic Minute Taking (自動語音會議紀錄之語者切換點偵測與語句檢索演算法)
Advisor: Wang, Jhing-Fa (王駿發)
Degree: Doctoral
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2007
Academic Year of Graduation: 95 (2006–2007)
Language: English
Number of Pages: 138
Keywords (Chinese): spoken document retrieval, meeting speech processing, automatic minute taker, speaker change point detection, spoken sentence retrieval, partial matching, support vector machine, dynamic time warping
Keywords (English): Spoken Document Retrieval, Automatic Minute Taking, Partial Matching, DTW, MPEG-7 LLD, SVM
    As digital recording technology has become inexpensive and widespread, meeting data are being recorded more extensively than ever. To make this large volume of audio data accessible, there is a pressing need for an automatic minute taker that lets people search and retrieve the recorded speech easily.
    The first goal of this dissertation is to propose a novel speaker change detection algorithm that segments meeting speech into speaker-homogeneous segments, each containing speech from only one speaker. The algorithm defines an "SVM training misclassification rate" to measure the separability between speaker data and thereby decide whether two collected speech windows were uttered by the same speaker. Experiments on the NIST Rich Transcription 2005 Spring Evaluation (RT-05S) meeting corpus show that the proposed algorithm detects speaker changes better than the Bayesian information criterion (BIC) and other commonly used distances, including the Kullback-Leibler distances (KL, KL2), the generalized likelihood ratio (GLR), the Mahalanobis distance, and the Bhattacharyya distance, and that it can also effectively detect short speech segments of less than two seconds.
    The second goal of this dissertation is to propose two spoken sentence retrieval algorithms with partial matching capability: the whole-matching-plane-based (WMPB) accumulation method and the column-based row-based (CBRB) accumulation method. Users retrieve database sentences by speaking an entire sentence as the query. The algorithms can perform retrieval even when the query and a database sentence share only some keywords, and experimental results show that the proposed partial matching mechanisms work effectively on a PC, a Samsung S3C2410X embedded evaluation board, and an HP iPAQ H5550 PDA.
    For the third goal of this dissertation, a two-level matching method is proposed to reduce the computational load by roughly a factor of l_q (the number of frames in a query). In the first level, a similar frame tagging scheme locates speech segments that may match the input query; in the second level, these candidate segments are re-examined with a more precise similarity measure. In addition to the conventional Mel-frequency cepstral coefficients (MFCCs), several MPEG-7 audio low-level descriptors (LLDs) are also evaluated for improving sentence retrieval. Experimental results show that using MPEG-7 LLDs alone yields retrieval performance close to that of MFCCs (about a 4% drop in precision).
    Moreover, combining features further improves retrieval performance. Because the speech features are compared directly, none of the above sentence retrieval algorithms requires an acoustic or language model; the proposed methods are therefore language independent (not restricted to any particular language) and training free.

    With digital recording technology becoming inexpensive and popular, there has been a tremendous increase in the availability of meeting data. To make this huge amount of audio data accessible, there is a pressing need for an efficient automatic minute taking (AMT) system that enables easier search and retrieval of meeting information.
    The first goal of this dissertation is to develop a speaker change detection (SCD) algorithm for segmenting the meeting audio stream into intervals, each containing only one speaker. To evaluate the data separability between different speakers, an SVM training misclassification rate (STMR) is proposed to determine whether two collected speech windows were uttered by the same speaker. Compared with the Bayesian information criterion (BIC) and other commonly used distances (the Kullback-Leibler distance, the generalized likelihood ratio, the Mahalanobis distance, and the Bhattacharyya distance), the STMR identifies speaker changes more effectively with less speech data and is thus capable of detecting speaker segments shorter than two seconds, as shown by experiments on the NIST Rich Transcription 2005 Spring Evaluation (RT-05S) meeting corpus.
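    The STMR idea can be illustrated with a minimal sketch: train an SVM to separate the feature frames of two adjacent speech windows and use its training error as a separability score. If the two windows come from the same speaker, their frames overlap heavily and the training error stays high; if a speaker change occurred, the frames separate easily and the error drops. The sketch below assumes MFCC-like feature matrices and scikit-learn's SVC; the kernel, C value, threshold, and the helper names (stmr, is_speaker_change) are illustrative placeholders, not the settings or code reported in the dissertation.

```python
import numpy as np
from sklearn.svm import SVC

def stmr(window_a: np.ndarray, window_b: np.ndarray,
         kernel: str = "rbf", C: float = 1.0) -> float:
    """SVM training misclassification rate between two feature windows.

    window_a, window_b: (n_frames, n_features) arrays, e.g. MFCC frames
    from two adjacent speech windows.  Returns the fraction of training
    frames the SVM cannot separate: values well above zero suggest the
    windows overlap (same speaker), values near zero suggest they are
    easily separable (speaker change).
    """
    X = np.vstack([window_a, window_b])
    y = np.concatenate([np.zeros(len(window_a)), np.ones(len(window_b))])
    clf = SVC(kernel=kernel, C=C).fit(X, y)
    return 1.0 - clf.score(X, y)          # training misclassification rate

def is_speaker_change(window_a, window_b, threshold: float = 0.15) -> bool:
    """Hypothetical decision rule: declare a change point when the two
    windows are separable enough, i.e. the training error is low."""
    return stmr(window_a, window_b) < threshold
```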
    The second goal of this dissertation is to develop two partial sentence matching algorithms, the whole-matching-plane-based (WMPB) algorithm and the column-based row-based (CBRB) algorithm, for retrieving spoken sentences. Users speak whole sentences as queries and receive similarity-ranked sentences from a spoken-sentence database. Even when database sentences only partially match the query, experimental results show that the proposed algorithms run efficiently on a PC, a Samsung S3C2410X embedded evaluation board, and an HP iPAQ H5550 PDA.
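    The abstract does not spell out the WMPB or CBRB procedures, but the underlying partial matching idea can be illustrated with a generic subsequence-style DTW that lets a short query align against any contiguous portion of a longer database sentence. The sketch below is a stand-in under that assumption, not the dissertation's WMPB or CBRB algorithm; the feature frames, Euclidean local distance, and function names are placeholders.

```python
import numpy as np

def partial_dtw_cost(query: np.ndarray, sentence: np.ndarray) -> float:
    """Subsequence DTW: align the query (lq x d frames) against any
    contiguous region of a database sentence (ls x d frames) and return
    the best normalized alignment cost.  Lower cost = better (partial)
    match, even if only part of the sentence matches the query."""
    lq, ls = len(query), len(sentence)
    # local frame-to-frame Euclidean distances, shape (lq, ls)
    dist = np.linalg.norm(query[:, None, :] - sentence[None, :, :], axis=2)
    D = np.full((lq + 1, ls + 1), np.inf)
    D[0, :] = 0.0                      # the match may start at any sentence frame
    for i in range(1, lq + 1):
        for j in range(1, ls + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j],      # advance query only
                                               D[i, j - 1],      # advance sentence only
                                               D[i - 1, j - 1])  # advance both
    return D[lq, 1:].min() / lq        # the match may end at any sentence frame

def rank_sentences(query, database):
    """Rank database sentences by ascending partial-matching cost."""
    costs = [partial_dtw_cost(query, s) for s in database]
    return np.argsort(costs), costs
```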
    For the third goal of this dissertation, a feature-based spoken sentence retrieval (SSR) algorithm using two-level matching is proposed to reduce the computational load by roughly a factor of l_q (the number of frames in a query). In the first level, a similar frame tagging scheme locates possible segments of the spoken sentences that are similar to the user's query utterance. In the second level, a fine similarity between the query and each possible segment is evaluated. In addition to the conventional Mel-frequency cepstral coefficients (MFCCs), several MPEG-7 audio low-level descriptors (LLDs) are also used as features to explore their suitability for SSR. Experimental results show that the retrieval performance using MPEG-7 audio LLDs is close to that of the MFCCs (within a 4% difference in precision). Moreover, combining MPEG-7 audio LLDs with the MFCCs improves retrieval precision. Because the comparison is performed at the feature level, none of the proposed SSR algorithms requires an acoustic or language model; the proposed methods are therefore language independent and training free.
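    A rough sketch of the two-level idea, under assumptions the abstract does not specify: level one tags database frames that lie close to some query frame and scans a window over the tag sequence to pick out candidate segments; level two re-scores only the surviving segments with a finer comparison (a DTW such as the partial-matching sketch above would be used in practice). The thresholds, the rectangular window, the simple level-two cost, and the helper names are all illustrative placeholders.

```python
import numpy as np

def tag_similar_frames(query, sentence, tau=1.0):
    """Level 1a: mark each sentence frame within distance tau of at
    least one query frame (a crude 'similar frame tagging')."""
    dist = np.linalg.norm(sentence[:, None, :] - query[None, :, :], axis=2)
    return (dist.min(axis=1) <= tau).astype(float)       # shape (ls,)

def possible_segments(tags, win_len, min_hits):
    """Level 1b: slide a window over the tag sequence and keep start
    positions whose windows contain enough tagged frames."""
    window = np.ones(win_len)                  # rectangular window; a Hamming
    hits = np.convolve(tags, window, "valid")  # window could be used instead
    return [s for s, h in enumerate(hits) if h >= min_hits]

def fine_cost(query, segment):
    """Level 2 placeholder: mean best-frame distance; a DTW alignment
    would normally play this role."""
    d = np.linalg.norm(query[:, None, :] - segment[None, :, :], axis=2)
    return d.min(axis=1).mean()

def two_level_retrieve(query, database, win_len=None, tau=1.0, min_hits=None):
    """Rank sentences by the fine cost of their best candidate segment,
    skipping the fine evaluation wherever level one finds no candidates."""
    win_len = win_len or len(query)
    min_hits = min_hits or max(1, win_len // 2)
    scores = []
    for sent in database:
        starts = possible_segments(tag_similar_frames(query, sent, tau),
                                   min(win_len, len(sent)), min_hits)
        if not starts:
            scores.append(np.inf)              # no segment survives level one
            continue
        scores.append(min(fine_cost(query, sent[s:s + win_len]) for s in starts))
    return np.argsort(scores), scores
```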

    ABSTRACT (Chinese)
    ABSTRACT (English)
    ACKNOWLEDGEMENT
    CONTENTS
    CHAPTER 1 INTRODUCTION
      1.1 Context and Motivations of this Dissertation
        1.1.1 Introduction to AMT System
          1.1.1.1 Why Do We Focus on Meetings?
          1.1.1.2 Automatic Minute Taking (AMT) System
          1.1.1.3 Related Works of AMT
          1.1.1.4 System Diagram (Main Components)
      1.2 Objective 1: SCD Algorithm Development for AMT
        1.2.1 Related Works of Speaker Diarization: "Who Spoke When?"
        1.2.2 SCD Algorithm Development for AMT
      1.3 Objective 2: Personal Spoken Sentence Retrieval Algorithms Development for AMT
        1.3.1 Why Do We Need Meeting Data Retrieval?
        1.3.2 Model-Based Spoken Document Retrieval
        1.3.3 Feature-Based Spoken Document Retrieval
          1.3.3.1 Feature-Based Spoken Document Retrieval Using Sentence as Query Input
          1.3.3.2 Feature-Based Spoken Document Retrieval Using Keyword as Query Input
      1.4 Outline of the Dissertation
    CHAPTER 2 UNSUPERVISED SPEAKER CHANGE DETECTION FOR SHORT SPEECH SEGMENTS
      2.1 Introduction to Speaker Change Detection
      2.2 Two Essential SCD Characteristics for AMT System
        2.2.1 Unsupervised SCD
        2.2.2 SCD for Short Speech Segments
      2.3 Background: Existing Algorithms for SCD
        2.3.1 Decoder-Based Segmentation
        2.3.2 Model Selection-Based Segmentation
        2.3.3 Metric-Based Segmentation
        2.3.4 Model-Based Segmentation
      2.4 Background: Existing Meeting Browser Systems with SCD
      2.5 SCD Algorithm Development for Meetings
        2.5.1 Introduction to SVM
        2.5.2 SCD Algorithm Based on SVM Training Misclassification Rate
      2.6 Experimental Results
        2.6.1 Meeting-Domain Experiment Setup
        2.6.2 Evaluation of SVM Kernel Functions
        2.6.3 Finding the Best Setting of C
        2.6.4 Finding the Best Setting of Window Size and STMR Threshold
        2.6.5 Window Size Test
        2.6.6 SCD Evaluation for Speaker Segments Less Than Two Seconds
        2.6.7 Overall Experiments with the RT-05S Database
      2.7 Conclusions
    CHAPTER 3 FEATURE-BASED PARTIAL MATCHING ALGORITHMS FOR SPOKEN SENTENCE RETRIEVAL IN MEETINGS
      3.1 Spoken Sentence Retrieval for AMT
      3.2 Related Works
        3.2.1 Retrieving Speech Data by Text Input: Model-Based Recognition
        3.2.2 Retrieving Speech Data by Speech Input: Feature-Based Comparison
      3.3 Partial Matching Spoken Sentence Retrieval (PMSSR)
        3.3.1 The Partial Matching Concept
        3.3.2 Related Works for Dealing with the Term Mismatch Problem
        3.3.3 Whole-Matching-Plane-Based Algorithm
          3.3.3.1 Sentence Matching for Spoken Sentence Retrieval
          3.3.3.2 Feature-Level Partial Matching Spoken Sentence Retrieval
        3.3.4 Column-Based Row-Based Algorithm
          3.3.4.1 Basic Concept
          3.3.4.2 Consideration for Second Similar FPUs
          3.3.4.3 Consideration for Row-Based Matching
          3.3.4.4 CBRB Algorithm
      3.4 Experimental Results
        3.4.1 Experimental Environments
        3.4.2 Finding the Best FPU Sizes and IDW Functions for the WMPB Algorithm
        3.4.3 Finding the Best FPU Size and IDW Function for the CBRB Algorithm
        3.4.4 Finding the Best Alpha and Beta for the CBRB Algorithm
        3.4.5 Evaluation Phase
        3.4.6 Extending Our Approaches to an HP iPAQ H5550 PDA
          3.4.6.1 System Overview
          3.4.6.2 Computational Analysis
          3.4.6.3 Experimental Results
        3.4.7 Extending Our Approaches to a Samsung S3C2410X Embedded System
          3.4.7.1 System Overview
          3.4.7.2 Memory Reduction
          3.4.7.3 Parameter Setting and the Experimental Results
      3.5 Conclusions
    CHAPTER 4 SPOKEN SENTENCE RETRIEVAL USING TWO-LEVEL FEATURE MATCHING AND MPEG-7 AUDIO LLDS
      4.1 Introduction
      4.2 Related Works
      4.3 MPEG-7 Audio Low-Level Descriptors for Spoken Sentence Retrieval
        4.3.1 MPEG-7 Audio Low-Level Descriptors
          4.3.1.1 Audio Spectrum Centroid (ASC)
          4.3.1.2 Audio Spectrum Spread (ASS)
          4.3.1.3 Audio Spectrum Flatness (ASF)
          4.3.1.4 Instantaneous Harmonic Spectral Centroid (IHSC)
          4.3.1.5 Instantaneous Harmonic Spectral Spread (IHSS)
        4.3.2 Complexity Analysis of MPEG-7 LLDs and MFCCs
      4.4 Two-Level Matching Algorithm for Spoken Sentence Retrieval
        4.4.1 Possible Segment Extraction Level
          4.4.1.1 Highly Possible Segment Extraction Using Rectangular Window Scanning
          4.4.1.2 Highly Possible Segment Extraction Using Hamming Window Scanning
        4.4.2 Fine Similarity Evaluation Level
      4.5 Computational Analysis
        4.5.1 Direct Matching Method
          4.5.1.1 Local Distance
          4.5.1.2 Path Selection
        4.5.2 Two-Level Feature Matching Method
          4.5.2.1 Similar Frame Tagging
          4.5.2.2 Possible Segment Extraction Using Window Scanning
          4.5.2.3 Matching Possible Segments with Queries for Ranking Sentence Outputs
      4.6 Experimental Results
        4.6.1 Individual Feature Evaluations by Exact Matching
        4.6.2 Experimental Results
          4.6.2.1 Experimental Environment Setup
          4.6.2.2 Spoken Sentence Retrieval Using a Single Feature
          4.6.2.3 Spoken Sentence Retrieval Using Multiple Features
          4.6.2.4 Comparisons and Discussions
      4.7 Conclusions
    CHAPTER 5 CONCLUSIONS AND FUTURE WORKS
      5.1 Principal Contributions
      5.2 Future Research Works
    REFERENCES
    PUBLICATION LIST
      A: Journal Papers
      B: Conference Papers
      C: Patents
    VITA

