| Graduate Student: | 林博川 Lin, Po-chuan |
|---|---|
| Thesis Title: | 自動語音會議紀錄之語者切換點偵測與語句檢索演算法 Speaker Change Detection and Spoken Sentence Retrieval for Automatic Minute Taking |
| Advisor: | 王駿發 Wang, Jhing-Fa |
| Degree: | Doctor (Ph.D.) |
| Department: | College of Electrical Engineering and Computer Science - Department of Electrical Engineering |
| Publication Year: | 2007 |
| Graduation Academic Year: | 95 (2006–2007) |
| Language: | English |
| Pages: | 138 |
| Chinese Keywords: | 語音資料檢索 (spoken document retrieval)、會議語音處理 (meeting speech processing)、自動會議紀錄器 (automatic minute taker)、語者切換點偵測 (speaker change detection)、語句檢索 (spoken sentence retrieval)、部分比對 (partial matching)、支援向量機 (support vector machine)、動態時間校準 (dynamic time warping) |
| English Keywords: | Spoken Document Retrieval, Automatic Minute Taking, Partial Matching, DTW, MPEG-7 LLD, SVM |
As digital recording technology has become inexpensive and widespread, meeting data are being recorded ever more extensively. To make this large volume of audio data usable, there is an urgent need for an automatic minute taker that lets people search and retrieve the recorded speech easily.
The first goal of this dissertation is to propose a novel speaker change detection algorithm that segments meeting audio into speech segments, each containing speech from only one speaker. The algorithm defines an "SVM training misclassification rate" to measure the separability between speakers' data and thereby decide whether two collected speech windows were uttered by the same speaker. Experiments on the NIST Rich Transcription 2005 Spring Evaluation (RT-05S) meeting corpus show that the proposed algorithm detects speaker changes better than the Bayesian information criterion (BIC) and other commonly used distances, including the Kullback-Leibler distances (KL, KL2), the generalized likelihood ratio (GLR), the Mahalanobis distance, and the Bhattacharyya distance, and that it also detects short speech segments of less than two seconds effectively.
The second goal of this dissertation is to propose two spoken sentence retrieval algorithms with partial matching capability: a whole-matching-plane-based (WMPB) accumulation method and a column-based row-based (CBRB) accumulation method. Users retrieve database sentences by speaking an entire sentence as the query, and retrieval still works when the query sentence and a database sentence share only some keywords. Experimental results show that the proposed partial matching schemes run effectively on a PC, a Samsung S3C2410X embedded evaluation board, and an HP iPAQ H5550 PDA.
For the third goal of this dissertation, a two-level keyword matching method is proposed to reduce the computational load by roughly a factor of lq (the number of frames in a query). In the first level, a similar frame tagging scheme locates speech segments that may match the input keywords; in the second level, each candidate segment is verified with a more precise similarity measure. In addition to the conventional Mel frequency cepstrum coefficients (MFCCs), several MPEG-7 audio low-level descriptors (LLDs) are evaluated for improving sentence retrieval. The experiments show that retrieval using the MPEG-7 LLDs alone performs close to retrieval using the MFCCs (about 4% lower precision).
Combining the two feature sets further improves the retrieval performance. Because the comparison is performed directly on speech features, none of the above sentence retrieval algorithms requires an acoustic or language model; the proposed methods are therefore language independent (not limited to applications in a particular language) and need no prior training.
With digital recording technology becoming inexpensive and popular, there has been a tremendous increase in the availability of meeting data. For feasible access to this huge amount of audio data, there is a pressing need for an efficient automatic minute taking (AMT) system that enables easier search and retrieval of meeting information.
The first goal of this dissertation is to develop a speaker change detection (SCD) algorithm for segmenting the meeting audio stream into intervals, each containing only one speaker. To evaluate the data separability between different speakers, an SVM training misclassification rate (STMR) is proposed to determine whether two collected speech windows were uttered by the same speaker. Compared with the Bayesian information criterion (BIC) and other commonly used distances (the Kullback-Leibler distance, the generalized likelihood ratio, the Mahalanobis distance, and the Bhattacharyya distance), the STMR identifies speaker changes more effectively with less collected speech data and is thus capable of detecting speaker segments shorter than two seconds, according to experiments on the NIST Rich Transcription 2005 Spring Evaluation (RT-05S) meeting corpus.
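As a rough illustration of the STMR idea, the minimal sketch below trains a binary SVM on the frame-level features of two adjacent speech windows and uses the resulting training misclassification rate as a separability score. This is a stand-in under stated assumptions, not the dissertation's implementation: the linear kernel, the 12-dimensional synthetic features, and the 0.25 decision threshold are illustrative choices only.

```python
# Minimal sketch of an SVM-training-misclassification-rate (STMR) style
# speaker-change test between two adjacent speech windows.
# Assumptions for illustration only (not taken from the dissertation):
# a linear kernel, 12-dimensional stand-in features, and a 0.25 threshold.
import numpy as np
from sklearn.svm import SVC

def stmr(window_a: np.ndarray, window_b: np.ndarray) -> float:
    """Train an SVM to separate the frames of the two windows and return the
    training misclassification rate: a low rate means the windows are easy to
    separate (likely different speakers); a high rate suggests one speaker."""
    X = np.vstack([window_a, window_b])
    y = np.concatenate([np.zeros(len(window_a)), np.ones(len(window_b))])
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X, y)
    return float(np.mean(clf.predict(X) != y))

def is_speaker_change(window_a: np.ndarray, window_b: np.ndarray,
                      threshold: float = 0.25) -> bool:
    # Declare a change point when the two windows are easy to separate.
    return stmr(window_a, window_b) < threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for per-frame speech features of two speakers.
    spk1 = rng.normal(0.0, 1.0, size=(200, 12))
    spk2 = rng.normal(1.5, 1.0, size=(200, 12))
    print("same speaker ->", is_speaker_change(spk1[:100], spk1[100:]))  # expect False
    print("different speakers ->", is_speaker_change(spk1, spk2))        # expect True
```

In this toy setup, frames drawn from the same distribution cannot be separated well (high STMR, no change declared), while frames from two distinct distributions separate almost perfectly (STMR near zero, change declared).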
The second goal of this dissertation is to develop two partial sentence matching algorithms for retrieving spoken sentences: a whole-matching-plane-based (WMPB) algorithm and a column-based row-based (CBRB) algorithm. Users speak sentences as query inputs to obtain similarity ranks over a spoken-sentence database. Even when database sentences only partially match the query sentence, experimental results show that the proposed algorithms run efficiently on a PC, a Samsung S3C2410X embedded evaluation board, and an HP iPAQ H5550 PDA.
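The WMPB and CBRB accumulation schemes themselves are specific to the dissertation; as a generic stand-in, the sketch below only illustrates the frame-level template matching such retrieval builds on: a plain dynamic time warping (DTW) comparison between a spoken query and each database sentence, used to rank the database. The feature dimension, sentence lengths, and function names are assumptions for illustration.

```python
# Minimal DTW sketch for frame-level matching between a spoken query and a
# database sentence, both represented as (frames x dims) feature matrices.
# This is a generic illustration, not the WMPB/CBRB accumulation schemes.
import numpy as np

def dtw_distance(query: np.ndarray, sentence: np.ndarray) -> float:
    """Classic DTW with Euclidean local distance; returns the normalized
    accumulated cost of the best alignment path (lower means more similar)."""
    lq, ls = len(query), len(sentence)
    # Pairwise local distances between every query frame and sentence frame.
    local = np.linalg.norm(query[:, None, :] - sentence[None, :, :], axis=-1)
    acc = np.full((lq + 1, ls + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, lq + 1):
        for j in range(1, ls + 1):
            acc[i, j] = local[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                  acc[i, j - 1],      # deletion
                                                  acc[i - 1, j - 1])  # match
    return acc[lq, ls] / (lq + ls)

def rank_database(query, database):
    """Return database sentence indices sorted from most to least similar."""
    scores = [dtw_distance(query, s) for s in database]
    return np.argsort(scores)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    query = rng.normal(size=(40, 12))
    database = [rng.normal(size=(rng.integers(60, 120), 12)) for _ in range(5)]
    # The last sentence embeds the query between two stretches of other speech.
    database.append(np.vstack([rng.normal(size=(30, 12)), query, rng.normal(size=(30, 12))]))
    print("ranking (last sentence embeds the query):", rank_database(query, database))
```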
For the third goal of this dissertation, a feature-based spoken sentence retrieval (SSR) algorithm using two-level matching is proposed to reduce the computational load by roughly a factor of lq (the number of frames in a query). In the first level, a similar frame tagging scheme locates segments of the spoken sentences that may be similar to the user's query utterance. In the second level, a fine similarity between the query and each candidate segment is evaluated. In addition to the conventional Mel frequency cepstrum coefficients (MFCCs), several MPEG-7 audio low-level descriptors (LLDs) are used as features to explore their usefulness for SSR. Experimental results reveal that the retrieval performance using the MPEG-7 audio LLDs is close to that of the MFCCs (within about 4% in precision). Moreover, combining the MPEG-7 audio LLDs with the MFCCs improves the retrieval precision. Because the comparison is carried out at the feature level, none of the proposed SSR algorithms requires an acoustic or language model, so the proposed methods are language independent and training free.
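A rough sketch of the two-level idea under stated assumptions: level one tags database frames that lie close to the query centroid (one cheap comparison per database frame, which is where a reduction on the order of lq comes from) and groups runs of tagged frames into candidate segments; level two rescores only those candidates with a full DTW comparison. The centroid-based tagging, the thresholds, and the helper names are illustrative, not the dissertation's exact scheme.

```python
# Minimal sketch of a two-level spoken sentence retrieval pass:
# level 1 tags database frames close to the query centroid and groups runs of
# tagged frames into candidate segments; level 2 rescores only those candidates
# with a full frame-by-frame (DTW-style) comparison.
# All thresholds and names are illustrative assumptions.
import numpy as np

def tag_candidates(query, sentence, tag_threshold=5.0, min_run=5):
    """Level 1: cheap per-frame test against the query centroid
    (one comparison per database frame instead of one per query frame)."""
    centroid = query.mean(axis=0)
    tagged = np.linalg.norm(sentence - centroid, axis=1) < tag_threshold
    segments, start = [], None
    for i, t in enumerate(np.append(tagged, False)):
        if t and start is None:
            start = i
        elif not t and start is not None:
            if i - start >= min_run:
                segments.append((start, i))
            start = None
    return segments

def fine_score(query, segment):
    """Level 2: full DTW between the query and one candidate segment."""
    lq, ls = len(query), len(segment)
    local = np.linalg.norm(query[:, None, :] - segment[None, :, :], axis=-1)
    acc = np.full((lq + 1, ls + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, lq + 1):
        for j in range(1, ls + 1):
            acc[i, j] = local[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[lq, ls] / (lq + ls)

def retrieve(query, sentence):
    """Score a sentence as the best fine score over its candidate segments."""
    segments = tag_candidates(query, sentence)
    if not segments:
        return np.inf
    return min(fine_score(query, sentence[a:b]) for a, b in segments)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    query = rng.normal(size=(40, 12))
    # A "relevant" sentence embedding the query, and an "irrelevant" one.
    relevant = np.vstack([rng.normal(5.0, 1.0, size=(50, 12)), query,
                          rng.normal(5.0, 1.0, size=(50, 12))])
    irrelevant = rng.normal(5.0, 1.0, size=(140, 12))
    print("relevant sentence score:  ", retrieve(query, relevant))
    print("irrelevant sentence score:", retrieve(query, irrelevant))
```

The coarse pass here prunes the irrelevant sentence entirely (no candidate segments), so the expensive DTW comparison is only spent on segments that already look promising.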