
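Graduate student: 黃俊憬
Huang, Jun-Jin
Thesis title: 以MPEG-7低階聲音特徵值為基準之語句搜尋研究
Spoken Sentence Retrieval Based on MPEG-7 Audio Low-Level Descriptors
Advisor: 王駿發
Wang, Jhing-Fa
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of publication: 2003
Academic year of graduation: 91 (ROC calendar)
Language: English
Number of pages: 77
Keywords (Chinese): 語句搜尋 (spoken sentence retrieval), MPEG-7
Keywords (English): Spoken sentence retrieval, MPEG-7 audio low-level descriptors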
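  • This thesis proposes a spoken sentence retrieval system based on MPEG-7 audio low-level descriptors. By not using a conventional large-vocabulary recognizer, we reduce the computational requirements, which makes the system better suited to hand-held devices. Our method has two main parts. First, we find the segments in the spoken data that are close to the user's query. We then use a rank-based method to select the top N of these possible segments. As for the MPEG-7 audio low-level descriptors, we describe their low computational cost, and the retrieval experiments show that their retrieval performance matches that of MFCCs.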

    In this thesis, we propose a speech retrieval system based on MPEG-7 audio low-level descriptors (LLDs). By avoiding a large-vocabulary recognizer, we greatly reduce the computational cost and make the system more suitable for hand-held devices. To this end, we propose a sentence-matching method with two main steps. First, we locate several possible segments in the spoken documents that are similar to the user's query. Second, we rank these candidate segments with a rank-based method and retrieve the top N. In addition, we investigate MPEG-7 audio LLDs as features for spoken sentence retrieval. We show their low-complexity advantage, and the experimental results show that MPEG-7-based features perform comparably to MFCCs (Mel-Frequency Cepstrum Coefficients).
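
    The two steps in the abstract (locating possible segments, then ranking them and keeping the top N) can be illustrated with a minimal sketch. It is not the thesis's implementation: this record gives no formulas, so the Euclidean distance threshold, the window scoring, the non-overlap rule, and every function name below are assumptions made for illustration. The unit and Hamming windows follow the section titles in the table of contents, and ranking by windowed tag score stands in for the rank-based method, which is not described here.

```python
import math


def tag_similar_frames(query, document, threshold):
    """Tag a document frame with 1 if it lies within `threshold`
    (Euclidean distance, an assumed measure) of any query frame."""
    return [
        1 if any(math.dist(q, d) <= threshold for q in query) else 0
        for d in document
    ]


def make_window(length, kind="unit"):
    """Return a unit (rectangular) or Hamming window of `length` samples."""
    if kind == "unit" or length == 1:
        return [1.0] * length
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * i / (length - 1))
        for i in range(length)
    ]


def score_segments(tags, win):
    """Slide the window over the tag sequence and score every start index."""
    m = len(win)
    return [
        (sum(tags[s + i] * win[i] for i in range(m)), s)
        for s in range(len(tags) - m + 1)
    ]


def top_segments(tags, win, top_n):
    """Keep the top_n highest-scoring, non-overlapping segments.
    Returns (start, end, score) tuples, best first."""
    scored = sorted(score_segments(tags, win), key=lambda t: (-t[0], t[1]))
    chosen = []
    for score, start in scored:
        if score <= 0 or len(chosen) == top_n:
            break
        # Assumed rule: a candidate must not overlap an already chosen one.
        if all(abs(start - c) >= len(win) for _, c in chosen):
            chosen.append((score, start))
    return [(start, start + len(win), score) for score, start in chosen]


if __name__ == "__main__":
    # Toy one-dimensional "feature frames"; real features would be
    # MPEG-7 LLD or MFCC vectors.
    query = [(0.0,), (1.0,)]
    document = [(5.0,), (0.1,), (0.9,), (5.0,), (5.0,), (1.0,), (5.0,)]
    tags = tag_similar_frames(query, document, threshold=0.2)
    print(top_segments(tags, make_window(len(query)), top_n=2))
```

    Working on a 0/1 tag sequence instead of aligning the query against every document position keeps the per-position cost to one windowed sum, which is in the spirit of the low-complexity goal stated above. Passing `make_window(n, "hamming")` weights tags near the centre of a candidate segment more heavily than those at its edges.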

    ABSTRACT
    ACKNOWLEDGEMENT
    CONTENTS
    LIST OF FIGURES
    LIST OF TABLES
    CHAPTER 1. INTRODUCTION
        1.1. BACKGROUND
        1.2. MOTIVATION
        1.3. OUTLINES OF THIS THESIS
    CHAPTER 2. MPEG-7 AUDIO LOW-LEVEL DESCRIPTORS
        2.1. AUDIO SPECTRUM DESCRIPTORS
            2.1.1. Audio Spectrum Envelope Descriptors
            2.1.2. Audio Spectrum Centroid Descriptors
            2.1.3. Audio Spectrum Spread Descriptors
            2.1.4. Audio Spectrum Flatness Descriptors
        2.2. TIMBRE DESCRIPTORS
            2.2.1. Harmonic Peaks Detection
            2.2.2. Harmonic Spectral Centroid Descriptors
            2.2.3. Harmonic Spectral Spread Descriptors
            2.2.4. Harmonic Spectral Deviation Descriptors
            2.2.5. Harmonic Spectral Variation Descriptors
        2.3. THE FEASIBILITY FOR ADOPTING MPEG-7 AUDIO LLDS TO DESCRIBING SPEECH SOUND
            2.3.1. Computation Complexity of MPEG-7 Audio Spectrum and Instantaneous Harmonic Descriptors
            2.3.2. Keywords/Sentences Matching Results
    CHAPTER 3. SYSTEM ARCHITECTURE OVERVIEW
        3.1. APPLICABLE AUDIO/SPEECH FEATURES FOR SPEECH EXTRACTION
        3.2. SIMILAR FRAMES TAGGING
        3.3. POSSIBLE SEGMENTS EXTRACTION
            3.3.1. Method 1: Using a Unit Window
            3.3.2. Method 2: Using a Hamming Window
        3.4. POSSIBLE SEGMENTS RANKING
        3.5. OUTPUT OF CORRESPONDING SENTENCES
        3.6. COMPUTATIONAL ANALYSIS
    CHAPTER 4. EXPERIMENTAL RESULTS
        4.1. DEMONSTRATION SYSTEM INTERFACE
        4.2. RETRIEVAL RESULTS OF SINGLE FEATURE USING KEYWORD QUERIES
        4.3. RETRIEVAL RESULTS OF COMBINATION OF THE FEATURES USING KEYWORD QUERIES
        4.4. RETRIEVAL RESULTS OF COMBINATION OF THE FEATURES USING A SENTENCE
    CHAPTER 5. CONCLUSIONS AND FUTURE WORKS
    REFERENCES
    APPENDIX

    List of Figures
    FIGURE 2.1 ILLUSTRATION OF AUDIO SPECTRUM ENVELOPE BANDS [18]
    FIGURE 2.2 TIMBRE HARMONIC DESCRIPTORS ESTIMATION
    FIGURE 2.3 HARMONIC PEAKS DETECTION
    FIGURE 2.4 COMPUTATIONAL COMPLEXITY OF FRAME-BASED FEATURES
    FIGURE 3.1 RECORD PROCESS
    FIGURE 3.2 RETRIEVAL PROCESS
    FIGURE 3.3 EXTRACTING FRAMES BY OVERLAPPED HAMMING WINDOWS
    FIGURE 3.4 SIMILAR FRAMES TAGGING
    FIGURE 3.5 PSEUDO-CODE FOR SIMILAR FRAMES TAGGING
    FIGURE 3.6 WINDOW SCANNING
    FIGURE 3.7 UTILIZING UNIT WINDOW SCANNING TO EXTRACT POSSIBLE SEGMENTS
    FIGURE 3.8 PSEUDO-CODE FOR POSSIBLE SEGMENT EXTRACTION BY METHOD 1
    FIGURE 3.9 THE TAGGED DATA AFTER CONVOLUTION WITH A HAMMING WINDOW
    FIGURE 3.10 PSEUDO-CODE FOR POSSIBLE SEGMENT EXTRACTION BY METHOD 2
    FIGURE 3.11 AN EXAMPLE FOR OVERALL RETRIEVAL PROCESS
    FIGURE 3.12 THE DIRECTLY MATCHING METHOD
    FIGURE 4.1 THE DEMO INTERFACE OF THE SENTENCE RETRIEVAL SYSTEM
    FIGURE 4.2 OPEN THE TARGET DATABASE
    FIGURE 4.3 LOAD THE QUERY KEYWORD
    FIGURE 4.4 THE RETRIEVAL RESULTS
    FIGURE 4.5 PRECISION-RECALL RELATION OF METHOD 1 BY USING A SINGLE FEATURE
    FIGURE 4.6 PRECISION-RECALL RELATION OF METHOD 2 BY USING A SINGLE FEATURE
    FIGURE 4.7 PRECISION-RECALL RELATION OF DIRECT MATCHING METHOD BY USING A SINGLE FEATURE
    FIGURE 4.8 PRECISION-RECALL RELATION OF METHOD 1 BY USING MULTI-FEATURE
    FIGURE 4.9 PRECISION-RECALL RELATION OF METHOD 2 BY USING MULTI-FEATURE
    FIGURE 4.10 PRECISION-RECALL RELATION OF THE DIRECT MATCHING METHOD BY USING MULTI-FEATURE

    List of Tables
    TABLE 2.1 BAND OVERLAPS
    TABLE 2.2 COMPUTATIONAL COMPLEXITY OF FRAME-BASED FEATURES
    TABLE 2.3 MAP FOR SENTENCE MATCHING ON (A) NAMES, ABOUT 1 S, (B) ORAL SENTENCES, ABOUT 3 S, (C) NEWS TITLES, ABOUT 5 S
    TABLE 3.1 CHARACTERISTIC OF OUR PROPOSED RETRIEVAL SYSTEM
    TABLE 3.2 SPECIFICATIONS OF FEATURE EXTRACTION
    TABLE 3.3 COMPUTATIONAL COMPLEXITY OF THE DIRECTLY MATCHING METHOD
    TABLE 3.4 THE COMPUTATIONAL COMPLEXITY OF OUR PROPOSED METHOD
    TABLE 3.5 THE AVERAGE NUMBERS OF QUERY FRAMES, SENTENCE FRAMES AND POSSIBLE SEGMENTS
    TABLE 3.6 AN EXAMPLE FOR COMPUTATIONAL COMPLEXITY
    TABLE 4.1 SPECIFICATION OF THE QUERIES AND TESTING SPEECH DATA USED IN THIS EXPERIMENT
    TABLE 4.2 THE RETRIEVAL RESULTS OF METHOD 1 AND METHOD 2
    TABLE 4.3 THE RETRIEVAL RESULTS OF COMBINATION OF THE FEATURES AND MFCC
    TABLE 4.4 THE RESULTS OF USING SENTENCES AS QUERIES

    [1] Berlin Chen; Hsin-min Wang; Lin-shan Lee, “Discriminating capabilities of syllable-based features and approaches of utilizing them for voice retrieval of speech information in Mandarin Chinese”, IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 5, Jul. 2002, pp. 303-314.
    [2] Meng, H.M.; Pui Yu Hui, “Spoken document retrieval for the languages of Hong Kong”, Proceedings of 2001 International Symposium on Intelligent Multimedia, Video and Speech Processing, 2001, pp. 201-204.
    [3] Johnson, S.E.; Jones, K.S.; Jourlin, P.; Moore, G.L.; Woodland, P.C., “The Cambridge University spoken document retrieval system”, Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '99), Vol. 1, 15-19 Mar. 1999, pp. 49-52.
    [4] Matthew A. Siegler, “Integration of Continuous Speech Recognition and Information Retrieval for Mutually Optimal Performance”, Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, 15 December 1999.
    [5] Ng, K.; Zue, V.W., “Phonetic recognition for spoken document retrieval”, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '98), Vol. 1, 12-15 May 1998, pp. 325-328.
    [6] Wechsler, “Spoken Document retrieval based on phoneme recognition”, dissertation submitted to the Swiss Federal Institute of Technology (ETH) Zurich, 1998.
    [7] J. Foote, “An overview of audio information retrieval”, ACM Multimedia Systems, Vol. 7, pp. 2-10, 1999.
    [8] Savitha Srinivasan; Dragutin Petkovic, “Phonetic confusion matrix based spoken document retrieval”, Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, 2000.
    [9] Amit Singhal; Fernando Pereira, “Document expansion for speech retrieval”, Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, California, United States, 1999.
    [10] Fabio Crestani (Univ. of Strathclyde, Glasgow, Scotland), “Towards the use of prosodic information for spoken document retrieval”, Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, Louisiana, United States, 2001.
    [11] H.K. Xie, “A Study on Voice Caption Search for Arbitrarily Defined Keywords”, Master Thesis, National Taiwan University of Science and Technology, Taiwan, R.O.C., July 2000.
    [12] Itoh, Y., “A matching algorithm between arbitrary sections of two speech data sets for speech retrieval”, Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), Vol. 1, 2001, pp. 593-596.
    [13] Martinez, J.M.; Koenen, R.; Pereira, F., “MPEG-7: the generic multimedia content description standard, part 1”, IEEE Multimedia, Vol. 9, No. 2, April-June 2002, pp. 78-87.
    [14] Martinez, J.M., “Standards - MPEG-7 overview of MPEG-7 description tools, part 2”, IEEE Multimedia, Vol. 9, No. 3, Jul.-Sept. 2002, pp. 83-93.
    [15] Avaro, O.; Salembier, P., “MPEG-7 Systems: overview”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 6, June 2001, pp. 760-764.
    [16] Hunter, J., “An overview of the MPEG-7 description definition language (DDL)”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 6, June 2001, pp. 765-772.
    [17] Salembier, P.; Smith, J.R., “MPEG-7 multimedia description schemes”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 6, June 2001, pp. 748-759.
    [18] “ISO/IEC FDIS 15938-4, Multimedia Content Description Interface, Part 4: Audio”.
    [19] Paliwal, K.K., “Spectral subband centroid features for speech recognition”, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '98), Vol. 2, 12-15 May 1998, pp. 617-620.
    [20] Y. S. Weng, “The chip design of Mel frequency cepstrum coefficient for HMM Speech Recognition”, Master Thesis, National Cheng Kung University, Taiwan, R.O.C., June 1998.
    [21] R. Baeza-Yates; B. Ribeiro-Neto, “Modern Information Retrieval”, New York: ACM Press, 1999.
    [22] L. Rabiner; B.-H. Juang, “Fundamentals of Speech Recognition”, Prentice Hall, 1993.

    On campus: available immediately
    Off campus: available from 2003-08-14