Graduate student: 李偉銓
Lee, Wei-Chuan
Thesis title: 應用語音註解與音節轉換影像於照片檢索
Photo Retrieval via Speech Annotation Using Syllable-Transformed Images
Advisor: 吳宗憲
Wu, Chung-Hsien
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2006
Academic year of graduation: 94 (ROC calendar)
Language: Chinese
Number of pages: 66
Keywords (Chinese): 音節轉換影像, 外辭彙, 多元尺度化
Keywords (English): Out-of-vocabulary, Multidimensional Scaling, Syllable-transformed image
      With the rapid spread of digital cameras in recent years, the number of digital photos that users take in daily life has grown sharply. Users need to find the photos they want quickly and accurately within these large collections, so there is a growing need for more sophisticated and efficient photo retrieval methods.
      This thesis proposes a speech-based photo retrieval method. To deal with out-of-vocabulary (OOV) words in spoken information retrieval and with errors made by the speech recognizer, and on the assumption that the same or similar syllables yield similar syllable candidate patterns, the recognized top-n syllable candidate pattern sequence is transformed into a syllable-transformed image sequence using multidimensional scaling (MDS). Vector quantization is then applied to quantize each syllable-transformed image into a codeword, which serves as a codeword-level indexing feature. In addition, unsupervised speaker adaptation is adopted to improve the syllable recognition rate. Finally, a vector space model (VSM) combining word-, syllable-, and codeword-level indexing features is investigated for photo retrieval.
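      The last two steps described above (quantizing each syllable-transformed image to a codeword, then scoring photos with a multi-level vector space model) can be sketched roughly as follows. This is an illustration only, not the thesis implementation: it assumes MDS has already mapped each syllable to a low-dimensional point, and the codebook, the per-level weights, the linear combination of cosine similarities, and the toy annotations are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical per-level weights; the abstract does not state the values used.
WEIGHTS = {"word": 0.5, "syllable": 0.3, "codeword": 0.2}


def nearest_codeword(vector, codebook):
    """Return the index of the codebook entry closest to `vector`
    (squared Euclidean distance), i.e. plain vector quantization."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vector, codebook[i])),
    )


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_score(query, photo, weights=WEIGHTS):
    """Weighted sum of the cosine similarities computed at each indexing level.
    The linear combination is an assumption made for this sketch."""
    return sum(
        w * cosine(Counter(query[level]), Counter(photo[level]))
        for level, w in weights.items()
    )


# Toy codebook and MDS output points (all values invented for illustration).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
query_points = [(0.9, 1.2), (0.1, 1.8)]

query = {
    "word": ["beach"],
    "syllable": ["hai", "tan"],
    "codeword": [nearest_codeword(p, codebook) for p in query_points],
}
photos = {
    "photo_a": {"word": ["beach"], "syllable": ["hai", "tan"], "codeword": [1, 2]},
    "photo_b": {"word": ["mountain"], "syllable": ["shan"], "codeword": [0]},
}

# Rank photos by descending combined similarity to the spoken query.
ranking = sorted(
    photos, key=lambda name: combined_score(query, photos[name]), reverse=True
)
print(ranking)
```

      Because the codeword level is matched independently of the word and syllable levels, a photo can still score above zero when the recognizer outputs the wrong word or syllable but the candidate pattern falls into the same codeword.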
      Experiments were conducted on a collection of 800 speech-annotated digital photos, with mean average precision (mAP) as the retrieval performance measure. The results show an mAP of 77% for the conventional method (word- and syllable-level indexing features) and 80% when the conventional method is combined with the proposed codeword-level indexing feature, an improvement of 3 percentage points in mAP.
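      For readers unfamiliar with the measure, the following is a minimal sketch of how mAP is commonly computed from ranked retrieval results. The photo identifiers and relevance sets are invented for illustration and are not taken from the thesis experiments.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked result list.

    Precision is sampled at the rank of each relevant item that is retrieved,
    and the sum is divided by the total number of relevant items.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of the average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


# Two toy queries with hypothetical photo identifiers.
runs = [
    (["p3", "p1", "p7", "p2"], {"p1", "p2"}),  # AP = (1/2 + 2/4) / 2 = 0.5
    (["p5", "p6"], {"p5"}),                    # AP = 1.0
]
print(mean_average_precision(runs))  # 0.75
```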

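    Chinese Abstract
    English Abstract
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
        Section 1 Research Background and Motivation
        Section 2 Literature Review
        Section 3 Research Objectives
        Section 4 Overview of the Research Method
        Section 5 Chapter Outline
    Chapter 2 System Architecture
    Chapter 3 Unsupervised Speaker Adaptation
        Section 1 Syllable-Based Confidence Measure
        Section 2 Speaker Adaptation Methods
            3.2.1 Maximum A Posteriori (MAP) Adaptation
            3.2.2 Maximum Likelihood Linear Regression (MLLR) Adaptation
            3.2.3 Combining MLLR and MAP Adaptation
    Chapter 4 Syllable-Transformed Images
        Section 1 Syllable Transformation
            4.1.1 Syllable Candidate Pattern Matching
            4.1.2 Image-Like Units
        Section 2 Syllable Confusion Analysis
            4.2.1 Sub-Syllable Distance Measure
            4.2.2 Syllable Distance Measure
        Section 3 Multidimensional Scaling
            4.3.1 Differences Between Multidimensional Scaling and Factor Analysis
            4.3.2 Metric Multidimensional Scaling
            4.3.3 Non-Metric Multidimensional Scaling
        Section 4 Syllable-Transformed Images
        Section 5 Vector Quantization of Syllable-Transformed Images
    Chapter 5 Speech-Based Photo Indexing and Retrieval
        Section 1 Photo Indexing
        Section 2 Retrieval Model and Similarity Measure
    Chapter 6 Experimental Results and Discussion
        Section 1 Experimental Setup
            6.1.1 Experimental Corpus
            6.1.2 Speech Recognizer
            6.1.3 Dimensionality of Image-Like Units
        Section 2 Unsupervised Speaker Adaptation Experiments
        Section 3 Photo Retrieval Performance Evaluation
            6.3.1 Evaluation Criterion for Retrieval Performance
            6.3.2 Codebook Size Experiments
            6.3.3 Top-n Syllable Candidate Experiments
            6.3.4 Indexing Weight Experiments for Each Level
            6.3.5 Metric and Non-Metric Multidimensional Scaling Experiments
    Chapter 7 Conclusions and Future Work
    References
    Appendix A Derivation of Maximum A Posteriori Adaptation
    Author's Biography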

    Full-text availability: on campus, available immediately; off campus, available from 2006-07-17.