| Graduate student: | 李偉銓 Lee, Wei-Chuan |
|---|---|
| Thesis title: | 應用語音註解與音節轉換影像於照片檢索 Photo Retrieval via Speech Annotation Using Syllable-Transformed Images |
| Advisor: | 吳宗憲 Wu, Chung-Hsien |
| Degree: | 碩士 Master |
| Department: | 電機資訊學院 - 資訊工程學系 Department of Computer Science and Information Engineering |
| Year of publication: | 2006 |
| Graduation academic year: | 94 |
| Language: | Chinese |
| Number of pages: | 66 |
| Chinese keywords: | 音節轉換影像、外辭彙、多元尺度化 |
| English keywords: | Out-of-vocabulary, Multidimensional Scaling, Syllable-transformed image |
| Usage counts: | Views: 105, Downloads: 1 |
Digital cameras have become widespread in recent years, and the number of photos users take in daily life has grown sharply. Users need to find the photos they want quickly and accurately within these large collections, so there is a growing need for more sophisticated and efficient photo retrieval methods.
This thesis proposes a speech-based photo retrieval method. It targets two problems: out-of-vocabulary (OOV) words in spoken information retrieval and recognition errors in speech recognition. The method assumes that the same or similar syllables produce similar syllable candidate patterns. Under this assumption, the sequence of top-n syllable candidate patterns produced by the recognizer is transformed into a sequence of syllable-transformed images using multidimensional scaling (MDS). Vector quantization then maps each syllable-transformed image to a codeword, which serves as a codeword-level indexing feature. Unsupervised speaker adaptation is also adopted to improve the syllable recognition rate. Finally, a vector space model (VSM) that combines word-, syllable-, and codeword-level indexing features is investigated for photo retrieval.
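The quantization and retrieval steps described above can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes each syllable-transformed image has already been reduced to a short numeric vector, uses a hypothetical three-entry codebook, and combines the three indexing levels as a weighted sum of cosine similarities. The names `quantize`, `cosine`, `combined_score`, the weights, and the sample tokens are all illustrative assumptions.

```python
import math
from collections import Counter


def quantize(vector, codebook):
    """Return the index of the codeword nearest to `vector` (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_score(query, doc, weights):
    """Weighted sum of per-level cosine similarities (assumed combination rule)."""
    return sum(
        w * cosine(Counter(query[level]), Counter(doc[level]))
        for level, w in weights.items()
    )


# Hypothetical codebook and 2-D stand-ins for syllable-transformed images.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
query_images = [(0.1, -0.1), (0.9, 1.2)]
doc_images = [(0.2, 0.1), (1.1, 0.8), (3.8, 0.3)]

query = {
    "word": ["beach"],
    "syllable": ["hai", "tan"],
    "codeword": [quantize(v, codebook) for v in query_images],
}
doc = {
    "word": [],                    # the word was OOV, so no word-level match
    "syllable": ["hai", "dan"],    # one syllable misrecognized
    "codeword": [quantize(v, codebook) for v in doc_images],
}

# Illustrative weights; the thesis does not state them in this abstract.
weights = {"word": 0.4, "syllable": 0.3, "codeword": 0.3}
score = combined_score(query, doc, weights)
print(f"combined similarity: {score:.4f}")
```

In this toy case the word level contributes nothing and the syllable level only partially matches, while the codeword level still matches, which is the kind of robustness the codeword-level feature is meant to provide.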
Experiments were conducted on a collection of 800 speech-annotated digital photos, with mean average precision (mAP) as the measure of retrieval performance. The conventional method (word- and syllable-level indexing features) achieves an mAP of 77%. Combining the conventional method with the proposed codeword-level indexing feature raises mAP to 80%, a statistically significant improvement of 3 percentage points.
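For readers unfamiliar with the metric, the following is a minimal sketch of how mAP is commonly computed from ranked retrieval lists. The photo identifiers and relevance judgments are made-up placeholders, not data from the thesis, and the function names are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, photo_id in enumerate(ranked_ids, start=1):
        if photo_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of the per-query average precision over all queries."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


# Two hypothetical queries: (ranked result list, set of relevant photos).
runs = [
    (["p3", "p1", "p7", "p2"], {"p3", "p2"}),  # AP = (1/1 + 2/4) / 2 = 0.75
    (["p5", "p9", "p4"], {"p9"}),              # AP = (1/2) / 1 = 0.5
]
map_score = mean_average_precision(runs)
print(f"mAP = {map_score:.3f}")  # mean of 0.75 and 0.5
```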