| Author: | 陳智仁 (Chen, Chih-Jen) |
|---|---|
| Thesis Title: | 整合視覺特徵與語音資訊之視訊註解方法 (Video Annotation by Using Visual and Speech Features) |
| Advisor: | 曾新穆 (Tseng, Shin-Mu) |
| Degree: | Master |
| Department: | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2006 |
| Graduation Academic Year: | 94 (AY 2005–2006) |
| Language: | Chinese |
| Number of Pages: | 72 |
| Chinese Keywords: | Video Annotation, Association Rules, Statistics-Based Prediction Model, Fusion, Data Mining |
| English Keywords: | Video Annotation, Statistics-Based Model, Rule-Based Model, Fusion, Association Rule |
Abstract
Video carries multiple kinds of features, such as image, audio, and text, so the high-level semantic concepts implicit in it are correspondingly complex. If automatic annotation relies on visual features alone, higher-level semantics (close to natural language) cannot be fully derived from low-level image features. Conversely, although speech information is much closer to natural human language, annotating with speech alone risks cases where the speech content of a video segment is unrelated to its visual content, producing incorrect annotations. To narrow the gap between low-level features and high-level semantic concepts, we propose a method that integrates visual features and speech information. The two sources are processed separately to construct two prediction models: a statistics-based model, ModelCRM, and an association-rule-based model, ModelSAR. The probability lists produced by the two models are then combined through fusion to improve the accuracy of video annotation. In the experimental analysis, we use a public video dataset (TRECVID 2003), and the results confirm that our method indeed achieves good prediction performance.
ABSTRACT
Video is composed of various types of multimedia data, such as image, audio, and text, so the implicit high-level concepts hidden in it are highly complex. Accordingly, it is hard to capture high-level semantics by analyzing visual features alone. In contrast, automatic video annotation may produce mismatches between a shot and its speech if only speech information is considered. To reduce the gap between low-level features and high-level concepts, we propose an approach that integrates visual features and speech information to build two prediction models, namely ModelCRM (statistics-based) and ModelSAR (rule-based). By fusing the probability lists generated by the two prediction models, the approach effectively enhances the precision of video annotation. Experimental evaluation on the well-known TRECVID 2003 dataset shows that the proposed approach delivers higher precision than existing methods.
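To make the fusion step concrete, the following is a minimal sketch of late fusion of two per-concept probability lists, in the spirit of the approach described above. It is not the thesis's actual algorithm: the linear weighting scheme, the parameter `alpha`, the function name `fuse_probability_lists`, and the toy scores are all illustrative assumptions.

```python
# Minimal sketch of late fusion of two annotation models' probability lists.
# NOT the thesis's actual method: the linear weighting rule, the parameter
# `alpha`, and the toy scores below are illustrative assumptions only.

def fuse_probability_lists(p_visual, p_speech, alpha=0.5):
    """Combine per-concept scores from a visual model (e.g. a statistics-based
    model such as ModelCRM) and a speech model (e.g. a rule-based model such
    as ModelSAR) into a single ranked annotation list.

    p_visual, p_speech: dicts mapping concept -> probability.
    alpha: weight given to the visual model (an assumed tuning parameter).
    """
    concepts = set(p_visual) | set(p_speech)
    fused = {
        c: alpha * p_visual.get(c, 0.0) + (1.0 - alpha) * p_speech.get(c, 0.0)
        for c in concepts
    }
    # Rank concepts by fused score; the top-k become the shot's annotations.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy per-concept scores for one video shot (illustrative only).
    visual = {"sky": 0.70, "building": 0.55, "person": 0.20}
    speech = {"person": 0.60, "sports": 0.40, "sky": 0.10}
    for concept, score in fuse_probability_lists(visual, speech, alpha=0.6)[:3]:
        print(f"{concept}: {score:.2f}")
```

A weighted linear combination is only the simplest plausible fusion rule; the thesis's scheme may differ in how the two probability lists are normalized and how the model weights are chosen.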