
Author: Chen, Chih-Jen (陳智仁)
Title: Video Annotation by Using Visual and Speech Features
(整合視覺特徵與語音資訊之視訊註解方法)
Advisor: Tseng, Shin-Mu (曾新穆)
Degree: Master
Department: Department of Computer Science and Information Engineering,
College of Electrical Engineering and Computer Science
Year of Publication: 2006
Graduating Academic Year: 94 (ROC calendar)
Language: Chinese
Pages: 72
Chinese Keywords: Video Annotation, Association Rules, Statistics-Based Prediction Model, Fusion, Data Mining
English Keywords: Video Annotation, Statistics-Based Model, Rule-Based Model, Fusion, Association Rule
Views: 99; Downloads: 1
  • Abstract
    Video carries multiple feature modalities, such as image, audio, and text, so the high-level semantic concepts implicit in it are correspondingly complex. Automatic annotation based on visual features alone cannot fully derive higher-level semantics (close to natural language) from low-level image features. Conversely, although speech content is closer to human natural language, annotating with speech information alone risks cases where a shot's speech is unrelated to its visual content, producing incorrect annotations. To narrow the gap between low-level features and high-level concepts, we propose a method that integrates visual features and speech information. The two sources are processed separately to construct two prediction models: a statistics-based model, ModelCRM, and an association-rule-based model, ModelSAR. The probability lists produced by the two models are then combined through fusion to improve the precision of video annotation. In the final experimental analysis, we use a public video dataset (TRECVID 2003), and the results confirm that our method achieves good predictive performance.

    ABSTRACT
    Video is composed of multiple types of multimedia data, such as image, audio, and text, so the high-level concepts implicit in it are highly complex. Accordingly, it is hard to capture high-level semantics by analyzing only the visual features. In contrast, if only speech information is considered, automatic video annotation may produce mismatches between a shot and its speech. To reduce the gap between low-level features and high-level concepts, we propose an approach that integrates visual features and speech information to build two prediction models, namely ModelCRM (statistics-based) and ModelSAR (rule-based). By fusing the probability lists generated by the two models, the approach effectively enhances the precision of video annotation. Through experimental evaluation on the well-known TRECVID 2003 dataset, our approach was shown to deliver higher precision than other existing methods.
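    The abstract describes fusing the probability lists of the two prediction models to produce the final annotations. A minimal sketch of one common way to do this is a weighted linear combination of the two label-probability lists, followed by renormalization and top-k selection. The function names, the mixing weight `alpha`, and the linear scheme are illustrative assumptions, not the thesis's exact formulation.

    ```python
    def fuse(p_crm, p_sar, alpha=0.5):
        """Linearly combine two {label: probability} maps.

        alpha weights the statistics-based model (hypothetical ModelCRM output);
        (1 - alpha) weights the rule-based model (hypothetical ModelSAR output).
        """
        labels = set(p_crm) | set(p_sar)
        fused = {w: alpha * p_crm.get(w, 0.0) + (1.0 - alpha) * p_sar.get(w, 0.0)
                 for w in labels}
        total = sum(fused.values()) or 1.0  # renormalize so probabilities sum to 1
        return {w: v / total for w, v in fused.items()}

    def annotate(p_crm, p_sar, k=3, alpha=0.5):
        """Return the k highest-probability labels from the fused list."""
        fused = fuse(p_crm, p_sar, alpha)
        return [w for w, _ in sorted(fused.items(), key=lambda x: -x[1])[:k]]
    ```

    For example, if the visual model favors "sky" while the speech model favors "boat", the fused list surfaces both, which is the intended benefit of combining the two modalities.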

    Table of Contents
    English Abstract / Chinese Abstract / Acknowledgements / Table of Contents / List of Tables / List of Figures
    Chapter 1  Introduction
      1.1 Research Objectives
      1.2 Problem Description
      1.3 Research Method
      1.4 Contributions
      1.5 Thesis Organization
    Chapter 2  Related Work
      2.1 Shot Detection & Keyframe Extraction
      2.2 Video Composition
      2.3 Low-Level Image Feature Extraction
      2.4 Keyframe-Based Annotation Techniques
      2.5 Automatic Speech Recognition (ASR)
      2.6 Text Processing Techniques
      2.7 Association Rule Mining
        2.7.1 Definition of Association Rules
        2.7.2 Purpose of Association Rule Mining
        2.7.3 Association Rule Mining Methods
        2.7.4 The Apriori Algorithm
    Chapter 3  Research Method
      3.1 Method Architecture
      3.2 Training Stage
        3.2.1 Construction of ModelCRM
        3.2.2 Construction of ModelSAR
      3.3 Prediction Stage
        3.3.1 Annotation by ModelCRM
        3.3.2 Annotation by ModelSAR
        3.3.3 Fusion of ModelCRM and ModelSAR
    Chapter 4  Experimental Analysis
      4.1 Experimental Data
      4.2 Evaluation Metrics
      4.3 Experimental Design
        4.3.1 Parameter Tuning for ModelCRM
        4.3.2 Prediction Experiments with ModelCRM
        4.3.3 Prediction Experiments with ModelSAR
        4.3.4 Prediction Experiments Fusing ModelCRM and ModelSAR
      4.4 Comparison of Prediction Models
      4.5 Supplementary Experiments
      4.6 Experimental Summary
    Chapter 5  Conclusions and Future Work
      5.1 Conclusions
      5.2 Future Work
    References
    About the Author


    On-campus access: available from 2007-08-29
    Off-campus access: available from 2007-08-29