| 研究生 (Graduate student): | 陳震宇 Chen, Chen-Yu |
|---|---|
| 論文名稱 (Thesis title): | 新聞、運動與監控視訊摘要方法之研究 A Study on Video Summarization Approaches for News, Sports, and Surveillance |
| 指導教授 (Advisor): | 王駿發 Wang, Jhing-Fa |
| 學位類別 (Degree): | 博士 Doctor |
| 系所名稱 (Department): | 電機資訊學院 - 電機工程學系 Department of Electrical Engineering |
| 論文出版年 (Publication year): | 2008 |
| 畢業學年度 (Graduation academic year): | 96 |
| 語文別 (Language): | 英文 English |
| 論文頁數 (Pages): | 90 |
| 中文關鍵詞 (Chinese keywords): | 視訊摘要、分散式新聞影片伺服器、基於混亂度之運動特徵萃取、異質性錯誤模式、狀態轉移之支持向量機、人類行為辨識 |
| 外文關鍵詞 (English keywords): | Human Behavior Identification, State Transition Support Vector Machine, Video Summarization, Distributed News Video Servers, Entropy-based Motion Feature Extraction, Heteroscedastic Error Model |
| 相關次數 (Hits): | 點閱 (views): 167 下載 (downloads): 5 |
With the arrival of the multimedia era, video summarization has become an important topic: by browsing a summary video, users can quickly grasp what the original video intends to convey. This dissertation proposes methods for the video genres in widest demand today: a summarization method for news video, a summarization method for sports video, and an abnormal-behavior identification method for surveillance video summarization.
For news video summarization, this dissertation proposes a distributed architecture comprising news video processing servers and a video querying/browsing server. Through the cooperation of these two kinds of servers, the storage required for news video, the computation needed to generate news story abstracts, and the network bandwidth required to transmit video over the Internet can all be reduced.
In sports video, the motion of objects (players) is usually the key cue for summarization; however, judging whether a segment contains an event from motion alone is unreliable, because camera movement also produces large amounts of motion and often misleads the system. To address this problem, this dissertation proposes an entropy-based motion feature that suppresses the motion variation caused by camera movement. The feature considers not only motion magnitude but also the entropy (disorder) of the motion vectors in the video: a segment whose motion vectors have high entropy is more likely to contain an important event, and vice versa.
Surveillance systems are widely deployed in public places and in home security monitoring. Because such systems store the recorded video, they require large-capacity storage, and surveillance video summarization arose to reduce this demand. Within a surveillance summarization system, abnormal-behavior detection is especially important: when a special event occurs, the system can issue a warning or store only the video surrounding that event, further reducing storage requirements. Accordingly, this dissertation develops a human behavior identification system for detecting abnormal human behavior. The system consists of several successive states, each of which recognizes one segment of a specific behavior; the relations between states are defined by a Markov random field. By combining several states, the system can recognize surveillance video whose content varies over time.
It is easy to watch video in its most common representation, namely sequentially. However, browsing, manipulating, and editing video is a tedious process. In this thesis, we investigate a distributed news video server architecture, an entropy-based motion feature, and a state transition SVM for news video summarization, sports video summarization, and surveillance video summarization, respectively.
The distributed news video summarization architecture comprises news video processing (NVP) servers and a video querying/browsing (VQB) server. To reduce the storage required for news video, the computation cost of story abstract generation, and the Internet bandwidth consumed, this work proposes an efficient news video querying/browsing scheme based on distributed news video servers. The system works well because the distributed NVP servers cooperate with the VQB server: each news story abstract is generated by the corresponding NVP server and sent to the VQB server for user querying and browsing.
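A minimal sketch of this division of labor, assuming toy in-memory servers: each NVP server keeps the full videos and generates story abstracts locally, while the VQB server indexes only those compact abstracts, so querying and browsing never move full video across the network. All class and method names here are hypothetical illustrations, not the thesis implementation.

```python
class NVPServer:
    """News video processing server: stores full videos, builds abstracts."""

    def __init__(self, name):
        self.name = name
        self.full_videos = {}  # story_id -> full frame list (stays local)

    def ingest(self, story_id, video, keywords):
        self.full_videos[story_id] = video
        # Abstract generation runs here, distributing the computation
        # across NVP servers instead of burdening the central server.
        return {"story_id": story_id, "server": self.name,
                "keywords": keywords, "abstract": video[:2]}  # toy "summary"


class VQBServer:
    """Video querying/browsing server: indexes only the small abstracts."""

    def __init__(self):
        self.index = []

    def register(self, abstract):
        self.index.append(abstract)

    def query(self, keyword):
        # Users browse abstracts; the full video can later be fetched
        # on demand from the owning NVP server.
        return [a for a in self.index if keyword in a["keywords"]]


nvp = NVPServer("nvp-taipei")
vqb = VQBServer()
vqb.register(nvp.ingest("s1", ["f1", "f2", "f3", "f4"], {"election"}))
vqb.register(nvp.ingest("s2", ["f5", "f6"], {"sports"}))
hits = vqb.query("election")
print([h["story_id"] for h in hits])  # ['s1']
```

The point of the design is that the VQB index stores two-frame abstracts rather than four-frame videos, which is the storage and bandwidth saving the architecture targets.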
The entropy-based motion feature extracted from sports video captures not only motion magnitude but also motion directivity. This novel feature reduces the effect of camera motion. Furthermore, it is effective for summarizing a variety of sports video, such as soccer, tennis, and basketball video. Once a sports video is represented as a sequence of entropy-based motion values, meaningful events can be segmented by the heteroscedastic error model.
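One plausible form of such a feature can be sketched as follows: the entropy of the motion-vector direction histogram, scaled by the mean motion magnitude, so that coherent camera motion (all vectors pointing one way) scores low while chaotic object motion scores high. The function name, bin count, and normalization below are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def entropy_motion_feature(motion_vectors, n_bins=8):
    """Entropy-based motion feature for one frame (a sketch).

    motion_vectors: array-like of shape (N, 2) holding (dx, dy) block
    motion vectors. Combines mean motion magnitude with the entropy of
    the motion-direction histogram, so a global camera pan, whose
    vectors all share one direction, yields a low score.
    """
    mv = np.asarray(motion_vectors, dtype=float)
    magnitudes = np.hypot(mv[:, 0], mv[:, 1])
    moving = magnitudes > 1e-6          # directionless zero vectors ignored
    if not moving.any():
        return 0.0
    angles = np.arctan2(mv[moving, 1], mv[moving, 0])        # [-pi, pi]
    hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = -(p * np.log2(p)).sum()    # 0 .. log2(n_bins)
    directivity = entropy / np.log2(n_bins)                  # normalize to [0, 1]
    return magnitudes.mean() * directivity

# Coherent pan: every vector points right -> feature is 0.
pan = [(5.0, 0.0)] * 64
# Chaotic object motion: random directions -> high feature value.
rng = np.random.default_rng(0)
chaos = rng.normal(0, 5, size=(64, 2))
print(entropy_motion_feature(pan) < entropy_motion_feature(chaos))  # True
```

Under this formulation the large but coherent motion of a camera pan is suppressed exactly as the abstract describes, while disordered motion, which signals a possible event, survives.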
To improve the performance of intelligent surveillance systems, we developed a human behavior identification module that gains efficiency by integrating visual and contour information. The state transition support vector machine (STSVM) is applied to human behavior identification with its continuous, temporal property: the STSVM assumes a human activity is composed of several successive states, each modeled by an individual two-class SVM, with the transition probabilities between consecutive states calculated by Markov random field (MRF) theory.
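The state-composition idea can be sketched as Viterbi-style decoding over per-state classifier scores. In this sketch the per-frame SVM decision values are simulated as a score matrix and the MRF pairwise potentials are reduced to a simple log transition matrix; the state labels, scores, and transition numbers are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def viterbi_decode(state_scores, log_trans):
    """Decode the most likely state sequence for one activity (a sketch).

    state_scores: (T, S) array of per-frame scores, e.g. decision values
        from one two-class SVM per state, treated here as log-likelihoods.
    log_trans: (S, S) log transition weights between consecutive states,
        e.g. derived from MRF-style pairwise potentials.
    """
    T, S = state_scores.shape
    dp = np.zeros((T, S))                # best cumulative score per state
    back = np.zeros((T, S), dtype=int)   # backpointers for path recovery
    dp[0] = state_scores[0]
    for t in range(1, T):
        cand = dp[t - 1][:, None] + log_trans   # cand[i, j]: i -> j
        back[t] = cand.argmax(axis=0)
        dp[t] = cand.max(axis=0) + state_scores[t]
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy activity with states 0: walking, 1: stumbling, 2: lying down,
# and left-to-right transitions favored (a fall-like progression).
scores = np.array([
    [2.0, 0.1, 0.0],
    [1.5, 0.5, 0.0],
    [0.2, 2.0, 0.1],
    [0.0, 0.5, 2.0],
    [0.0, 0.1, 2.5],
])
log_trans = np.log(np.array([
    [0.8, 0.2, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]) + 1e-9)  # small epsilon avoids log(0)
print(viterbi_decode(scores, log_trans))  # [0, 0, 1, 2, 2]
```

Decoding jointly over states is what lets the module exploit temporal continuity: a single noisy frame score cannot flip the label if the transition structure argues against it.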
Each proposed approach and its corresponding experimental results are explained in detail in the body of the dissertation.