
Author: Yeh, Jun-Bin (葉峻賓)
Title: Visual Phrase-Based Multiple Concept Discovery for Event-Related Video Retrieval (應用視覺片語之多概念探知於事件影片檢索之研究)
Advisor: Wu, Chung-Hsien (吳宗憲)
Degree: Doctoral
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2012
Graduation academic year: 100 (AY 2011-2012)
Language: English
Number of pages: 108
Keywords: visual word, visual phrase, multiple angles, graph mining, relevant term, boxplot, visual concept discovery, concept-based visual word clustering, multimedia alignment, IBM model-1, visual language model
    In recent years, with the surge of cloud computing, multimedia content providers such as YouTube, CNN, and the BBC have become ubiquitous. In this era of information explosion, a good online service must retrieve content precisely for its users and reorganize it before presentation. A content-based video retrieval system faces two pressing needs: understanding multimedia content and reorganizing it. A good template is the CNN special report, in which news videos are manually compared and related videos are listed on the same Web page. This dissertation aims to build a special-report-style system that requires no human intervention. To understand video content, robust visual features are used to discover the concepts appearing in each frame. To reorganize the retrieval results, the system extracts event-related terms (relevant terms). The system also generates a meaningful brief preview video for each retrieved video to support fast online previewing.
    In recent years, researchers have used visual concept discovery to bridge the semantic gap in understanding visual content. Visual concepts can be discovered through visual words (VWs), which previous studies have shown to achieve good accuracy. When a frame contains multiple concepts, however, discovery accuracy drops sharply. This dissertation proposes concept-based visual word clustering (CVWC) to discover the concepts appearing in images without prior segmentation. Prior knowledge about concepts is trained from the text surrounding Web images. To obtain a near-optimal clustering, we propose a concept-based genetic algorithm (CBGA) that optimizes the importance of each VW to its assigned concept together with the co-occurrence probability of concept pairs. The CBGA also exploits spatially neighboring VWs to discover concepts. Concept extension (CE) then updates the set of candidate concepts from the current clustering results.
    A visual phrase, a set of visual words, raises the accuracy of visual concept discovery but remains insufficient for real videos. Because images of an object may be taken from multiple angles, some VWs of a visual phrase extracted by previous methods can be occluded at certain angles, degrading extraction. This dissertation uses graph mining to extract robust visual phrases. The appearance frequencies of VWs in each image of a category are used to build VW relation graphs, from which graph mining extracts subgraphs whose density and appearance frequency both exceed specified thresholds; these subgraphs are regarded as visual phrases.
    Event-related terms (relevant terms) often receive low inverse document frequencies under the conventional TF-IDF scheme and are therefore misjudged as common terms. This dissertation proposes a term-weighting method based on the windowed document-frequency distribution to emphasize the scores of event-related terms. A sliding window is applied to the document-frequency distribution to compute the windowed distribution, and a boxplot divides the windowed document frequencies into usual and unusual groups. The relevance weight of a given term is computed from the difference between the two groups.
    In news videos, the text of an anchor shot is a concise, manually summarized digest. Aligning the anchor text with event-rich field-report shots yields a suitable brief preview video. Aligning these two media is harder than it appears, for two main reasons: precisely locating objects in a frame is difficult, and textual terms often have synonyms that cause erroneous alignments. This dissertation uses a visual language model to describe the temporal relations among frames and applies it in a sentence-level alignment method. For each news video in the database, key frames and their main objects are extracted. A bag-of-words representation maps the main objects to visual patterns, while textual terms are mapped to textual concepts using HowNet. Finally, IBM Model-1, incorporating the language model over visual patterns, aligns textual concepts with visual patterns.
    Each proposed method was implemented in an event-related video retrieval system to assess its feasibility. Experimental results show that these methods improve event-related video retrieval.

    In recent years, multimedia content has become widely available on video-sharing and news Web sites such as YouTube, CNN, and the BBC. As a useful online information aid, such content should be retrieved accurately and reorganized heuristically for users. A content-based video retrieval system faces two pressing needs: content understanding and content reorganization. Special reports in news videos are a good template for online services: related videos, whose content has been manually compared for similar concepts, are shown on the same Web page. This dissertation develops a special-report-style system for news videos that requires no human intervention. To understand content, robust visual features are used to discover the concepts appearing in each video. To reorganize retrieval results, event-related terms must be extracted, and a meaningful preview video must be generated for each retrieved video.
    Visual concept discovery (CD) has been used to fill the semantic gap in retrieving visual content. Visual words (VWs), which achieve good accuracy, have been applied to visual CD. However, multiple concepts in an image generally degrade discovery accuracy. A concept-based visual word clustering (CVWC) method is proposed to discover multiple concepts in an image without pre-segmented training images. CVWC is based on prior knowledge of concepts derived from the meta-text of Web images. Concepts are obtained by clustering the VWs extracted from image segments. A concept-based genetic algorithm (CBGA) searches for near-optimal clusters based on the importance of each VW to its concept and the co-occurrence probability of concept pairs. The clustering procedure is performed on spatially related VWs to discover all aspects representing a concept. A concept extension (CE) method iteratively updates the discovered concept set from the clustering results.
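As an illustration of the clustering idea, the following toy sketch assigns VWs to concepts with a small genetic algorithm. The fitness terms (per-VW concept importance plus a pairwise concept co-occurrence prior), the operators, and all parameters are illustrative assumptions, not the dissertation's exact CBGA (which also exploits spatial VW neighborhoods and concept extension):

```python
import random

def cbga(vws, importance, cooc, n_concepts=2, pop=30, gens=40, seed=1):
    """Toy concept-based genetic algorithm: a chromosome assigns each
    visual word to a concept; fitness sums each VW's importance for its
    concept plus a prior rewarding concept pairs that co-occur."""
    rng = random.Random(seed)

    def fitness(chrom):
        score = sum(importance[w][c] for w, c in zip(vws, chrom))
        used = set(chrom)
        score += sum(cooc.get((a, b), 0.0)
                     for a in used for b in used if a < b)
        return score

    popn = [[rng.randrange(n_concepts) for _ in vws] for _ in range(pop)]
    for _ in range(gens):
        popn.sort(key=fitness, reverse=True)
        parents = popn[:pop // 2]                # elitist truncation selection
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(vws))
            child = a[:cut] + b[cut:]            # one-point crossover
            if rng.random() < 0.2:               # random mutation
                child[rng.randrange(len(vws))] = rng.randrange(n_concepts)
            children.append(child)
        popn = parents + children
    return max(popn, key=fitness)

# Hypothetical data: w1, w2 fit concept 0; w3, w4 fit concept 1; the
# prior says concepts 0 and 1 tend to co-occur in the same image.
importance = {"w1": [1.0, 0.0], "w2": [1.0, 0.0],
              "w3": [0.0, 1.0], "w4": [0.0, 1.0]}
cooc = {(0, 1): 0.5}
best = cbga(["w1", "w2", "w3", "w4"], importance, cooc)
```

With this fitness, the optimal assignment groups w1 with w2 and w3 with w4 under different concept labels, and the elitist selection keeps the best chromosome once it is found.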
    Although visual phrases (VPs), sets of VWs, improve the accuracy of visual CD, they are not robust enough for real videos. Because an object may be imaged from multiple angles, some VWs in a VP can be occluded at certain angles, degrading VP extraction performance. This dissertation presents an approach to robust visual phrase extraction using graph mining. The concurrent appearance of VW pairs is estimated over all category-related images in a database, and the appearance frequencies of VWs in each image are used to construct VW relation graphs. Graph mining then extracts, as VPs, the subgraphs whose density and appearance frequency both exceed specified thresholds.
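The mining step can be sketched as follows: build a VW relation graph from co-occurrence counts, then greedily peel low-degree nodes until the remaining subgraph is dense enough. The greedy peeling heuristic and the threshold values are assumptions for illustration; the dissertation's graph-mining procedure is more elaborate:

```python
from itertools import combinations

def vw_graph(images, min_freq=2):
    """Build a visual-word relation graph: an edge links two VWs that
    co-occur in at least `min_freq` images of the same category."""
    cooc = {}
    for vws in images:                           # each image = set of VW ids
        for a, b in combinations(sorted(vws), 2):
            cooc[(a, b)] = cooc.get((a, b), 0) + 1
    return {e for e, f in cooc.items() if f >= min_freq}

def dense_subgraph(edges, min_density=1.0):
    """Greedy peeling: repeatedly drop the lowest-degree node until the
    remaining subgraph's density (|E|/|V|) reaches the threshold.  The
    surviving VW set is taken as a visual phrase."""
    nodes = {v for e in edges for v in e}
    while nodes:
        sub = [e for e in edges if e[0] in nodes and e[1] in nodes]
        if len(sub) / len(nodes) >= min_density:
            return nodes
        deg = {v: 0 for v in nodes}
        for a, b in sub:
            deg[a] += 1
            deg[b] += 1
        nodes = nodes - {min(nodes, key=lambda v: deg[v])}
    return set()

# Toy category: VWs 1-3 co-occur in most images, VW 9 appears once.
images = [{1, 2, 3}, {1, 2, 3, 9}, {1, 2, 3}, {1, 2}]
phrase = dense_subgraph(vw_graph(images))       # -> {1, 2, 3}
```

Because the rare pairs involving VW 9 fall below `min_freq`, only the frequent, dense triangle of VWs 1-3 survives as the visual phrase.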
    Event-related terms, called relevant terms (RTs), often obtain low inverse document frequencies under the conventional term frequency-inverse document frequency (TF-IDF) scheme and are thus misjudged as common terms. A term scoring method is proposed to enhance the weights of RTs in an event by considering the windowed document-frequency distribution, computed by applying a sliding window to the document-frequency distribution. The weight of a given term is determined by the difference between the means of the usual and unusual term groups, which are separated using a boxplot.
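A minimal sketch of this weighting, assuming a sum-over-window statistic and a split at the boxplot's third quartile (the dissertation's exact grouping rule may differ):

```python
import statistics

def windowed_df(doc_freqs, window=3):
    """Sum document frequencies inside each sliding-window position."""
    return [sum(doc_freqs[i:i + window])
            for i in range(len(doc_freqs) - window + 1)]

def relevance_weight(doc_freqs, window=3):
    """Score a term by splitting its windowed document frequencies into
    'usual' and 'unusual' groups at the boxplot's third quartile and
    taking the difference of the two group means."""
    wdf = windowed_df(doc_freqs, window)
    q1, _, q3 = statistics.quantiles(wdf, n=4)   # boxplot quartiles
    unusual = [v for v in wdf if v > q3]         # bursty windows
    usual = [v for v in wdf if v <= q3]
    if not unusual:                              # evenly spread term
        return 0.0
    return statistics.mean(unusual) - statistics.mean(usual)

# A relevant term is bursty around the event; a common term is flat,
# even though both may have similar overall document frequencies.
bursty = [0, 1, 0, 9, 10, 8, 0, 1, 0, 0]        # df per time period
common = [3, 4, 3, 4, 3, 4, 3, 4, 3, 4]
```

The bursty term's windowed distribution has a clear unusual group and receives a large weight, while the common term's flat distribution yields none.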
    A preview of a news video is generated by semantically aligning the textual sentences of the anchor report, summarized by the anchor, with visual field-report shots. Because accurately detecting an object in a visual shot is difficult, and a textual term generally corresponds to several synonyms, aligning an anchor sentence with a video shot remains challenging. The temporal relations among the frames in a visual shot are characterized by a visual language model, and these language model-based temporal relations are applied to sentence-based alignment. The Bag-of-Words (BoW) representations of the main objects in the key frames of a visual shot are first mapped to the visual patterns obtained from the news-video database. Second, textual terms in the report sentence are mapped to the textual concepts obtained from the HowNet knowledge base. Finally, unsupervised alignment between the textual concepts and the visual patterns in the news videos is performed using IBM Model-1.
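The final step uses IBM Model-1; a minimal EM sketch, without the visual language model component and on hypothetical concept/pattern tokens, looks like this:

```python
from collections import defaultdict

def ibm_model1(pairs, iters=10):
    """EM estimation of translation probabilities t(v | w) between
    textual concepts w and visual patterns v (plain IBM Model-1; the
    dissertation additionally incorporates a visual language model)."""
    t = defaultdict(lambda: 1.0)                 # uniform initialization
    for _ in range(iters):
        count = defaultdict(float)
        total = defaultdict(float)
        for words, patterns in pairs:            # one sentence/shot pair
            for v in patterns:
                norm = sum(t[(v, w)] for w in words)
                for w in words:
                    c = t[(v, w)] / norm         # expected alignment count
                    count[(v, w)] += c
                    total[w] += c
        for (v, w), c in count.items():          # M-step: renormalize
            t[(v, w)] = c / total[w]
    return t

# Hypothetical corpus: the concept "car" co-occurs with the visual
# pattern "p_car", and "tree" with "p_tree"; EM disambiguates them.
pairs = [(["car", "tree"], ["p_car", "p_tree"]),
         (["car"], ["p_car"]),
         (["tree"], ["p_tree"])]
t = ibm_model1(pairs)
```

After a few EM iterations, `t[("p_car", "car")]` dominates `t[("p_car", "tree")]`, so each visual pattern aligns to the correct textual concept.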
    These methods were implemented in each component of an event-related video retrieval system for performance evaluation. The experimental results show that the proposed approaches achieved improvements in event-related video retrieval.

    Chinese Abstract
    Abstract
    Acknowledgement
    Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Motivation
      1.2 Contributions
      1.3 The Organization of this Dissertation
    Chapter 2 Introduction to Concept Discovery for Video Retrieval
      2.1 Concept Discovery
        2.1.1 Concept Definition
      2.2 Bag-of-Visual Words
        2.2.1 Scale-Invariant Feature Transform (SIFT) (Lowe, 2004)
        2.2.2 Visual Word (Sivic & Zisserman, 2003)
        2.2.3 Visual Phrase (Zheng et al., 2006)
        2.2.4 Concept Detection
      2.3 Video Retrieval
      2.4 System Overview of This Dissertation
    Chapter 3 Event-Related Term Extraction
      3.1 Text-Based Video Retrieval
      3.2 System Overview
        3.2.1 Video Retrieval System
        3.2.2 Video Segmentation and Keyframe Selection
        3.2.3 Vector Space Model (Salton, 1975)
        3.2.4 Property of Relevant Terms Distribution for an Event
      3.3 Term Frequency - Relevant Document Frequency
        3.3.1 Re-sampling
        3.3.2 Boxplot-based Grouping
      3.4 Summary of This Chapter
    Chapter 4 Robust Visual Phrase Extraction
      4.1 Visual Phrase Extraction
      4.2 System Framework
      4.3 Representation of Visual Word
      4.4 Visual Phrase Extraction
      4.5 Summary of This Chapter
    Chapter 5 Multiple Visual Concept Discovery
      5.1 Visual Concept Discovery
      5.2 Overview of Multiple Visual Concept Discovery
      5.3 Concept-Based Visual Word Clustering
        5.3.1 Chromosome Representation
        5.3.2 CBGA-Based Evolution
        5.3.3 CBGA-Based Evaluation: Fitness Function
        5.3.4 Concept Extension
      5.4 Summary of This Chapter
    Chapter 6 Brief Preview Video Generation
      6.1 Multimedia Content Alignment
      6.2 System Framework
        6.2.1 Training
        6.2.2 Test
      6.3 Pre-Processing
        6.3.1 Key term extraction
        6.3.2 Key frame extraction
      6.4 Textual Concept Discovery
        6.4.1 HowNet term mapping
        6.4.2 Unknown term mapping
        6.4.3 Noun concept mapping
      6.5 Visual Pattern Discovery
        6.5.1 Visual word extraction
        6.5.2 Object-of-Interest extraction
        6.5.3 Visual pattern dictionary generation and visual pattern mapping
      6.6 Concept-Based Alignment Model
      6.7 Summary of This Chapter
    Chapter 7 Evaluations
      7.1 Database
      7.2 Evaluations of Relevant Term Extraction
        7.2.1 Experimental Setup
        7.2.2 Analysis by Manually Generated Data
        7.2.3 Analysis by Real Data
      7.3 Evaluations of Visual Phrase Extraction
        7.3.1 Experimental setup
        7.3.2 Thresholds θ_edge, θ_freq, and θ_dense
        7.3.3 Evaluation of image retrieval
      7.4 Evaluations of Multiple Concept Discovery
        7.4.1 Experimental Setup
        7.4.2 Estimation of CVWC Parameters
        7.4.3 Case Study for Video Retrieval
      7.5 Evaluations of Brief Preview Video Generation
        7.5.1 Experimental setup
        7.5.2 Evaluation of thresholds
        7.5.3 Evaluation of alignment approaches
        7.5.4 Evaluation of textual concept discovery
        7.5.5 Evaluation of OOI extraction
    Chapter 8 Conclusion and Future Work
      8.1 Conclusion
      8.2 Future work
    Bibliography
    Publication

    Aner, A., & Kender, J. R. (2001). A unified memory-based approach to cut, dissolve, key frame and scene analysis. In Proc. IEEE International Conference on Image Processing (ICIP).
    Bay, H., Ess, A., Tuytelaars, T., & Gool, L. V. (2008). Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding, 110, 346-359. doi: 10.1016/j.cviu.2007.09.014.
    BBC. (2012). BBC News Video. Retrieved July, 2012, from http://www.bbc.co.uk/news/video_and_audio/
    Becker, J., & Kuropka, D. (2003). Topic-based Vector Space Model. In Proc. International Conference on Business Information Systems (BIS).
    Berg, T. L., Berg, A. C., Edwards, J., Maire, M., White, R., Teh, Y. W., Learned-Miller, E., & Forsyth, D. A. (2004). Names and Faces in the News. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
    Bosch, A., Zisserman, A., & Munoz, X. (2008). Scene Classification Using a Hybrid Generative/Discriminative Approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(4), 712-727. doi: 10.1109/TPAMI.2007.70716.
    Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., & Mercer, R. L. (1993). The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19, 263-311.
    Buckley, C., & Voorhees, E. (2000). Evaluating evaluation measure stability. In Proc. ACM Special Interest Group on Information Retrieval (SIGIR), 33-40. doi: 10.1145/345508.345543.
    Bunday, B. D. (1984). Basic Optimisation methods. London: Edward Arnold. ISBN: 978-0713135060.
    Burghouts, G. J., & Geusebroek, J. M. (2009). Performance evaluation of local colour invariants. Computer Vision and Image Understanding, 113, 48-62. doi: 10.1016/j.cviu.2008.07.003.
    Cao, J., Lan, Y., Li, J., Li, Q., Li, X., Lin, F., Liu, X., Luo, L., Peng, W., Wang, D., Wang, H., Wang, Z., Xiang, Z., Yuan, J., Zheng, W., Zhang, B., Zhang, J., Zhang, L., & Zhang, X. (2006). Intelligent multimedia group of Tsinghua University at TRECVID 2006. In Proc. TRECVID Workshop.
    Chang, S. F., Ellis, D., Jiang, W., Lee, K., Yanagawa, A., Loui, A. C., & Luo, J. (2007). Large-Scale Multimodal Semantic Concept Detection for Consumer Video. In Proc. ACM Multimedia Information Retrieval (MIR).

    Chang, C. C., & Lin, C. J. (2012). LIBSVM: A Library for Support Vector Machines. Retrieved July 2012, from http://www.csie.ntu.edu.tw/~cjlin/libsvm.
    Chen, K. J. (2000). CKIP Chinese Word Segmentation System, Chinese Knowledge and Information Processing Group. Retrieved July, 2012, from http://ckipsvr.iis.sinica.edu.tw/
    CNN. (2012). Special Coverage International - CNN.com. Retrieved July, 2012, from http://edition.cnn.com/SPECIALS/.
    Cour, T., Benezit, F., & Shi, J. (2005). Spectral Segmentation with Multiscale Graph Decomposition. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1124-1131. doi: 10.1109/CVPR.2005.332.
    Csurka, G., Dance, C., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. In Proc. ECCV Workshop on Statistical Learning in Computer Vision.
    Dong, Z., & Dong, Q. (2000). An Introduction to HowNet. Retrieved July 2012, from http://www.keenage.com.
    Fergus, R., Li, F. F., Perona, P., & Zisserman, A. (2005). Learning object categories from Google's image search. In Proc. IEEE International Conference on Computer Vision (ICCV).
    Flickr. (2012). Flickr. Retrieved July, 2012, from http://www.flickr.com/.
    Funt, B. V., & Finlayson, G. D. (1995). Color constant color indexing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(5), 522-529.
    Geusebroek, J. M., Boomgaard, R., Smeulders, A. W. M., & Geerts, H. (2001). Color invariance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23, 1338–1350. doi:10.1109/34.977559.
    Gonfaus, J. M., Boix, X., van de Weijer, J., Bagdanov, A. D., Serrat, J., & Gonzàlez, J. (2010). Harmony Potentials for Joint Classification and Segmentation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3280-3287. doi: 10.1109/CVPR.2010.5540048.
    Google. (2012). Google Image. Retrieved July, 2012, from http://www.google.com/imghp.
    Google. (2012). Youtube. Retrieved July, 2012, from http://www.youtube.com/.
    Grady, L. (2006). Random walks for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11), 1768-1783. doi: 10.1109/TPAMI.2006.233.
    Grady, L., & Schwartz, E. L. (2006). Isoperimetric Graph Partitioning for Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(3), 469-475. doi: 10.1109/TPAMI.2006.57.
    Harris, C., & Stephens, M. (1988). A combined corner and edge detector. In Proc. The Fourth Alvey Vision Conference (AVC), 147-151.
    Hess, R. SIFT Library. Retrieved July 2012, from http://web.engr.oregonstate.edu/~hess/index.html.
    Hoang, M. A., Geusebroek, J. M., & Smeulders, A. W. M. (2005). Color texture measurement and segmentation. Signal Processing, 85, 265–275. doi:10.1016/j.sigpro.2004.10.009.
    Huiskes, M. J., & Lew, M. S. (2008). The MIR Flickr Retrieval Evaluation. In Proc. The ACM Multimedia Information Retrieval (MIR), 39-43. doi:10.1145/1460096.1460104.
    Huang, Y., Shekhar, S., & Xiong, H. (2004). Discovering Colocation Patterns from Spatial Data Sets: A General Approach. IEEE Transactions on Knowledge and Data Engineering, 16(12).
    Hanjalic, A., & Xu, L. Q. (2005). Affective Video Content Representation and Modeling. IEEE Transactions on Multimedia, 7(1), 143-154.
    Hu, H., Yan, X., Huang, Y., Han, J., & Zhou, X. J. (2005). Mining coherent dense subgraphs across massive biological networks for functional discovery. Bioinformatics, 21(1), 213-221.
    Mori, Y., Takahashi, H., & Oka, R. (1999). Image-to-Word transformation based on dividing and vector quantizing images with words. In Proc. First International Workshop on Multimedia Intelligent Storage and Retrieval Management (MISRM).
    Jian, M., Dong, J., & Tang, R. (2007). Combining Color, Texture and Region with Object of User’s Interest for Content-Based Image retrieval. In Proc. Eighth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD).
    Kim, J. G., Chang, H. S., Kang, K., Kim, M., Kim, J., & Kim, H. M. (2004). Summarization of News Video and Its Description for Content-based Access. International Journal Imaging System Technology, 13(5), 267-274.
    Kass, M., Witkin, A., & Terzopoulos, D. (1988). Snakes: Active contour models. International Journal of Computer Vision, 1(4), 321-331. doi: 10.1007/BF00133570.
    Liu, D., & Chen, T. (2008). DISCOV: A Framework for Discovering Objects in Video. IEEE Transactions on Multimedia, 10(2), 200-208.
    Lee, C. S., Jian, Z. W., & Huang, L. K. (2005). A Fuzzy Ontology and Its Application to News Summarization. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 35(5), 859-880.
    Lin, F. R., & Liang, C. H. (2008). Storyline-based summarization for news topic retrospection. Decision Support System, 45(3), 473-490.
    Liu, Y., Mei, T., Qi, G., Wu, X., & Hua, X. S. (2008). Query-Independent Learning for Video Search. In Proc. IEEE International Conference on Multimedia and Expo (ICME).
    Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91-110.
    Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
    Li, X., Wang, D., Li, J., & Zhang, B. (2007). Video Search in Concept Subspace: A Text-Like Paradigm. In Proc. International Conference on Content-based Image and Video Retrieval (CIVR).
    Ma, Y. F., Lu, L., Zhang, H. J., & Li, M. J. (2002). A User attention Model for Video Summarization. In Proc. ACM Multimedia (MM).
    Moosmann, F., Nowak, E., & Jurie, F. (2008). Randomized clustering forests for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 1632–1646. doi: 10.1109/TPAMI.2007.70822.
    Plath, N., Toussaint, M., & Nakajima, S. (2009). Multi-class image segmentation using Conditional Random Fields and Global Classification. In Proc. International Conference on Machine Learning (ICML), 817-824. doi: 10.1145/1553374.1553479.
    Och, F. J., & Ney, H. (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1), 19-51.
    Peng, Y., & Ngo, C.W. (2006). Clip-Based Similarity Measure for Query-Dependent Clip Retrieval and Video Summarization. IEEE Transactions on Circuits and Systems for Video Technology, 16, 612-627.
    Pan, J. Y., Yang, H., & Faloutsos, C. (2004). MMSS: Multi-Modal Story-Oriented Video Summarization. In Proc. IEEE International Conference on Data Mining (ICDM).
    Pass, G., & Zabih, R. (1996). Histogram refinement for content-based image retrieval. In Proc. IEEE workshop on application of computer vision (WACV), 96-102.
    Qi, G. J., Hua, X. S., Rui, Y., Tang, J., Mei, T., & Zhang, H. J. (2007). Correlative Multi-Label Video Annotation. In Proc. ACM Multimedia (MM).
    Russell, B. C., Freeman, W. T., Efros, A. A., Sivic, J., & Zisserman, A. (2006). Using Multiple Segmentations to Discover Objects and their Extent in Image Collections. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1605-1614. doi: 10.1109/CVPR.2006.326.
    Randen, T., & Husoy, J. H. (1999). Filtering for texture classification: A comparative study. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21, 291–310. doi:10.1109/34.761261.
    Gonzalez, R. C., & Woods, R. E. (2007). Digital Image Processing (3rd Edition). Prentice Hall. ISBN: 978-0131687288.
    Saux, B. L., & Amato, G. (2004). Image Recognition for Digital Libraries. In Proc. ACM Multimedia Information Retrieval (MIR).
    Sakoe, H., & Chiba, S. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 26, 43-49.
    van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2008). Evaluation of color descriptors for object and scene recognition. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 23-28. doi: 10.1109/CVPR.2008.4587658.
    Shi, J. Normalized Cuts. Retrieved July 2012, from http://www.cis.upenn.edu/~jshi/software/demo1.html.
    Shi, J., & Malik, J. (2000). Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888-905.
    Smith, J. R., Naphade, M. R., & Natsev, A. P. (2003). Multimedia semantic indexing using model vectors. In Proc. the IEEE International Conference on Multimedia and Expo (ICME), 445-448. doi:10.1109/ICME.2003.1221649.
    Sun, Y., Shimada, S., & Morimoto, M. (2006). Visual pattern discovery using web images. In Proc. ACM International Conference on Multimedia Information Retrieval (MIR).
    Salton, G., Wong, A., & Yang, C. S. (1975). A Vector Space Model for Automatic Indexing. Communications of the ACM, 18, 613-620.
    Sivic, J., & Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In Proc. IEEE International Conference on Computer Vision (ICCV).
    Tirilly, P., Claveau, V., & Gros, P. (2008). Language modeling for bag-of-visual words image categorization. In Proc. International Conference on Image and Video Retrieval (CIVR).
    Vedaldi, A., & Fulkerson, B. An Open and Portable Library of Computer Vision Algorithms. Retrieved July 2012, from http://www.vlfeat.org.
    Volkmer, T., & Natsev, A. P. (2006). Exploring Automatic Query Refinement for Text-Based Video Retrieval. In Proc. the IEEE International Conference on Multimedia and Expo (ICME).

    Voorhees, E. M. (1998). Variations in relevance judgments and the measurement of retrieval effectiveness. In Proc. ACM Special Interest Group on Information Retrieval (SIGIR), 697-716. doi: 10.1145/290941.291017.
    Vincent, L., & Soille, P. (1991). Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 583-598. doi: 10.1109/34.87344.
    Wang, H. M., Chen, B., Kuo, J. W., & Cheng, S. S. (2005). MATBN: A Mandarin Chinese Broadcast News Corpus. International Journal of Computational Linguistics and Chinese Language Processing, 10(2), 219-236.
    Wang, Y., & Gong, S. (2007). Translating Topics to Words for Image Annotation. In Proc. ACM Conference on Information and Knowledge Management (CIKM).
    Wu, C. H., Hsieh, C. H., & Huang, C. L. (2007). Speech Sentence Compression based on Speech Segment Extraction and Concatenation. IEEE Transactions on Multimedia, 9(2), 434-437.
    Wu, L., Hu, Y., Li, M., Yu, N., & Hua, X. S. (2009). Scale-Invariant Visual Language Modeling for Object Categorization. IEEE Transactions on Multimedia, 11(2), 286-294.
    Wong, R. C. F., & Leung, C. H. C. (2008). Automatic Semantic Annotation of Real-World Web Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), 1933-1944.
    Wang, D., Liu, X., Luo, L., Li, J., & Zhang, B. (2007). Video diver: Generic video indexing with diverse features. In Proc. ACM Multimedia Information Retrieval (MIR), 61-70. doi: 10.1145/1290082.1290094.
    Wang, W., Luo, Y., & Tang, G. (2008). Object retrieval using configurations of salient regions. In Proc. International Conference on Content-based Image and Video Retrieval (CIVR).
    Wikipedia. (2012). Golden Ratio. Retrieved July, 2012, from http://en.wikipedia.org/wiki/Golden_ratio.
    Wu, H. C., Luk, R. W. P., Wong, K. F., & Kwok, K. L. (2008). Interpreting TF-IDF term weights as making relevance decisions. ACM Transactions on Information Systems, 26, 1-37.
    Walpole, R. E., Myers, R. H., Myers, S. L., & Ye, K. (2006). Probability & Statistics for Engineers & Scientists, Pearson.
    Xu, C., Wang, J., Lu, H., & Zhang, Y. (2008). A Novel Framework for Semantic Annotation and Personalized Retrieval of Sports Video. IEEE Transactions on Multimedia, 10(3), 421-436.

    Yahoo. (2012). 生活影新聞 (Life video news). Retrieved July, 2012, from http://tw.video.news.yahoo.com/video.
    Yanai, K. (2008). Web image selection with PLSA. In Proc. IEEE International Conference on Multimedia and Expo (ICME).
    Yan, X., Mehan, M. R., Huang, Y., Waterman, M. S., Yu, P. S., & Zhou, X. J. (2007). A graph-based approach to systematically reconstruct human transcriptional regulatory modules. Bioinformatics, 23(13), 577-586.
    Yeh, J. B., & Wu, C. H. (2008). Video News Retrieval incorporating Relevant Terms based on Distribution of Document Frequency. in Proc Pacific-Rim Conference on Multimedia (PCM), 583-592. doi: 10.1007/978-3-540-89796-5_60.
    Yeh, J. B., & Wu, C. H. (2009). Extraction of query term-related visual phrases for news video retrieval using mutual information. In Proc. IEEE International Symposium on Circuits and Systems (ISCAS), 730-733. doi: 10.1109/ISCAS.2009.5117852.
    Yeh, J. B., & Wu, C. H. (2010). Extraction of robust visual phrases using graph mining for image retrieval. In Proc. IEEE International Symposium on Circuits and Systems (ISCAS), 3681-3684. doi: 10.1109/ISCAS.2010.5537760.
    Yeh, J. B., Wu, C. H., & Chang, S. X. (2011). Unsupervised alignment of news video and text using visual patterns and textual concepts. IEEE Transactions on Multimedia, 13(2), 206-215. doi: 10.1109/TMM.2010.2095412.
    Yeh, J. B., Wu, C. H., & Mai, S. X. Multiple Visual Concept Discovery Using Concept-Based Visual Word Clustering. Multimedia Systems, to be published. doi: 10.1007/s00530-012-0294-9.
    Yeo, B. L., & Liu, B. (1995). Rapid Scene Analysis on Compressed Video. IEEE Transactions on Circuits and Systems for Video Technology, 5, 533-544.
    Yuan, J., Wu, Y., & Yang, M. (2007). Discovery of Collocation Patterns: from Visual Words to Visual Phrases. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
    Zha, Z. J., Liu, Y., Mei, T., & Hua, X. S. (2008). Video Concept Detection Using Support Vector Machines - TRECVID 2007 Evaluations. Technical Report, MSR-TR-2008-10.
    Zheng, Q. F., & Gao, W. (2008). Constructing visual phrases for effective and efficient object-based image retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications, 6(1). doi: 10.1145/1404880.1404887.
    Zheng, Q. F., Wang, W. Q., & Gao, W. (2006). Effective and Efficient Object-based Image Retrieval Using Visual Phrases. In Proc. ACM Multimedia (MM).

    Full-text availability: on campus 2013-08-31; off campus 2013-08-31