
Author: Yang, Ching-Ching (楊晴晴)
Title: Weakly-Supervised Deep Image Hashing based on Cross-Modal Transformer (基於跨模態深度學習之弱監督式影像雜湊技術)
Advisor: Chu, Wei-Ta (朱威達)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2022
Academic Year of Graduation: 110 (ROC calendar, 2021-2022)
Language: English
Number of Pages: 31
Keywords (Chinese): Vision Transformer; weakly-supervised hashing; cross-modal image retrieval
Keywords (English): Vision Transformer, Weakly-Supervised Hashing, Cross-Modal Retrieval
Abstract (Chinese): Weakly-supervised image hashing has emerged in recent years to exploit the abundance of web images that come with noisy text tags: the attached text is weakly related to the image content and can therefore help train a deep hashing model. However, how to extract useful information from these complex images and tags, and how to discover the association between the two kinds of features, remain challenging problems. In this thesis, we apply vision transformers to deep hashing and propose a new method, Weakly-supervised deep Hashing based on Cross-Modal Transformer (WHCMT). First, cross-scale attention is found to extract image features more effectively, and a baseline transformer that computes self-attention among tags is used to form tag features. Second, the proposed cross-modal transformer learns cross-modal attention between image and tag features. Finally, embedding layers generate the hash codes as the final output. In our image retrieval experiments, WHCMT performs on par with the current best results on both the MIRFLICKR-25K and NUS-WIDE datasets.

Abstract (English): Weakly-supervised image hashing has emerged recently because web images associated with contextual text or tags (though noisy) are abundant, and this text information, weakly related to the images, can be utilized to guide the learning of a deep hashing network. However, how to find effective representations of complex images and noisy tags, and how to discover the cross-modal association between them, are still under-explored and challenging problems. In this paper, we introduce vision transformers to deep image hashing and propose a method named Weakly-supervised deep Hashing based on Cross-Modal Transformer (WHCMT). First, cross-scale attention between image patches is discovered to form more effective visual representations. A baseline transformer is also adopted to find self-attention in the set of tags and form tag representations. Second, the cross-modal attention between images and tags is discovered by the proposed cross-modal transformer. Effective hash codes are then generated by embedding layers. WHCMT is tested on semantic image retrieval, and we show that new state-of-the-art results can be obtained on the MIRFLICKR-25K and NUS-WIDE datasets.
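
From the pipeline the abstract describes (tag self-attention, cross-modal attention between image and tag features, and embedding layers that emit hash codes), a minimal PyTorch sketch can be pieced together. Everything here is an illustrative assumption rather than the thesis' actual implementation: the class name WHCMTSketch, all dimensions and layer counts, and the use of a single multi-head attention block to stand in for the cross-modal transformer.

```python
# Minimal sketch of the WHCMT pipeline described in the abstract.
# All names, sizes, and layer counts are assumptions for illustration.
import torch
import torch.nn as nn

class WHCMTSketch(nn.Module):
    def __init__(self, dim=256, n_bits=64, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        # Baseline transformer: self-attention over the set of tags.
        self.tag_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Stand-in for the cross-modal transformer: image tokens query
        # the encoded tag tokens.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads,
                                                batch_first=True)
        # Embedding layer mapping the fused representation to hash bits.
        self.hash_head = nn.Linear(dim, n_bits)

    def forward(self, img_tokens, tag_tokens):
        # img_tokens: (B, P, dim) patch tokens from a vision transformer
        # backbone; tag_tokens: (B, T, dim) embedded tags.
        tags = self.tag_encoder(tag_tokens)
        fused, _ = self.cross_attn(img_tokens, tags, tags)
        # Relaxed codes in (-1, 1) for training; binarize with
        # torch.sign() at retrieval time.
        return torch.tanh(self.hash_head(fused.mean(dim=1)))
```

At retrieval time the continuous output would be binarized, e.g. `codes = torch.sign(model(img_tokens, tag_tokens))`, and ranking done by Hamming distance.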

Table of Contents:
Chinese Abstract i
Abstract ii
Table of Contents iii
List of Tables v
List of Figures vi
Chapter 1. Introduction 1
    1.1. Motivation 1
    1.2. Issues of Weakly-Supervised Task 2
    1.3. Contributions 3
    1.4. Thesis Organization 3
Chapter 2. Related Works 4
    2.1. Vision Transformer 4
    2.2. Image Hashing 4
    2.3. Summary 6
Chapter 3. The Proposed Model 7
    3.1. Transformer 7
        3.1.1. Transformer Encoder 7
        3.1.2. Vision Transformer 8
    3.2. The Proposed WHCMT 12
        3.2.1. Overview 12
        3.2.2. Cross-Attention Vision Transformer 14
        3.2.3. Tag Transformer 15
        3.2.4. Cross-Modal Transformer 15
    3.3. Loss Functions 16
        3.3.1. Pairwise Similarity Loss 16
        3.3.2. Hinge Loss 17
        3.3.3. Quantization Loss 17
    3.4. Summary 18
Chapter 4. Experimental Results 19
    4.1. Experimental Settings 19
        4.1.1. Datasets 19
        4.1.2. Training Details 20
    4.2. Performance Evaluation 20
        4.2.1. Performance Comparison 21
        4.2.2. Weightings for Losses 21
        4.2.3. Numbers of Transformer Encoders 23
        4.2.4. Performance Variations of Different Frameworks 23
        4.2.5. Sample Results 24
        4.2.6. Attention Visualization 25
    4.3. Limitation and Discussion 27
Chapter 5. Conclusion 28
    5.1. Conclusion 28
    5.2. Future Work 28
References 29
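
The table of contents names three training objectives (pairwise similarity, hinge, and quantization losses) but the record does not give their definitions. For orientation only, common deep-hashing forms of the first and third, with $b_i \in (-1,1)^K$ a relaxed $K$-bit code and $s_{ij} \in \{0,1\}$ a pairwise label derived from shared tags, are:

```latex
% Assumed standard formulations, not the thesis' exact definitions.
% Pairwise similarity loss: align code inner products with labels.
L_{\mathrm{sim}} = \sum_{i,j} \Big( \tfrac{1}{K}\, b_i^{\top} b_j - (2 s_{ij} - 1) \Big)^{2}
% Quantization loss: push relaxed codes toward binary values \pm 1.
L_{\mathrm{qua}} = \sum_{i} \big\lVert \, |b_i| - \mathbf{1} \, \big\rVert_{2}^{2}
```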


Full text available: on campus 2023-09-01; off campus 2023-09-01.