
Author: Tsai, Cheng-Che (蔡承哲)
Thesis Title: Interactive Networks with Click Enhancement and Large Window Attention for Multiple Objects Segmentation (點擊強化寬注意力之互動式多物件影像切割網路)
Advisor: Yang, Jar-Ferr (楊家輝)
Degree: Master
Department: Institute of Computer & Communication Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2024
Graduation Academic Year: 112 (2023–2024)
Language: English
Number of Pages: 52
Keywords: deep learning, computer vision, interactive image segmentation, vision transformer

    In recent years, image segmentation has emerged as a prominent task in computer vision. This process partitions an RGB image into multiple segments, also known as image regions or objects (sets of pixels). Image segmentation is commonly categorized into three types: instance segmentation, semantic segmentation, and panoptic segmentation. Unlike these, click-based interactive image segmentation (IIS) is a specialized task that involves a human in the loop. In addition to the input image, a click prompt serves as a second crucial input to the model, and it plays a pivotal role in guiding the model to produce high-quality segmentation results. While most studies on interactive image segmentation have focused on instance segmentation, this thesis proposes a network that not only achieves efficient instance segmentation but also delivers semantic results with limited user interaction. The proposed IIS network integrates click enhancement (CE) and large window attention (LWA) modules, which process the click prompts in the middle of the network. Experimental results show that the proposed CE and LWA modules enable the model to achieve competitive results compared with other interactive image segmentation methods. We have also implemented dual-mode interactive image segmentation in software, allowing users to freely switch between segmentation modes according to their usage scenarios.
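    The abstract describes a click prompt that is fed to the network alongside the image. As background only, here is a minimal sketch of the disk-map click encoding that is common in click-based IIS systems (the thesis covers its own encoding in its Section 2.5, which this record does not reproduce); the function name, radius, and two-channel layout are illustrative assumptions, not the thesis's exact design.

```python
import numpy as np

def encode_clicks(height, width, pos_clicks, neg_clicks, radius=5):
    """Rasterize positive/negative clicks into two binary disk maps,
    to be concatenated channel-wise with the RGB image as model input."""
    yy, xx = np.mgrid[0:height, 0:width]
    maps = np.zeros((2, height, width), dtype=np.float32)
    for channel, clicks in enumerate((pos_clicks, neg_clicks)):
        for cy, cx in clicks:
            # Mark every pixel within `radius` of the click location.
            maps[channel][(yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2] = 1.0
    return maps  # shape (2, H, W)

# Example: one positive click on the object, one negative click on background.
click_maps = encode_clicks(256, 256, pos_clicks=[(120, 128)], neg_clicks=[(30, 40)])
```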

    Table of Contents:
    Abstract (Chinese)
    Abstract
    Acknowledgements
    Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
      1.1 Research Background
      1.2 Motivations
      1.3 Thesis Organization
    Chapter 2 Related Work
      2.1 Interactive Image Segmentation (IIS)
      2.2 Pyramid Pooling Transformer (P2T)
      2.3 Large Window Attention (LWA)
      2.4 Mask Correction Network
      2.5 Click Encoding
      2.6 Interactive Training Strategy
    Chapter 3 The Proposed Interactive Image Segmentation System
      3.1 Overview of the Proposed System
      3.2 Data Pre-processing
      3.3 Click Simulator
      3.4 The Proposed Interactive Model
      3.5 Click Enhancement Large Window Attention Module (CELWAM)
        3.5.1 Large Window Attention ASPP
        3.5.2 Click Enhancement Module
      3.6 Loss Function
    Chapter 4 Experiment Results
      4.1 Environment Settings and Dataset
      4.2 Training Details
      4.3 Evaluation Metrics
        4.3.1 Number of Clicks (NoC) (see the sketch after this outline)
        4.3.2 Number of Failure Cases (NoF)
      4.4 Comparison with Other Methods
      4.5 Ablation Study
      4.6 Visualization Result
      4.7 System Implementation
    Chapter 5 Conclusions
    Chapter 6 Future Work
    References
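    Section 4.3.1 of the outline names the Number of Clicks (NoC) metric. The thesis's exact protocol is not reproduced on this page, but the sketch below shows the NoC evaluation loop as it is standardly defined in the IIS literature: clicks are simulated at the most interior point of the current error region until the prediction reaches a target IoU (e.g., 90%) or a click budget (typically 20) is exhausted, with runs that hit the cap counted as failure cases (NoF). The `predict` callable and all other names are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def iou(pred, gt):
    """Intersection over union of two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / max(union, 1)

def simulate_click(pred, gt):
    """Place the next click at the most interior point of the mislabeled
    area: a positive click on false negatives if they dominate,
    otherwise a negative click on false positives."""
    fn = np.logical_and(gt, np.logical_not(pred))
    fp = np.logical_and(np.logical_not(gt), pred)
    is_positive = fn.sum() >= fp.sum()
    dist = distance_transform_edt(fn if is_positive else fp)
    cy, cx = np.unravel_index(dist.argmax(), dist.shape)
    return (cy, cx), is_positive

def number_of_clicks(predict, image, gt, iou_target=0.90, max_clicks=20):
    """NoC: simulated clicks needed to reach iou_target, capped at
    max_clicks (runs that hit the cap are the NoF failure cases)."""
    clicks = []
    pred = np.zeros_like(gt, dtype=bool)
    for k in range(1, max_clicks + 1):
        clicks.append(simulate_click(pred, gt))
        pred = predict(image, clicks)  # hypothetical model call -> bool mask
        if iou(pred, gt) >= iou_target:
            return k
    return max_clicks
```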

