
Author: Lee, Guan-Cheng (李冠澄)
Thesis title: 2DDATA - 2D Detection Annotations Transmittable Aggregation for Semantic Segmentation on Point Cloud (二維圖像標記資訊融合光達之點雲分割)
Advisor: Yang, Jar-Ferr (楊家輝)
Degree: Master
Department: MS Degree Program on Intelligent Technology Systems, Miin Wu School of Computing
Year of publication: 2024
Graduation academic year: 112 (2023-2024)
Language: English
Number of pages: 47
Keywords: Lidar, Point Cloud Segmentation, 3D Detection, Sensor Fusion, Semantic Segmentation Model, Image Feature Extractor
In recent years, with the rapid development of robotics and autonomous driving, the high-precision sensing these systems require has drawn growing attention. Traditional deep learning models for computer vision, such as convolutional neural networks (CNNs), can no longer meet their safety requirements, and even larger Transformer models fall short on high-precision recognition tasks in the three-dimensional world. This has given rise to sensor fusion solutions, which, in view of the limitations of RGB cameras, turn to the cooperation and fusion of multiple heterogeneous sensors. Lidar, whose physical characteristics differ fundamentally from those of an RGB camera, can keep a recognition system accurate in adverse weather and under extreme lighting, making it the natural choice for fusion schemes in precision-sensitive applications.

In this thesis, we introduce a deep learning model for point cloud segmentation that fuses Lidar with an RGB camera, yet requires only Lidar input at inference time, with no RGB images and no calibration between the two sensors. We propose the 2D Detection Annotations Transmittable Aggregation network (2DDATA), which modifies the distillation module of 2D Priors Assisted Semantic Segmentation (2DPASS) to incorporate 2D bounding boxes, and replaces the image feature extractor with a pretrained semantic segmentation model. Without adding any inference overhead, the model substantially outperforms SPVCNN, demonstrating the extensibility of generalizing 2D information to 3D segmentation models.
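For context, fusing the two modalities at training time relies on projecting each Lidar point into the camera image plane through the sensors' calibration, the dependency that 2DDATA drops at inference. The following is a minimal sketch of that standard projection step, assuming dataset-provided calibration matrices; the names project_points_to_image, T_cam_lidar, and K are illustrative placeholders, not code from the thesis:

    import numpy as np

    def project_points_to_image(points, T_cam_lidar, K):
        """Project Lidar points (N, 3) to pixel coordinates.

        T_cam_lidar: (4, 4) extrinsic matrix, Lidar frame -> camera frame.
        K: (3, 3) camera intrinsic matrix.
        Returns (M, 2) pixel coordinates and the mask of points kept.
        """
        # Homogeneous coordinates, shape (N, 4).
        pts_h = np.hstack([points, np.ones((points.shape[0], 1))])
        # Transform into the camera frame; keep points in front of the camera.
        cam = (T_cam_lidar @ pts_h.T).T[:, :3]
        in_front = cam[:, 2] > 0
        cam = cam[in_front]
        # Perspective projection: apply intrinsics, then divide by depth.
        uv = (K @ cam.T).T
        return uv[:, :2] / uv[:, 2:3], in_front

During training, these pixel coordinates tell each 3D point which image features, or which 2D bounding box, it falls under; at inference the 2D side serves only as past supervision, so no projection or calibration is needed.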

Recently, multi-modality models have been introduced because of the rapid development of automation systems such as robotics and autonomous driving. Traditional computer vision technology may not satisfy the safety requirements of these systems, whereas the complementary information from different sensors, such as Lidar and cameras, provides the ability to perform high-precision recognition. Sensor fusion solutions focus on combining different sensors: they alleviate the disadvantages of stand-alone camera systems and introduce Lidar to make up for the camera's limitations in extreme weather and low-light environments. In this thesis, we introduce a deep learning model for point cloud segmentation that fuses Lidar and camera data, where only Lidar is needed during inference. We propose the 2D Detection Annotations Transmittable Aggregation (2DDATA) network, replacing the distillation module of the 2D Priors Assisted Semantic Segmentation (2DPASS) model with a customized local object branch that cooperates with 2D bounding boxes. We also replace the image feature extractor with a 2D semantic segmentation model. The results show that we outperform SPVCNN without adding extra inference cost, as in 2DPASS. 2DDATA is one of the strongest competitors on the point cloud segmentation task, proving the feasibility of fusing large multi-modality models with modality-specific data.
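The abstract's claim of zero extra inference cost follows from the 2D branch being active only during training. Below is a toy sketch of that train/inference asymmetry in PyTorch; the module names (lidar_backbone, image_branch, head) are generic placeholders, not the actual 2DDATA components:

    import torch.nn as nn

    class TrainOnlyFusionSeg(nn.Module):
        """A 2D branch that supervises training but is skipped at inference."""

        def __init__(self, lidar_backbone, head, image_branch):
            super().__init__()
            self.lidar_backbone = lidar_backbone  # 3D network, always used
            self.head = head                      # per-point classifier
            self.image_branch = image_branch      # 2D network, training only

        def forward(self, points, image=None):
            # The Lidar path is the entire inference-time model.
            logits = self.head(self.lidar_backbone(points))
            if self.training and image is not None:
                # During training, the 2D branch yields auxiliary predictions
                # that supervise the 3D features (e.g. through a distillation
                # or bounding-box consistency loss); dropping the branch at
                # inference leaves the deployed cost unchanged.
                return logits, self.image_branch(image)
            return logits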

Abstract (Chinese)
Abstract
Acknowledgements
Contents
List of Tables
List of Figures
Chapter 1 Introduction
    1.1 Research Background
    1.2 Motivations
    1.3 Thesis Organization
Chapter 2 Related Work
    2.1 Pure Lidar Solutions
    2.2 Multi-Sensor Solutions
Chapter 3 The Proposed 2DDATA Model
    3.1 Coordinate Transformation and Rasterization
    3.2 2DPASS Overview
        3.2.1 Feature Extraction
        3.2.2 Modality Fusion
        3.2.3 Multi-Scale Fusion-to-Single Knowledge Distillation (MSFSKD)
        3.2.4 Loss
    3.3 2DDATA Overview
        3.3.1 Local Object Branch
        3.3.2 Box-selected Features
        3.3.3 Box Embedding and Points Embedding
        3.3.4 Class Aware Attention
        3.3.5 Loss
Chapter 4 Experiment Results
    4.1 Environment Setup and Training Settings
    4.2 Dataset
        4.2.2 Point Cloud and Image Mismatch Problem
    4.3 Experiments
        4.3.1 Experiment Results and Analysis
        4.3.2 Comparison to 2DPASS
        4.3.3 Ablation Studies
Chapter 5 Conclusions
Chapter 6 Future Work
References

[1] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jürgen Gall. SemanticKITTI: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9297–9307, 2019.
[2] Maxim Berman, Amal Rannen Triki, and Matthew B. Blaschko. The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4413–4421, 2018.
[3] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.
[4] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. arXiv preprint arXiv:2112.01527, 2022.
[5] Haotian Tang, Zhijian Liu, Shengyu Zhao, Yujun Lin, Ji Lin, Hanrui Wang, and Song Han. Searching efficient 3d architectures with sparse point-voxel convolution. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII, pages 685–702. Springer, 2020.
[6] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding, 2016.
[7] Kyle Genova, Xiaoqi Yin, Abhijit Kundu, Caroline Pantofaru, Forrester Cole, Avneesh Sud, Brian Brewington, Brian Shucker, and Thomas Funkhouser. Learning 3d semantic segmentation with only 2d image supervision, 2021.
[8] Lingdong Kong, Youquan Liu, Runnan Chen, Yuexin Ma, Xinge Zhu, Yikang Li, Yuenan Hou, Yu Qiao, and Ziwei Liu. Rethinking range view representation for lidar segmentation. arXiv preprint arXiv:2303.05367, 2023.
[9] Xin Lai, Yukang Chen, Fanbin Lu, Jianhui Liu, and Jiaya Jia. Spherical transformer for lidar-based 3d recognition. arXiv preprint arXiv:2303.12766, 2023.
[10] Xin Li, Tao Ma, Yuenan Hou, Botian Shi, Yucheng Yang, Youquan Liu, Xingjiao Wu, Qin Chen, Yikang Li, Yu Qiao, et al. LoGoNet: Towards accurate 3d object detection with local-to-global cross-modal fusion. arXiv preprint arXiv:2303.03595, 2023.
[11] Yingwei Li, Adams Wei Yu, Tianjian Meng, Ben Caine, Jiquan Ngiam, Daiyi Peng, Junyang Shen, Yifeng Lu, Denny Zhou, Quoc V. Le, et al. DeepFusion: Lidar-camera deep fusion for multi-modal 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17182–17191, 2022.
[12] Tingting Liang, Hongwei Xie, Kaicheng Yu, Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Tao Tang, Bing Wang, and Zhi Tang. BEVFusion: A simple and robust lidar-camera fusion framework. arXiv preprint arXiv:2205.13790, 2022.
[13] Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, and Song Han. BEVFusion: Multi-task multi-sensor fusion with unified bird's-eye view representation. arXiv preprint arXiv:2205.13542, 2022.
[14] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[15] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
[16] Haibo Qiu, Baosheng Yu, and Dacheng Tao. GFNet: Geometric flow network for 3d point cloud semantic segmentation. arXiv preprint arXiv:2207.02605, 2022.
[17] Ran Cheng, Ryan Razani, Ehsan Taghavi, Enxu Li, and Bingbing Liu. (AF)2-S3Net: Attentive feature fusion with adaptive feature selection for sparse semantic segmentation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12547–12556, 2021.
[18] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. HuggingFace's Transformers: State-of-the-art natural language processing, 2020.
[19] Xu Yan, Jiantao Gao, Chaoda Zheng, Chao Zheng, Ruimao Zhang, Shuguang Cui, and Zhen Li. 2DPASS: 2d priors assisted semantic segmentation on lidar point clouds. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII, pages 677–695. Springer, 2022.
[20] Wei Jong Yang and Guan Cheng Lee. Addressing data misalignment in image-lidar fusion on point cloud segmentation. arXiv preprint arXiv:2309.14932, 2023.
[21] Maosheng Ye, Rui Wan, Shuangjie Xu, Tongyi Cao, and Qifeng Chen. Efficient point cloud segmentation with geometry-aware sparse networks. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIX, pages 196–212. Springer, 2022.
[22] Yang Zhang, Zixiang Zhou, Philip David, Xiangyu Yue, Zerong Xi, Boqing Gong, and Hassan Foroosh. PolarNet: An improved grid representation for online lidar point clouds semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9601–9610, 2020.
[23] Xinge Zhu, Hui Zhou, Tai Wang, Fangzhou Hong, Yuexin Ma, Wei Li, Hongsheng Li, and Dahua Lin. Cylindrical and asymmetrical 3d convolution networks for lidar segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9939–9948, 2021.
[24] Zhuangwei Zhuang, Rong Li, Kui Jia, Qicheng Wang, Yuanqing Li, and Mingkui Tan. Perception-aware multi-sensor fusion for 3d lidar semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
