
Graduate Student: Tu, Chuan-Cheng (涂全成)
Thesis Title: Monocular Depth Estimation Network with Semantic Information (利用語意資訊之單視角深度估計網路)
Advisor: Yang, Jar-Ferr (楊家輝)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Computer & Communication Engineering
Publication Year: 2022
Graduation Academic Year: 110 (ROC calendar, 2021-2022)
Language: English
Number of Pages: 50
Chinese Keywords: 深度學習, 深度圖, 單視角深度估計, 語意資訊, 特徵融合
English Keywords: deep learning, depth maps, monocular depth estimation, semantic information, feature fusion
Usage Statistics: 144 views, 1 download

Abstract: With the rapid development of deep learning in recent years, more and more researchers have applied it to computer vision. One popular application is 2D-to-3D conversion, where the depth map is an extremely important carrier of 3D information, so a method that can generate accurate depth maps is particularly valuable. Monocular depth estimation generates a depth map that encodes near-far relationships from only a single input image. It is more difficult than stereo matching; in particular, depth details and object edges tend to be blurred, which degrades the viewing quality of 2D-to-3D conversion. To solve this problem, we propose a semantic-aware monocular depth estimation network that contains a depth branch and a semantic branch. We further propose a three-stage training strategy and a cross feature compensation module that fuses the features of the two different tasks, so that depth and semantic features are fully integrated. By exploiting semantic features, which carry rich edge information, the predicted depth maps also obtain sharp edges. Finally, the predictions of the proposed network are compared with existing methods on indoor and outdoor scenes; the experimental results show that the proposed method is competitive with existing approaches.
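Although this record contains no code, the two-branch design described in the abstract can be sketched in a few lines of PyTorch. The sketch below is a minimal illustration only: the module names, the single-convolution encoder stand-in, the channel sizes, the gating-style fusion inside the cross feature compensation module, and the 19-class semantic head are all assumptions made for illustration, not the thesis implementation.

```python
# Minimal sketch of a semantic-aware two-branch depth network (assumed design,
# not the thesis implementation). Channel sizes and fusion details are illustrative.
import torch
import torch.nn as nn

class CrossFeatureCompensation(nn.Module):
    """Hypothetical fusion block: each branch's features are re-weighted by a
    gate computed from the other branch, then added back as compensation."""
    def __init__(self, channels):
        super().__init__()
        self.gate_d = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_s = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, feat_depth, feat_sem):
        # Semantic features (rich in edges) compensate the depth features, and vice versa.
        fused_depth = feat_depth + self.gate_s(feat_sem) * feat_sem
        fused_sem = feat_sem + self.gate_d(feat_depth) * feat_depth
        return fused_depth, fused_sem

class TwoBranchDepthNet(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.encoder = nn.Sequential(  # stand-in for a ResNet-style shared encoder
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.depth_dec = nn.Conv2d(channels, channels, 3, padding=1)
        self.sem_dec = nn.Conv2d(channels, channels, 3, padding=1)
        self.cfcm = CrossFeatureCompensation(channels)
        self.depth_head = nn.Conv2d(channels, 1, 3, padding=1)   # one depth channel
        self.sem_head = nn.Conv2d(channels, 19, 3, padding=1)    # e.g. 19 Cityscapes classes

    def forward(self, x):
        shared = self.encoder(x)
        d, s = self.depth_dec(shared), self.sem_dec(shared)
        d, s = self.cfcm(d, s)
        return torch.sigmoid(self.depth_head(d)), self.sem_head(s)

# Usage: depth, seg = TwoBranchDepthNet()(torch.randn(1, 3, 192, 640))
```

The three-stage training strategy mentioned in the abstract (whose exact stages are not detailed in this record) would then govern which of these modules are updated at each stage, for example by freezing one branch while the other trains before joint fine-tuning.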

Table of Contents:
Abstract (Chinese) I
Abstract (English) II
Acknowledgements III
Contents IV
List of Tables VII
List of Figures VIII
Chapter 1 Introduction 1
  1.1 Research Background 1
  1.2 Motivations 2
  1.3 Thesis Organization 4
Chapter 2 Related Work 5
  2.1 Depth Estimation 5
  2.2 Multi-task Learning Schemes 6
  2.3 Feature Fusion Module 7
  2.4 Dilated Convolution 8
  2.5 Atrous Spatial Pyramid Pooling 9
  2.6 Squeeze and Excitation Network 10
  2.7 Stripe Refinement Module 11
Chapter 3 The Proposed Semantic-aware Monocular Depth Estimation Network 13
  3.1 Overview of the Proposed SMDEN 14
  3.2 Network Architecture 15
    3.2.1 Shared Encoder 15
    3.2.2 Two-Branch Decoder 17
      A. Depth Decoder 18
      B. Semantic Decoder 19
      C. Shared Partial Decoder 20
    3.2.3 Boundary Emphasize Module 21
    3.2.4 Cross Feature Compensation Module 23
  3.3 Training Strategy 26
  3.4 Loss Function 26
    3.4.1 Semantic Loss 27
    3.4.2 Depth Loss 27
      A. Multi-scale L1 Loss 27
      B. Smoothness Loss 28
      C. Structural Similarity Loss 28
    3.4.3 Cross Domain Discontinuity Loss 29
Chapter 4 Experimental Results 30
  4.1 Environmental Settings 30
  4.2 Datasets and Training Detail 31
    4.2.1 Datasets 31
    4.2.2 Detailed Training Procedure 32
      A. Data Augmentation 32
      B. Training Settings 32
    4.2.3 Statistical Indicators 34
  4.3 Ablation Study 35
    4.3.1 Verification of Cross Feature Compensation Module (CFCM) 35
    4.3.2 Verification of Structure 37
    4.3.3 Verification of Feature Fusion Module 37
  4.4 Comparison with other Approaches 39
    4.4.1 Comparisons on KITTI Dataset 39
    4.4.2 Comparisons on NYU Depth v2 Dataset 41
  4.5 Predicted Auxiliary Semantic Segmented Map 43
Chapter 5 Conclusions 45
Chapter 6 Future Work 46
References 47
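The table of contents above lists the depth-branch losses as a multi-scale L1 loss, a smoothness loss, and a structural similarity (SSIM) loss. As a rough sketch under standard formulations from the monocular depth estimation literature (the thesis's exact definitions and weights are not given in this record; the function names and the weights w_smooth and w_ssim below are illustrative), such a combined loss might look like:

```python
# Sketch of a combined depth loss (multi-scale L1 + edge-aware smoothness + SSIM).
# Formulations follow common practice in the depth estimation literature; the
# thesis's exact definitions and weights are not given in this record.
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified per-pixel SSIM over 3x3 neighborhoods (inputs assumed in [0, 1])."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return num / den

def depth_loss(pred_scales, gt, image, w_smooth=0.1, w_ssim=0.85):
    """pred_scales: predicted depth maps ordered from full resolution downward;
    gt: full-resolution ground-truth depth; image: the input RGB image."""
    loss = 0.0
    for pred in pred_scales:  # multi-scale L1 term
        gt_s = F.interpolate(gt, size=pred.shape[-2:], mode="nearest")
        loss = loss + F.l1_loss(pred, gt_s)
    pred = pred_scales[0]  # full-resolution prediction for the remaining terms
    # Edge-aware smoothness: penalize depth gradients, downweighted at image edges.
    dx_d = (pred[..., :, 1:] - pred[..., :, :-1]).abs()
    dy_d = (pred[..., 1:, :] - pred[..., :-1, :]).abs()
    dx_i = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True)
    smooth = (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
    # Structural similarity term, mapped to a [0, 1] dissimilarity.
    ssim_term = ((1 - ssim(pred, gt)) / 2).mean()
    return loss + w_smooth * smooth + w_ssim * ssim_term
```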


Full-text Availability: on campus 2024-08-01; off campus 2024-08-01