| Graduate Student: | Tu, Chuan-Cheng (涂全成) |
|---|---|
| Thesis Title: | Monocular Depth Estimation Network with Semantic Information (利用語意資訊之單視角深度估計網路) |
| Advisor: | Yang, Jar-Ferr (楊家輝) |
| Degree: | Master |
| Department: | Institute of Computer & Communication Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication: | 2022 |
| Academic Year of Graduation: | 110 (2021–2022) |
| Language: | English |
| Number of Pages: | 50 |
| Keywords: | deep learning, depth maps, monocular depth estimation, semantic information, feature fusion |
With the rapid development of deep learning in recent years, more and more researchers are applying it to computer vision tasks. For instance, 2D-to-3D conversion is a popular field in which the depth map is an essential piece of 3D information, so a method that can generate accurate depth maps is especially important. Monocular depth estimation generates a depth map, which encodes the near-far relationships in the scene, from only a single input image. Compared with stereo matching methods, this is more difficult; in particular, depth details and object edges tend to be blurred, which degrades the quality of 2D-to-3D conversion results. To solve this problem, we propose a semantic-aware monocular depth estimation network that contains a depth branch and a semantic branch. We further propose a three-stage training strategy and a cross feature compensation module that fuses the features of the two different tasks, so that depth and semantic features are fully integrated. The idea is to exploit the rich edge information in the semantic features so that the predicted depth maps also have sharp edges. Finally, we compare the predictions of the proposed network with existing methods on indoor and outdoor scenes; the experimental results show that the proposed method is effective and competitive with existing methods.
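The abstract describes the cross feature compensation module only at a high level, as a block that lets depth-branch and semantic-branch features compensate each other before being fused. The PyTorch sketch below is one plausible reading of such a module, not the thesis's actual design: the class name, the 1x1 fusion convolution, and the per-branch sigmoid gates are all assumptions introduced for illustration.

```python
import torch
import torch.nn as nn

class CrossFeatureCompensation(nn.Module):
    """Hypothetical sketch of a cross feature compensation module.

    Features from the depth branch and the semantic branch are fused,
    and each branch receives a gated residual "compensation" term
    computed from the joint representation. The exact design in the
    thesis is not specified in the abstract; this is an illustration.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Fuse the concatenated branch features with a 1x1 convolution.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Per-branch gates decide how much fused information flows back.
        self.gate_depth = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid()
        )
        self.gate_sem = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid()
        )

    def forward(self, f_depth: torch.Tensor, f_sem: torch.Tensor):
        # Joint representation of both tasks at this resolution.
        fused = self.fuse(torch.cat([f_depth, f_sem], dim=1))
        # Residual compensation: each branch keeps its own features and
        # adds a gated share of the cross-task information.
        f_depth = f_depth + self.gate_depth(fused) * fused
        f_sem = f_sem + self.gate_sem(fused) * fused
        return f_depth, f_sem

if __name__ == "__main__":
    cfc = CrossFeatureCompensation(channels=64)
    d = torch.randn(1, 64, 32, 32)   # depth-branch feature map
    s = torch.randn(1, 64, 32, 32)   # semantic-branch feature map
    d2, s2 = cfc(d, s)
    print(d2.shape, s2.shape)        # both torch.Size([1, 64, 32, 32])
```

Under this reading, the residual form lets each branch retain its task-specific features while the gate controls how much cross-task information flows back, which matches the abstract's stated goal of borrowing edge information from the semantic features to sharpen the predicted depth map.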