
Graduate Student: Tsung, Wan-Nung (宗宛儂)
Thesis Title: Temporal Monocular Depth Estimation Network with Depth Edge Guidance and Temporal Feature Blocks for Autonomous Driving
Advisors: Chen, Chin-Hsing (陳進興); Yang, Jar-Ferr (楊家輝)
Degree: Master
Department: Institute of Computer & Communication Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2020
Graduation Academic Year: 108 (ROC calendar)
Language: English
Number of Pages: 51
Keywords (Chinese, translated): monocular depth estimation; deep learning; temporal feature fusion; 3D video; autonomous driving; depth edge-guided sub-network; convolutional neural networks
Keywords (English): monocular depth estimation, deep learning, temporal feature blocks, 3D video, autonomous driving, depth edge-guided network, convolutional neural networks
    With advances in technology, depth estimation has become an important topic in computer vision, with applications in many fields such as 3D movies, AR, VR, robot control, and autonomous driving. Traditionally, depth information is obtained from two-view images through stereo matching. In recent years, however, the rapid development of neural networks has opened new research directions in depth estimation and broken through the bottlenecks of earlier traditional algorithms. Based on deep learning, neural networks with encoder and decoder structures can successfully solve complicated depth estimation problems. To achieve accurate and fast depth estimation, this thesis proposes a semi-unsupervised monocular depth estimation (MDE) network with temporal feature fusion, into which we introduce two auxiliary sub-networks, one in the encoder and one in the decoder. First, we add a depth edge guided (DEG) auxiliary network in the decoder to refine the blurry depth contours of foreground objects. Then, we propose two types of temporal feature blocks (TFB) in the encoder to enhance the accuracy of the depth maps through the correlation of consecutive feature maps in the video. With our network design and training strategy, simulation results show that the proposed MDE network with the depth edge guided network and temporal feature blocks achieves better performance than other MDE methods.

    Depth estimation, a key component of computer vision, is needed in many applications such as 3D video presentation, robot control, and autonomous driving. With deep learning technologies, convolutional neural networks with encoder and decoder structures have been used to successfully solve complicated depth estimation problems. To achieve accurate depth estimation, in this thesis we propose a semi-unsupervised temporal monocular depth estimation (MDE) network by introducing two supporting networks. First, we present a depth edge guided network in the decoder to refine the blurry depth contours of foreground objects. Then, we propose two types of temporal feature blocks in the encoder to strengthen depth accuracy through the correlation of consecutive feature maps across video frames. With the suggested training strategy, simulation results show that the proposed temporal MDE network with the depth edge guided network and temporal feature blocks achieves better performance than other MDE methods.
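    The record gives no implementation details for the two supporting networks, but the architecture described above can be sketched. The following is a minimal PyTorch sketch, under assumed shapes and module designs, of how an hourglass-style encoder-decoder could host a temporal feature block (TFB) in the encoder and a depth edge guided (DEG) branch in the decoder. All class names, channel widths, and the gated-blend TFB design are illustrative assumptions, not the thesis implementation.

```python
# A minimal sketch (not the thesis implementation): an encoder-decoder
# monocular depth network with a hypothetical TFB in the encoder and a
# hypothetical DEG branch in the decoder. Module designs are assumptions.
import torch
import torch.nn as nn

class TemporalFeatureBlock(nn.Module):
    """Fuses the current encoder features with the previous frame's
    features via a learned gate; the thesis proposes two TFB designs
    whose exact structure is not given in this record."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, feat, prev_feat):
        g = self.gate(torch.cat([feat, prev_feat], dim=1))
        return g * feat + (1 - g) * prev_feat  # temporally blended features

class DepthEdgeGuide(nn.Module):
    """Hypothetical DEG branch: predicts a depth-edge map and uses it to
    re-weight decoder features near object contours."""
    def __init__(self, channels):
        super().__init__()
        self.edge_head = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, feat):
        edge = torch.sigmoid(self.edge_head(feat))
        return feat * (1 + edge), edge  # emphasize features at depth edges

class TemporalMDENet(nn.Module):
    """Two-level hourglass skeleton: TFB after the encoder bottleneck,
    DEG inside the decoder, sigmoid disparity output."""
    def __init__(self, ch=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ELU())
        self.enc2 = nn.Sequential(nn.Conv2d(ch, 2 * ch, 3, stride=2, padding=1), nn.ELU())
        self.tfb = TemporalFeatureBlock(2 * ch)
        self.up1 = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ELU())
        self.deg = DepthEdgeGuide(ch)
        self.up2 = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, frame, prev_feat=None):
        f = self.enc2(self.enc1(frame))
        if prev_feat is not None:
            f = self.tfb(f, prev_feat)        # temporal fusion in encoder
        d, edge = self.deg(self.up1(f))       # edge guidance in decoder
        return self.up2(d), edge, f.detach()  # disparity, edge map, features

# Usage over consecutive frames: the previous bottleneck features are
# carried forward so the TFB can exploit inter-frame correlation.
net = TemporalMDENet()
prev = None
for _ in range(3):
    disp, edge, prev = net(torch.rand(1, 3, 64, 64), prev)
print(disp.shape)  # torch.Size([1, 1, 64, 64])
```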

    Abstract (Chinese) II
    Abstract II
    Acknowledgements III
    Contents IV
    List of Tables VI
    List of Figures VII
    Chapter 1 Introduction 1
      1.1 Research Background 1
      1.2 Motivations 3
      1.3 Thesis Organization 5
    Chapter 2 Related Work 6
      2.1 Depth Estimation with Stereo Matching 7
      2.2 Supervised Single Image Depth Estimation 8
      2.3 Unsupervised Depth Estimation 8
      2.4 U-net 10
      2.5 Non-local Neural Network 11
      2.6 Squeeze and Excitation Network 12
    Chapter 3 The Proposed Temporal Monocular Depth Estimation System 13
      3.1 Overview of the Proposed System 14
      3.2 Hourglass Backbone Network 15
      3.3 Depth Edge Guided (DEG) Network 18
      3.4 Temporal Feature Blocks (TFB) 19
      3.5 Training Loss 24
        3.5.1 Appearance Matching Loss 25
        3.5.2 Disparity Smoothness Loss 25
        3.5.3 Left-Right Disparity Consistency Loss 26
        3.5.4 Depth Edge Feature Loss 26
    Chapter 4 Network Training and Experimental Results 27
      4.1 Training Procedure 27
      4.2 Experiments for DEG Network 28
      4.3 Experiments for TFB Designs 31
      4.4 Experiments for CWB Designs 41
    Chapter 5 Conclusions 42
    Chapter 6 Future Work 43
    References 44
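    Section 3.5 of the contents above names four training losses. As a hedged illustration only, the sketch below combines single-scale versions of the first three in the style commonly used for left-right-consistency MDE training, plus a placeholder for the depth edge feature loss; the exact formulations and weights used in the thesis are not given in this record, so `lam_sm`, `lam_lr`, `lam_edge`, and the `edge_gt` supervision target are assumptions.

```python
# Hedged sketch of the combined training loss named in Section 3.5.
# Formulations follow common unsupervised-MDE practice, not the thesis text.
import torch
import torch.nn.functional as F

def appearance_matching(recon, target, alpha=0.85):
    """SSIM + L1 photometric loss between a view reconstructed by warping
    and the real view (simplified single-scale SSIM, 3x3 mean windows)."""
    mu_x = F.avg_pool2d(recon, 3, 1, 1)
    mu_y = F.avg_pool2d(target, 3, 1, 1)
    sig_x = F.avg_pool2d(recon * recon, 3, 1, 1) - mu_x ** 2
    sig_y = F.avg_pool2d(target * target, 3, 1, 1) - mu_y ** 2
    sig_xy = F.avg_pool2d(recon * target, 3, 1, 1) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + 1e-4) * (2 * sig_xy + 9e-4)) / \
           ((mu_x ** 2 + mu_y ** 2 + 1e-4) * (sig_x + sig_y + 9e-4))
    ssim_loss = torch.clamp((1 - ssim) / 2, 0, 1).mean()
    return alpha * ssim_loss + (1 - alpha) * (recon - target).abs().mean()

def disparity_smoothness(disp, image):
    """Edge-aware smoothness: penalize disparity gradients except where
    the image itself has strong gradients."""
    dx = (disp[..., :, 1:] - disp[..., :, :-1]).abs()
    dy = (disp[..., 1:, :] - disp[..., :-1, :]).abs()
    wx = torch.exp(-(image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True))
    wy = torch.exp(-(image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True))
    return (dx * wx).mean() + (dy * wy).mean()

def total_loss(recon_l, img_l, disp_l, disp_r_warped, edge_pred, edge_gt,
               lam_sm=0.1, lam_lr=1.0, lam_edge=0.1):
    """Weighted sum of the four loss terms; all weights are placeholders."""
    lr_consistency = (disp_l - disp_r_warped).abs().mean()  # left-right check
    edge_loss = F.binary_cross_entropy(edge_pred, edge_gt)  # DEG supervision
    return (appearance_matching(recon_l, img_l)
            + lam_sm * disparity_smoothness(disp_l, img_l)
            + lam_lr * lr_consistency
            + lam_edge * edge_loss)
```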


    Full-text availability: on campus, available from 2025-07-20; off campus, not available. The electronic thesis has not been authorized for public release; please consult the library catalog for the print copy.