
Graduate Student: Hsu, Chih-Yue (許至岳)
Thesis Title: An Automatic Stereoview Generation Network by Spatial Transformers (利用空間轉換之全自動雙視角生成網路)
Advisor: Yang, Jar-Ferr (楊家輝)
Degree: Master
Department: Institute of Computer & Communication Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2022
Graduation Academic Year: 110 (2021-2022)
Language: English
Number of Pages: 46
Keywords: deep learning, convolutional neural networks, stereoview generation network, 3D video, temporal refinement module
Chinese Abstract (translated):
    In computer vision tasks, the generation of stereoscopic images has long been an important and widely applied topic. With the rise of the Metaverse, applications such as virtual reality (VR) are becoming increasingly important. The rapid development of deep learning in recent years has allowed neural networks to break through the bottlenecks of earlier traditional algorithms and open entirely new directions for related research. Because high-accuracy depth information is difficult to obtain and a single-view image does not contain enough reference information to predict a good stereoview image, we propose an end-to-end, fully automatic stereoview generation network built on a deep learning framework. Based on an unsupervised monocular disparity estimation (UMDE) backbone, we use a spatial transformer module to process the image under the premise of epipolar geometry, and we propose a temporal refinement module (TRM) to capture information missing from the single-view input. In addition, we train the network with loss functions that match human perception, in order to address the blurred or broken images that pixel-wise loss functions can produce. As the experimental results show, the proposed system can effectively generate correct stereoview images without reference to depth ground truth, which greatly reduces the time, equipment, and labor costs of acquiring high-precision depth maps.
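    To make the role of the spatial transformer (bilinear sampler) module concrete, the following is a minimal sketch, assuming a PyTorch-style implementation and rectified stereo pairs so that epipolar lines are horizontal; the function name and the normalized-disparity convention are illustrative assumptions, not the thesis's actual code.

        import torch
        import torch.nn.functional as F

        def warp_with_disparity(image, disparity):
            """Warp `image` (N, C, H, W) horizontally by `disparity` (N, 1, H, W),
            given as a fraction of the image width, using bilinear sampling."""
            n, _, h, w = image.shape
            # Base sampling grid in the normalized [-1, 1] coordinates used by grid_sample.
            ys, xs = torch.meshgrid(
                torch.linspace(-1.0, 1.0, h, device=image.device),
                torch.linspace(-1.0, 1.0, w, device=image.device),
                indexing="ij",
            )
            grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1).clone()
            # Shift only the x coordinate: under rectified epipolar geometry, corresponding
            # points move along horizontal lines, so a 1D disparity map is sufficient.
            grid[..., 0] = grid[..., 0] + 2.0 * disparity.squeeze(1)
            return F.grid_sample(image, grid, mode="bilinear",
                                 padding_mode="border", align_corners=True)

        # Example: synthesize a right view from a left view and a predicted disparity map.
        left = torch.rand(1, 3, 128, 256)
        disparity = torch.rand(1, 1, 128, 256) * 0.05
        right_hat = warp_with_disparity(left, disparity)

    Because the warp is fully differentiable, the disparity estimator can be trained end-to-end by comparing the warped view against the other camera view, which is what lets the network learn without depth ground truth.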

English Abstract:
    In computer vision, stereoscopic image generation has long been a vital and widely applicable topic. Applications such as virtual reality (VR) have become increasingly important as the Metaverse develops. With the rapid development of deep learning in recent years, neural networks have broken through the bottlenecks of traditional algorithms and opened new directions for related research. Given the difficulty of obtaining high-precision depth maps and the lack of reference information in a single-view image for predicting good stereoview images, we follow a deep learning framework and propose an end-to-end, fully automatic stereoview generation network. Based on an unsupervised monocular disparity estimation (UMDE) backbone, we use a spatial transformer module to process the input image under the premise of epipolar geometry, and we propose a temporal refinement module (TRM) to capture the information missing from the monocular input. Furthermore, we introduce perceptually motivated loss functions to avoid the blurred and shattered results caused by pixel-wise loss functions. The experimental results show that the proposed system can effectively generate correct stereoview images without referring to depth ground truth, which greatly reduces the cost of obtaining high-quality depth maps.
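    The four loss terms listed in Section 3.5 (reconstruction, structure, left-right consistency, and perceptual) could be combined roughly as in the sketch below, again assuming a PyTorch-style implementation; the weights, the simplified single-scale SSIM, the VGG-16 feature layer, and the interface (disp_right_warped meaning the right disparity map warped into the left view) are illustrative assumptions rather than the settings used in the thesis.

        import torch
        import torch.nn.functional as F
        from torchvision import models

        # Assumed perceptual feature extractor; in practice ImageNet-pretrained weights would be loaded.
        vgg_features = models.vgg16().features[:16].eval()

        def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
            # Simplified single-scale SSIM computed from 3x3 average-pooled local statistics.
            mu_x, mu_y = F.avg_pool2d(x, 3, 1), F.avg_pool2d(y, 3, 1)
            sigma_x = F.avg_pool2d(x * x, 3, 1) - mu_x ** 2
            sigma_y = F.avg_pool2d(y * y, 3, 1) - mu_y ** 2
            sigma_xy = F.avg_pool2d(x * y, 3, 1) - mu_x * mu_y
            num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
            den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
            return (num / den).clamp(0, 1).mean()

        def total_loss(right_hat, right, disp_left, disp_right_warped,
                       weights=(1.0, 1.0, 1.0, 0.1)):
            reconstruction = F.l1_loss(right_hat, right)                 # pixel-wise reconstruction loss
            structure = 1.0 - ssim(right_hat, right)                     # structure (SSIM-based) loss
            lr_consistency = F.l1_loss(disp_left, disp_right_warped)     # left-right consistency loss
            with torch.no_grad():
                target_features = vgg_features(right)                    # fixed target features
            perceptual = F.l1_loss(vgg_features(right_hat), target_features)  # perceptual loss
            return (weights[0] * reconstruction + weights[1] * structure
                    + weights[2] * lr_consistency + weights[3] * perceptual)

    The structure and perceptual terms are what pull the output toward human perception, whereas the pixel-wise reconstruction term alone tends to yield the blurred or shattered results mentioned above.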

    Table of Contents:
    Abstract (Chinese)  I
    Abstract (English)  II
    Acknowledgements  III
    Contents  IV
    List of Tables  VII
    List of Figures  VIII
    Chapter 1  Introduction  1
      1.1  Research Background  1
      1.2  Motivations  3
      1.3  Thesis Organization  4
    Chapter 2  Related Work  6
      2.1  Supervised Monocular Depth Estimation  6
      2.2  Unsupervised Monocular Depth Estimation  8
      2.3  Spatial Transformer Network  9
      2.4  2D-to-3D Transform Networks  11
      2.5  Atrous Spatial Pyramid Pooling  12
    Chapter 3  The Proposed Stereoview Generation System  14
      3.1  Overview of the Proposed Network  15
      3.2  Backbone Network  16
      3.3  Bilinear Sampler Module  22
      3.4  Temporal Refinement Module  23
      3.5  Training Loss  24
        3.5.1  Reconstruction Loss  24
        3.5.2  Structure Loss  25
        3.5.3  Left-right Consistency Loss  26
        3.5.4  Perceptual Loss  26
    Chapter 4  Experimental Results  27
      4.1  Environment Settings and Dataset  27
      4.2  Training Details  28
      4.3  Experimental Results  29
        4.3.1  Evaluation Metrics  29
        4.3.2  Verification of Temporal Refinement Module  32
        4.3.3  Ablation Study of Queue Buffer Length  34
        4.3.4  Influences of Different Loss Functions  36
      4.4  Depth-like Intermediate Product  38
      4.5  Failure Case  39
    Chapter 5  Conclusions  41
    Chapter 6  Future Work  42
    References  43


    Full-text availability: on campus from 2024-08-01; off campus from 2024-08-01