
Author: Yao, De-Wei (姚德威)
Thesis title: A Semi-Supervised Monocular Video Depth Generation System for Precise 2D-to-3D Conversion (應用於精確2D轉3D視訊之半監督式單視角深度生成系統)
Advisor: Yang, Jar-Ferr (楊家輝)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Institute of Computer & Communication Engineering
Year of publication: 2020
Academic year of graduation: 108 (2019-2020)
Language: English
Number of pages: 69
Keywords: depth maps, deep learning, monocular depth generation, keyframes, convolutional neural networks
Abstract:
    With the development of 3D technology, more and more 3D applications and services are being introduced into our daily life. To avoid transmitting complex multi-view images, 3D content can be efficiently represented by 2D texture images together with their corresponding depth maps. To further promote 3D applications, it is important to be able to efficiently convert the large number of existing single-view 2D videos into multi-view 3D videos. Generating high-quality depth maps for monocular images is therefore a particularly important research topic. In current computer vision research, monocular depth generation algorithms, whether based on traditional image processing or on deep learning, are often limited to specific scenes and lack the adaptability to remain robust under different conditions. To solve this problem, this thesis proposes a semi-supervised monocular video depth generation (MVDG) system. By clustering the video frames in a feature space, representative keyframes are extracted and their depth maps are labeled. With the labeled keyframes, a convolutional neural network then generates the depth maps for all unlabeled video frames. Experimental results on public datasets and on an actual 2D film show that the proposed MVDG system generates more accurate depth maps with far fewer labeled keyframes than other semi-automatic depth generation systems.
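
    As a rough illustration of the keyframe-selection step described above, the following is a minimal sketch, assuming generic per-frame feature embeddings and plain k-means clustering; the thesis's actual feature converter and clustering module are more elaborate, and the function and variable names here (select_keyframes, features, num_keyframes) are hypothetical.

    # Minimal sketch of feature-space keyframe selection, assuming generic per-frame
    # embeddings and plain k-means; all names below are hypothetical illustrations.
    import numpy as np
    from sklearn.cluster import KMeans

    def select_keyframes(features, num_keyframes):
        """Return indices of the frames nearest to each k-means centroid.

        features: (num_frames, feature_dim) array with one embedding per video frame.
        """
        kmeans = KMeans(n_clusters=num_keyframes, n_init=10, random_state=0)
        labels = kmeans.fit_predict(features)
        keyframes = []
        for c in range(num_keyframes):
            members = np.where(labels == c)[0]
            # The representative keyframe of a cluster is the member closest to
            # the cluster centroid in feature space.
            dists = np.linalg.norm(features[members] - kmeans.cluster_centers_[c], axis=1)
            keyframes.append(int(members[np.argmin(dists)]))
        return sorted(keyframes)

    if __name__ == "__main__":
        # Toy data: 300 random 128-D "frame features", from which 5 keyframes
        # would be selected for manual depth labeling.
        rng = np.random.default_rng(0)
        print(select_keyframes(rng.normal(size=(300, 128)), num_keyframes=5))

    In such a pipeline, depth maps would be annotated only for the returned keyframe indices, and the texture-to-depth network would then infer depth for the remaining, unlabeled frames.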

    Abstract (Chinese)
    Abstract
    Acknowledgements
    Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
        1.1 Research Background
        1.2 Motivations
        1.3 Thesis Organization
    Chapter 2 Related Work
        2.1 Automatic Methods for Depth Generation
            2.1.1 Traditional Methods
            2.1.2 Deep Learning-Based Methods
        2.2 Semi-Automatic Methods for Depth Generation
        2.3 Temporal Estimation in Learning-Based Methods
            2.3.1 Non-Local Neural Networks
            2.3.2 BlurPool
        2.4 K-means Clustering Method
    Chapter 3 The Proposed Semi-Supervised Monocular Video Depth Generation System
        3.1 Overview of the Proposed MVDG System
        3.2 Keyframe Selection
            3.2.1 Feature Converter
            3.2.2 Clustering Module
        3.3 Texture-to-Depth Network
            3.3.1 Backbone Network
            3.3.2 Spatial-Temporal Feature Compensation
                3.3.2.1 Spatial-Temporal Attention Module (STAM)
                3.3.2.2 Auxiliary Frames Freeing
        3.4 Network Training
            3.4.1 Training Details of IRNet
                3.4.1.1 Loss Function
                3.4.1.2 Training Strategy
                3.4.1.3 Data Augmentation
            3.4.2 Training Details of T2DNet
                3.4.2.1 Loss Function
                3.4.2.2 Training Strategy
                3.4.2.3 Data Augmentation
    Chapter 4 Experimental Results
        4.1 Environmental Settings and Datasets
        4.2 Ablation Study
            4.2.1 Verification of Methodology
            4.2.2 Verification of Keyframe Selection
            4.2.3 Verification of STAMs
            4.2.4 Verification of Training Strategy
        4.3 Comparisons on Public Datasets
        4.4 Comparisons in Actual Applications
    Chapter 5 Conclusions
    Chapter 6 Future Work
    References


    Full-text availability: on campus from 2025-07-20; not available off campus.
    The electronic thesis has not yet been authorized for public release; for the print copy, please consult the library catalog.