
Graduate Student: 吳治臻 (Wu, Chih-Chen)
Thesis Title: 基於自適應特徵融合之單視角深度圖預測網路 (AFDepth: An Adaptive Fusion Autoencoder for Monocular Depth Estimation)
Advisor: 楊家輝 (Yang, Jar-Ferr)
Degree: Master
Department: Institute of Computer & Communication Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Graduation Academic Year: 111 (2022-2023)
Language: English
Number of Pages: 48
Keywords: deep learning, convolutional neural networks, monocular depth estimation network, 3D video, autoencoder, vision transformer
Abstract:
In computer vision, monocular depth estimation is a widely applied and crucial task that plays a key role in many domains, including 3D scene reconstruction, virtual reality, autonomous driving, and human-computer interaction. The accuracy of the predicted depth maps is critical to the success of these applications. Recent advances in deep learning have overcome the limitations of traditional stereo-camera algorithms, opening up new possibilities and applications in computer vision. This thesis proposes an end-to-end, supervised monocular depth estimation network that generates high-precision depth maps. In the encoder, a multi-scale feature extractor is built with residual learning, and a vision transformer is used to strengthen global information. In the decoder, an adaptive fusion module (AFM) fuses two feature streams carrying different information. Finally, the model is trained with a loss function aligned with human perception, so that it focuses on the depth values of foreground objects. Experimental results show that the proposed autoencoder effectively predicts depth maps from single-view color images.
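The abstract states only that the decoder's adaptive fusion module (AFM) merges two feature streams carrying different information, without describing the mechanism. As a rough illustration of what such a fusion can look like, the sketch below blends an encoder skip feature with a decoder feature through a learned gate; the class name AdaptiveFusionSketch, the single 3x3 gating convolution, and the tensor shapes are illustrative assumptions, not the AFM design from Chapter 3 of the thesis.

    # Hypothetical gated-fusion sketch (not the thesis's actual AFM):
    # blend a skip-connection feature with a decoder feature using a
    # learned per-position weight map.
    import torch
    import torch.nn as nn

    class AdaptiveFusionSketch(nn.Module):
        def __init__(self, channels: int):
            super().__init__()
            # Predict fusion weights from the concatenated features.
            self.gate = nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
                nn.Sigmoid(),
            )

        def forward(self, skip_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
            w = self.gate(torch.cat([skip_feat, dec_feat], dim=1))
            # Complementary blend: each position mixes the two sources.
            return w * skip_feat + (1.0 - w) * dec_feat

    if __name__ == "__main__":
        fuse = AdaptiveFusionSketch(channels=64)
        skip = torch.randn(1, 64, 60, 80)  # local CNN encoder feature (assumed shape)
        dec = torch.randn(1, 64, 60, 80)   # globally enhanced decoder feature (assumed shape)
        print(fuse(skip, dec).shape)       # torch.Size([1, 64, 60, 80])

The sigmoid gate makes the two contributions complementary at every position; the actual AFM described in Section 3.2.2 of the thesis may weight, transform, or combine the features differently.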

Table of Contents:
Chinese Abstract
Abstract
Acknowledgements
Contents
List of Tables
List of Figures
Chapter 1 Introduction
  1.1 Research Background
  1.2 Motivations
  1.3 Thesis Organization
Chapter 2 Related Work
  2.1 Monocular Depth Estimation
  2.2 Autoencoder
  2.3 Skip Connection
  2.4 Vision Transformer
    2.4.1 Single-head Self-attention
    2.4.2 Multi-head Self-attention
    2.4.3 Channel MLP
  2.5 Atrous Spatial Pyramid Pooling
Chapter 3 The Proposed Adaptive Fusion Depth Estimation Network
  3.1 Overview of the Proposed AFDepth
  3.2 Network Architecture
    3.2.1 Encoder
      A. Subsampled Residual and Residual Blocks
      B. Vision Transformer
    3.2.2 Decoder with Adaptive Fusion Modules
      A. Adaptive Fusion Modules
      B. Up-Convolution Block
      C. Deep ASPP Module
  3.3 Training Loss Function
Chapter 4 Experimental Results
  4.1 Environmental Settings
  4.2 Datasets and Training Details
    4.2.1 Datasets
    4.2.2 Data Augmentation
    4.2.3 Training Settings
  4.3 Ablation Study
    4.3.1 Encoder with Various ViT Configurations
    4.3.2 Decoder with Various Fusion Modules
  4.4 Comparison with Other Approaches
    4.4.1 Comparisons on NYU Depth V2 Dataset
    4.4.2 Comparisons on KITTI Dataset
Chapter 5 Conclusions
Chapter 6 Future Work
References

Full-text availability: on campus from 2028-07-31; off campus from 2028-07-31. The electronic thesis has not yet been authorized for public release; please consult the library catalog for the print copy.