
Graduate Student: 蔡煊 (Tsai, Hsuan)
Thesis Title: 應用於MR眼鏡之低解析轉高精確深度估計系統 (Low Resolution to High Precision Depth Estimation for MR Glasses)
Advisor: 楊家輝 (Yang, Jar-Ferr)
Degree: Master
Department: Institute of Computer & Communication Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Graduation Academic Year: 111 (2022-2023)
Language: English
Number of Pages: 45
Chinese Keywords: 深度學習、深度估計、深度補全、自編碼器、混合實境
English Keywords: deep learning, depth estimation, depth completion, autoencoder, mixed reality
Usage Count: Views: 124; Downloads: 8
    In recent years, with the rapid growth of the three-dimensional (3D) and mixed reality (MR) industries, depth estimation has gradually become an important problem in engineering applications, such as object localization in 3D environments, 3D object reconstruction, and MR gesture recognition. In the MR industry, existing depth sensing methods such as LiDAR, time-of-flight (ToF), and structured light can directly acquire depth information of the environment. However, because mobile MR devices such as smart glasses must limit power consumption, the depth information acquired by their depth sensors has a much lower resolution than that of their RGB cameras, and such low-resolution depth is difficult to exploit in engineering applications. In addition, the computational resources of mobile devices are limited, so state-of-the-art depth completion methods are too complex to implement on them. Therefore, in this work, we propose a depth estimation network based on an efficient dual-path autoencoder and adaptive depth dynamic-range estimation, which successfully fuses high-resolution RGB color information with the depth map from a low-resolution depth sensor to produce a high-accuracy depth map. We further implement the proposed network on an MR glasses platform and obtain promising results on a hand-gesture depth sensing task.

    In recent years, with the booming development of the three-dimensional (3D) and mixed reality (MR) industries, depth estimation has gradually become an important problem in engineering applications. For instance, depth information can be used for object localization, 3D object reconstruction, and MR gesture recognition in 3D environments. In the MR industry, existing depth sensing methods such as LiDAR, time-of-flight (ToF), and structured light can directly acquire the depth information of the environment. However, to limit the power consumption of mobile devices such as smart MR glasses, the depth sensors of these devices provide lower-resolution depth information than their RGB cameras, and such low-resolution depth poses challenges for real engineering applications. Additionally, the computational resources of mobile devices are limited, making state-of-the-art depth completion methods too complex to implement on these devices. Therefore, in this research, we propose a solution based on an efficient dual-path autoencoder and an adaptive-bins depth estimation network. This solution successfully fuses high-resolution RGB information with low-resolution depth sensor data to produce high-accuracy depth maps. We further implement the proposed network on an MR glasses platform and achieve promising results in gesture depth sensing tasks.
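
    The abstract describes fusing a high-resolution RGB image with a low-resolution sensor depth map through a dual-path autoencoder and an adaptive-bins head. The following is a minimal PyTorch sketch of that fusion idea only; the module names, channel widths, number of bins, and depth range are illustrative assumptions, not the thesis's actual architecture.

```python
# Hypothetical sketch: two encoder paths (high-res RGB, low-res depth), a shared
# decoder, and an adaptive-bins style head that converts per-pixel bin
# probabilities into metric depth. All sizes are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(in_ch, out_ch, stride=1):
    """3x3 convolution + BatchNorm + ReLU, a typical lightweight building block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class DualPathDepthNet(nn.Module):
    def __init__(self, num_bins=64, max_depth=10.0):
        super().__init__()
        self.max_depth = max_depth
        # RGB path: downsample the high-resolution color image by 4x.
        self.rgb_enc = nn.Sequential(conv_block(3, 32, 2), conv_block(32, 64, 2))
        # Depth path: the low-resolution depth map is bilinearly upsampled to the
        # RGB size first, so the two feature maps can be concatenated later.
        self.depth_enc = nn.Sequential(conv_block(1, 16, 2), conv_block(16, 32, 2))
        # Shared decoder over the fused features.
        self.decoder = nn.Sequential(conv_block(64 + 32, 64), conv_block(64, 64))
        # Adaptive-bins style head: per-pixel probabilities over bins plus a
        # global vector of bin widths (normalized by softmax).
        self.prob_head = nn.Conv2d(64, num_bins, 1)
        self.bin_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_bins)
        )

    def forward(self, rgb, lr_depth):
        # rgb: (B, 3, H, W); lr_depth: (B, 1, h, w) with h << H and w << W.
        lr_up = F.interpolate(lr_depth, size=rgb.shape[-2:], mode="bilinear",
                              align_corners=False)
        feat = torch.cat([self.rgb_enc(rgb), self.depth_enc(lr_up)], dim=1)
        feat = self.decoder(feat)

        # Bin widths -> bin centers (cumulative sum of widths), scaled to range.
        widths = torch.softmax(self.bin_head(feat), dim=1) * self.max_depth
        edges = torch.cumsum(widths, dim=1)
        centers = edges - 0.5 * widths                       # (B, num_bins)

        probs = torch.softmax(self.prob_head(feat), dim=1)   # (B, num_bins, H/4, W/4)
        depth = torch.einsum("bkhw,bk->bhw", probs, centers).unsqueeze(1)
        return F.interpolate(depth, size=rgb.shape[-2:], mode="bilinear",
                             align_corners=False)


# Example: a 480x640 RGB frame fused with a 60x80 depth map from a low-power sensor.
if __name__ == "__main__":
    net = DualPathDepthNet()
    out = net(torch.rand(1, 3, 480, 640), torch.rand(1, 1, 60, 80))
    print(out.shape)  # torch.Size([1, 1, 480, 640])
```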

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
        1.1. Research Background
        1.2. Motivations
        1.3. Thesis Organization
    Chapter 2 Related Work
        2.1. Monocular Depth Estimation
            2.1.1. Fast Monocular Depth Estimation
            2.1.2. Adaptive Bins Estimation
        2.2. RGB-D based Depth Estimation
            2.2.1. Sparse to Dense [12]
            2.2.2. Convolutional Spatial Propagation Network [19]
    Chapter 3 The Proposed Depth Completion Network Architecture
        3.1. Overview of the Proposed Network Architecture
        3.2. Lightweight Encoder
        3.3. Designs of Share Depth Decoder
            3.3.1. SimpleUp
            3.3.2. UpCSPN-k
        3.4. Adaptive Bins Estimator
        3.5. Loss Functions
            3.5.1. Scale-Invariant Log Loss
            3.5.2. Bin-center Distribution Loss
        3.6. LR Depth Simulation Strategies
    Chapter 4 Experimental Results
        4.1. Implementation Details
            4.1.1. Training Environment Settings
            4.1.2. Datasets
            4.1.3. Data Augmentation
            4.1.4. Evaluation Metrics
        4.2. Experimental Results
            4.2.1. Quantitative Results
            4.2.2. Qualitative Results
        4.3. Ablation Study
            4.3.1. Comparison with Different Decoding Layers
            4.3.2. Effectiveness of AdaDRBins
            4.3.3. Optimal Number of Bins
            4.3.4. Performance on Sparse Depth Completion Task
        4.4. Implementation on MR Glasses
            4.4.1. Hardware Setup
            4.4.2. Implementation Details
            4.4.3. Depth Estimation Demos on MR Glasses
    Chapter 5 Conclusions
    Chapter 6 Future Work
    References
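
    Section 3.5.1 of the contents lists a Scale-Invariant Log (SILog) loss. In the cited Eigen et al. [5] and AdaBins [9] line of work, this loss is commonly computed as in the sketch below; the weighting constants shown (lambda = 0.85, alpha = 10) follow the AdaBins formulation, and the thesis's exact variant may differ.

```python
# Sketch of the standard SILog loss over valid (non-zero) ground-truth pixels,
# as defined in the Eigen et al. [5] / AdaBins [9] references; constants are
# the AdaBins defaults and are assumptions here, not the thesis's settings.
import torch


def silog_loss(pred, target, valid_mask, lam=0.85, alpha=10.0, eps=1e-6):
    """Scale-invariant log loss between predicted and ground-truth depth."""
    g = torch.log(pred[valid_mask] + eps) - torch.log(target[valid_mask] + eps)
    return alpha * torch.sqrt((g ** 2).mean() - lam * g.mean() ** 2)


# Usage example with a mask that keeps only pixels where ground truth exists.
pred = torch.rand(1, 1, 60, 80) * 10
target = torch.rand(1, 1, 60, 80) * 10
loss = silog_loss(pred, target, valid_mask=(target > 0))
```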

    [1] J. Suarez and R. R. Murphy, “Hand gesture recognition with depth images: A review,” in 2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot And Human Interactive Communication. IEEE, 2012, pp. 411–417.
    [2] C. Fehn, “Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV,” in Stereoscopic Displays and Virtual Reality Systems XI, vol. 5291. SPIE, 2004, pp. 93-104.
    [3] C. Couprie, C. Farabet, L. Najman, and Y. LeCun, “Indoor semantic segmentation using depth information,” arXiv preprint arXiv:1301.3572, 2013.
    [4] H. Hirschmuller, “Stereo processing by semiglobal matching and mutual information,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 328-341, 2007.
    [5] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Proceedings of Advances in Neural Information Processing Systems, pp. 2366-2374, 2014.
    [6] I. Alhashim and P. Wonka, “High quality monocular depth estimation via transfer learning,” arXiv preprint arXiv:1812.11941, 2018.
    [7] D. Wofk, F. Ma, T.-J. Yang, S. Karaman, and V. Sze, “Fastdepth: Fast monocular depth estimation on embedded systems,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 6101–6108.
    [8] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in 2016 Fourth international conference on 3D vision (3DV). IEEE, pp. 239–248, 2016.
    [9] S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation using adaptive bins,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4009–4018.
    [10] C. Godard, O. M. Aodha and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270-279, 2017.
    [11] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, “Digging into self-supervised monocular depth estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3828–3838, 2019.
    [12] F. Ma and S. Karaman, “Sparse-to-dense: Depth prediction from sparse depth samples and a single image,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 4796–4803, 2018.
    [13] J. Tang, F.-P. Tian, W. Feng, J. Li, and P. Tan, “Learning guided convolutional network for depth completion,” in IEEE Transactions on Image Processing, vol. 30, pp. 1116–1129, 2020.
    [14] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European Conference on Computer Vision. Springer, pp. 746–760, 2012.
    [15] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
    [16] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
    [17] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
    [18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
    [19] X. Cheng, P. Wang, and R. Yang, “Depth estimation via affinity learned with convolutional spatial propagation network,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 103-119.
    [20] S. Liu, S. De Mello, J. Gu, G. Zhong, M.-H. Yang, and J. Kautz, “Learning affinity via spatial propagation networks,” Advances in Neural Information Processing Systems, vol. 30, 2017.
    [21] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556 [cs.CV], 2014.
    [22] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708, 2017.
    [23] H. Fan, H. Su, and L. Guibas, “A point set generation network for 3D object reconstruction from a single image,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2463–2471, 2017.
    [24] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, vol. 32, pp. 8026–8037, 2019.
    [25] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
    [26] Y. Zhang, C. Cao, J. Cheng, and H. Lu, “Egogesture: A new dataset and benchmark for egocentric hand gesture recognition,” IEEE Transactions on Multimedia, vol. 20, no. 5, pp. 1038–1050, 2018.
    [27] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” Advances in Neural Information Processing Systems, vol. 28, 2015.
    [28] T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam, “Netadapt: Platform-aware neural network adaptation for mobile applications,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 285–300.
    [29] W.-J. Yang, J.-F. Yang, G.-C. Chen, P.-C. Chung, and M.-F. Chung, “An assigned color depth packing method with centralized texture depth packing formats for 3D VR broadcasting services,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 1, pp. 122–132, 2018.

    Full-text access: On campus: available immediately; Off campus: available immediately.