| Graduate Student: | 吳國緯 Wu, Guo-Wei |
|---|---|
| Thesis Title: | 具高泛化能力之分類回歸型單目深度估計網路 Classification-Regression Based Monocular Depth Estimation Networks with High Generalization Capability |
| Advisor: | 楊家輝 Yang, Jar-Ferr |
| Degree: | Master (碩士) |
| Department: | 電機資訊學院 - 電腦與通信工程研究所 Institute of Computer & Communication Engineering |
| Year of Publication: | 2025 |
| Academic Year of Graduation: | 113 |
| Language: | English |
| Number of Pages: | 54 |
| Chinese Keywords: | 電腦視覺、單目深度估計、零樣本學習、視覺轉換器 |
| Foreign Keywords: | computer vision, monocular depth estimation, zero-shot learning, vision transformer |
In recent years, monocular depth estimation has received increasing attention in computer vision, particularly in applications such as robotics and autonomous driving, where depth perception plays a critical role. Traditional depth estimation methods rely on multi-view images to infer distance, whereas monocular depth estimation requires only a single RGB image to estimate the relative distance to objects in a scene. However, the generalization ability of existing models is often constrained by the scene distribution of their training data: a model trained on indoor datasets may perform well in other indoor scenarios yet degrade significantly when applied to outdoor environments. Developing a monocular depth estimation model that generalizes across diverse scenes without retraining has therefore become a key challenge. This thesis proposes a monocular depth estimation model with high generalization capability that performs effectively across various scenarios without additional fine-tuning. In terms of model design, we adopt a powerful vision encoder to extract rich features and introduce a spatially adaptive depth prediction head that combines classification and regression strategies for depth inference. Through careful module integration, architectural refinements, and a tailored training strategy, the proposed model achieves a favorable balance between parameter efficiency and inference accuracy while demonstrating robust cross-domain performance and stability on multiple datasets.
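The classification-regression strategy described in the abstract is commonly realized by discretizing the depth range into bins, classifying each pixel over those bins, and regressing the final depth as a probability-weighted combination of bin centers. The PyTorch sketch below illustrates one such head in the spirit of adaptive-bin methods such as AdaBins and ZoeDepth; it is not the thesis' actual architecture, and all module and parameter names (`BinDepthHead`, `num_bins`, the depth-range defaults) are hypothetical.

```python
# Minimal sketch of a classification-regression depth head, assuming an
# adaptive-bin design similar to AdaBins/ZoeDepth. Names and defaults are
# illustrative, not taken from the thesis.
import torch
import torch.nn as nn

class BinDepthHead(nn.Module):
    def __init__(self, in_channels: int, num_bins: int = 256,
                 min_depth: float = 0.1, max_depth: float = 10.0):
        super().__init__()
        self.min_depth, self.max_depth = min_depth, max_depth
        # Classification branch: per-pixel logits over the depth bins.
        self.bin_logits = nn.Conv2d(in_channels, num_bins, kernel_size=1)
        # Regression branch: predicts normalized widths of adaptive bins
        # from globally pooled encoder features.
        self.bin_widths = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_channels, num_bins), nn.Softmax(dim=1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b = feats.size(0)
        # Adaptive bin edges: cumulative widths scaled to the depth range.
        widths = self.bin_widths(feats) * (self.max_depth - self.min_depth)
        edges = self.min_depth + torch.cumsum(widths, dim=1)
        edges = torch.cat(
            [torch.full((b, 1), self.min_depth, device=feats.device), edges],
            dim=1)
        centers = 0.5 * (edges[:, :-1] + edges[:, 1:])         # (B, K)
        # Soft classification over bins, then regression as the
        # probability-weighted sum of the per-image bin centers.
        probs = torch.softmax(self.bin_logits(feats), dim=1)   # (B, K, H, W)
        depth = (probs * centers[:, :, None, None]).sum(dim=1, keepdim=True)
        return depth                                           # (B, 1, H, W)
```

Because the output depth is a soft expectation over bin centers rather than a hard argmax, the head stays differentiable end-to-end while retaining the ordinal structure of the depth range that pure per-pixel regression lacks.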