| Graduate Student: | 陳文正 Chen, Wen-Cheng |
|---|---|
| Thesis Title: | 用於三維場景空間推理的可擴展類神經記憶 Scalable Neural Memory for Spatial Inference in 3D Scenes |
| Advisor: | 朱威達 Chu, Wei-Ta |
| Co-Advisors: | 胡敏君 Hu, Min-Chun; 陳祝嵩 Chen, Chu-Song |
| Degree: | Doctoral (Ph.D.) |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering |
| Year of Publication: | 2022 |
| Academic Year of Graduation: | 111 |
| Language: | English |
| Number of Pages: | 70 |
| Chinese Keywords: | 3D computer vision, spatial perception, scene exploration, robot navigation, reinforcement learning |
| English Keywords: | 3D Spatial Perception, Scene Navigation and Exploration, Reinforcement Learning, 3D Computer Vision |
Inferring the structure of 3D space from 2D images and building a memory of the scene are fundamental abilities of the human visual system. With the scene constructed in the brain, humans can easily complete a variety of spatially related navigation tasks. To give machines similar abilities, researchers in 3D computer vision and robotics use camera geometry models and optimization algorithms to recover scene structure from image sequences, and combine them with path planning and feedback control algorithms to build navigation robot systems.
However, the structural maps built by geometry-based algorithms lack an abstract notion of objects, so such systems can only accomplish simple navigation tasks in which the destination coordinates are given; higher-level tasks such as object search require additional image semantics as an aid. Moreover, without depth information, geometry-based algorithms usually produce only sparsely structured maps, which can lead to structural misjudgments and faulty path planning. Solving these problems is an active research direction in artificial intelligence. In this work, we explore an architecture that differs completely from previous visual robotic systems: the agent, like a biological organism without geometric prior knowledge, learns the notion of spatial perception by observing images and builds an abstract scene memory to support spatially related navigation tasks. The work has three parts. The first part learns the concept of spatial transformation from observed images and their corresponding observation poses; here we propose the Spatial Transformation Routing mechanism, which lets the agent infer an abstract 3D scene representation from 2D observations in real time. The second part gives the agent the ability to expand its scene memory; here we propose the Scene Memory Network with Spatial-Aware Memory Controlling, so that the agent is not restricted to small spaces and can keep taking in new observations to enlarge its knowledge of the scene. The third part combines the abstract map with the current observation to build a decision-making model, and designs an intrinsic curiosity measure that encourages exploration of the environment so that spatially related tasks can be learned. In the experiments, the agent successfully learns scene synthesis and navigation tasks in simulated environments. Compared with previous scene representation models, the proposed Scene Memory Network produces more structurally plausible image generation results, and it also achieves higher cumulative reward on the object-collection task than a purely image-based reinforcement learning model.
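To make the memory-expansion idea in the second part above more concrete, the following is a minimal PyTorch sketch of one way an incremental, gated write into a fixed set of abstract scene cells could look. The class name, the gating form, and all tensor shapes are illustrative assumptions, not the exact Spatial-Aware Memory Controlling design of the dissertation.

```python
import torch
import torch.nn as nn


class GatedSceneMemorySketch(nn.Module):
    """Toy incremental write: blend newly routed features into persistent scene cells."""

    def __init__(self, n_cells=128, feat_dim=32):
        super().__init__()
        # The gate looks at the stored feature and the incoming feature of each cell
        # and decides how much of that cell to overwrite.
        self.gate = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.Sigmoid(),
        )
        self.n_cells = n_cells
        self.feat_dim = feat_dim

    def init_memory(self, batch_size):
        # Empty memory before the first observation arrives.
        return torch.zeros(batch_size, self.n_cells, self.feat_dim)

    def write(self, memory, routed_feat):
        # memory, routed_feat: (B, n_cells, feat_dim)
        g = self.gate(torch.cat([memory, routed_feat], dim=-1))
        return (1.0 - g) * memory + g * routed_feat  # convex blend per cell


if __name__ == "__main__":
    mem_net = GatedSceneMemorySketch()
    memory = mem_net.init_memory(batch_size=2)
    for _ in range(5):                      # absorb five observations in sequence
        routed = torch.randn(2, 128, 32)    # e.g. output of a routing step
        memory = mem_net.write(memory, routed)
    print(memory.shape)                     # torch.Size([2, 128, 32])
```

Because the memory size is fixed and only the cell contents are updated, the agent can keep absorbing observations without the representation growing with trajectory length.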
From an engineering perspective, we propose a new vision system that lets the agent learn an abstract scene representation directly, replacing the structured maps used in the past, so that it can be applied to higher-level tasks. In addition, the proposed model can adapt to different visual sensors through learning, without an additional sensor calibration procedure, so it requires fewer setup steps and offers greater flexibility. From the perspective of cognitive neuroscience, this dissertation builds a model that learns the concept of 3D space, replacing geometry-based prior models with simple neural network architectures to keep the computation as biologically plausible as possible, and thereby offers one possible account of how a biological visual system might operate.
Inferring 3D structure from 2D observation images and constructing a memory of scenes are basic abilities of the human visual system. These abilities play important roles in 3D spatially related tasks such as exploration and navigation. To equip machines with these abilities, researchers in the computer vision and robotics fields adopt camera geometry models and optimization algorithms to recover scene structure from image sequences. Based on the constructed scene structure, path planning and feedback control algorithms can then be applied to achieve robotic navigation tasks. However, the maps constructed by geometry-based models do not contain the abstract concept of objects, so such systems can only solve simple navigation tasks such as navigating the agent to a target point.
For more complex navigation tasks such as image-based navigation or object collection, such a system requires additional semantic information to assist the decision-making process. Furthermore, without depth information, geometry-based models can only construct maps with sparse structure, and the incomplete surface information may lead to planning unreasonable paths. In this work, we design a novel architecture that is quite different from previous robotic vision systems. Our goal is to let an intelligent agent learn 3D perception and construct an abstract scene representation that assists spatially related navigation tasks. This dissertation contains three parts. The first part learns the concept of spatial transformation from images and their corresponding poses; we propose the Spatial Transformation Routing mechanism, which lets the agent perform real-time inference of a 3D scene representation from 2D observations.
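The routing step can be pictured as a pose-conditioned soft assignment of 2D observation cells to a fixed set of abstract 3D scene cells. Below is a minimal PyTorch sketch of that idea; the module name, the MLP that produces the routing weights, and the tensor shapes are assumptions for illustration rather than the exact Spatial Transformation Routing architecture.

```python
import torch
import torch.nn as nn


class SpatialRoutingSketch(nn.Module):
    """Toy pose-conditioned routing of 2D feature cells into abstract scene cells."""

    def __init__(self, n_obs_cells=64, n_scene_cells=128, pose_dim=7):
        super().__init__()
        # A small MLP maps the camera pose to routing logits:
        # one weight per (observation cell, scene cell) pair.
        self.route = nn.Sequential(
            nn.Linear(pose_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_obs_cells * n_scene_cells),
        )
        self.n_obs_cells = n_obs_cells
        self.n_scene_cells = n_scene_cells

    def forward(self, obs_feat, pose):
        # obs_feat: (B, n_obs_cells, feat_dim) features of the 2D observation
        # pose:     (B, pose_dim) encoding of the camera position/orientation
        logits = self.route(pose).view(-1, self.n_obs_cells, self.n_scene_cells)
        weights = torch.softmax(logits, dim=2)  # soft assignment per observation cell
        scene_feat = torch.einsum("boc,bof->bcf", weights, obs_feat)
        return scene_feat                        # (B, n_scene_cells, feat_dim)


if __name__ == "__main__":
    router = SpatialRoutingSketch()
    obs = torch.randn(2, 64, 32)
    pose = torch.randn(2, 7)
    print(router(obs, pose).shape)  # torch.Size([2, 128, 32])
```

Because the routing weights depend only on the pose, the same learned mapping can be reused for any observation taken from that viewpoint, which is what allows the scene representation to be assembled in real time.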
The second part constructs an expandable memory for the intelligent agent. We propose the Scene Memory Network with Spatial-Aware Memory Controlling, which lets the agent incrementally build the scene representation from new observations. The third part constructs the decision-making model and a curiosity-based intrinsic reward for spatially related tasks on top of the abstract scene representation. The experiments show that the proposed Scene Memory Network achieves better and more geometrically plausible rendering results than previous geometry-operation-free scene representation models, and that the proposed reinforcement learning model, built on the learned scene representation, achieves higher scores than the baseline model.

From an engineering perspective, we propose a new vision system for spatially related tasks that directly learns an abstract scene representation instead of an explicit structured map, and this representation can be applied to more advanced tasks. Furthermore, the proposed model can adapt to different camera sensors through learning, without additional measurement of camera parameters; as a result, it has a simpler system setup process and is more flexible across sensors. From the perspective of cognitive neuroscience, we construct a mathematical model that captures the concept of 3D space from observations, and we use simple neural network architectures in place of hand-designed geometry-based operations to keep the computation as "biologically plausible" as possible. In other words, the proposed model offers one possible account of how the human biological vision system might operate.
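The curiosity-based intrinsic reward mentioned above can be illustrated with a standard forward-model prediction error in the spirit of intrinsic-curiosity methods: the agent is rewarded when the consequence of its action is hard to predict, which pushes it toward unvisited parts of the scene. The sketch below is a generic, hedged example (all names and shapes are assumptions), not the reward actually used in the dissertation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CuriositySketch(nn.Module):
    """Toy forward-model curiosity: intrinsic reward = next-state prediction error."""

    def __init__(self, state_dim=256, n_actions=6):
        super().__init__()
        self.n_actions = n_actions
        # Predicts the next state feature from the current state feature and action.
        self.forward_model = nn.Sequential(
            nn.Linear(state_dim + n_actions, 256),
            nn.ReLU(),
            nn.Linear(256, state_dim),
        )

    def forward(self, state, action, next_state):
        # state, next_state: (B, state_dim), e.g. pooled scene-memory features
        # action: (B,) integer actions, one-hot encoded below
        a = F.one_hot(action, self.n_actions).float()
        pred = self.forward_model(torch.cat([state, a], dim=-1))
        # Per-sample prediction error serves as the intrinsic reward;
        # its mean doubles as the forward-model training loss.
        error = ((pred - next_state) ** 2).mean(dim=-1)
        return error.detach(), error.mean()


if __name__ == "__main__":
    icm = CuriositySketch()
    s, s_next = torch.randn(4, 256), torch.randn(4, 256)
    a = torch.randint(0, 6, (4,))
    r_intrinsic, fwd_loss = icm(s, a, s_next)
    print(r_intrinsic.shape, fwd_loss.item())
```

In practice such an intrinsic term is typically added to the task (extrinsic) reward with a small weight, so exploration is encouraged without overwhelming the objective of the spatial task itself.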