Graduate Student: 姜林毅 (Jiang, Lin-Yi)
Thesis Title: 連結邊緣感知與世界模型:運用深度時空學習建構快速且一致之空間狀態估測
Bridging Edge Perception and World Models: Fast and Consistent Spatial State Estimation via Deep Spatio-Temporal Learning
Advisor: 陳朝鈞 (Chen, Chao-Chun)
Co-advisor: 蘇維宗 (Su, Wei-Tsung)
Degree: Doctoral (博士)
Department: 電機資訊學院 - 製造資訊與系統研究所 (Institute of Manufacturing Information and Systems)
Year of Publication: 2026
Academic Year of Graduation: 114
Language: English
Pages: 132
Keywords (Chinese): 世界模型、物理導向學習、時空一致性、單眼幾何測距、邊緣智能、感知控制閉環、資源自適應、微小目標感知
Keywords (English): World Model, Physics-guided Learning, Spatiotemporal Consistency, Monocular Geometric Ranging, Edge Intelligence, Perception-Control Loop, Resource Adaptation, Tiny Object Perception
  • 為賦予智慧系統在真實動態環境中進行連續互動與自主決策的能力,建構具備物理理解特性的空間互動世界模型,已成為下一代智慧系統的重要發展方向。相較於僅著重於訊號層面的特徵識別,世界模型更需解決跨時空的物理一致性問題,以確保感知狀態能真實反映世界中的運動法則與因果邏輯,進而將針對單一時刻的離散判讀,提升為對連續時空動態的精確掌握。然而,現有技術多集中於單幀影像的靜態識別與測距優化,缺乏對時空一致性與語意—幾何協同的系統性建模,使得感知輸出難以轉化為可供世界模型長期採用的穩定空間狀態流,成為理論與實務落地之間的關鍵瓶頸。

    為彌補上述落差,本論文提出一套連結邊緣感知與世界模型的工業級跨層整合架構,並建立一項超越單一演算法優化的空間狀態標準協定。該架構涵蓋邊緣資源治理、物理幾何一致性推論,以及以任務為驅動的系統供需調節機制,旨在將分散且異質的感知結果,轉化為具備連續性、可解釋性與可長期採用特性的空間狀態服務。本論文透過三個層面的技術設計與場域實證,系統性地驗證此架構的可行性與實務價值。

    首先,在邊緣裝置層,針對世界模型在連續決策中所需的長期運作韌性,本研究提出一套資源自適應的邊緣—雲端協作架構。透過語意導向的來源治理機制與ECB-Net,系統能在受限的算力與頻寬條件下,主動篩選具決策價值的語意特徵,確保僅有關鍵資訊進入後端推論流程。於智慧畜牧場域的長期實證結果顯示,該架構在大幅降低傳輸負載的同時,仍能維持感知服務的連續性與穩定性,有效提升邊緣系統於惡劣環境中的實務生存力。

    其次,在演算法層,本論文針對連續決策所需的物理一致性,提出一致且精確測距估測(Consistent and Accurate Ranging Estimation, CARE)架構。CARE 將測距問題建模為深度時空學習任務,透過語意—幾何協同建模與時空約束機制,有效消除動態場景中常見的幾何幻覺與尺度漂移。在微型載具授粉任務的模擬驗證中,CARE 展現出優異的跨時間穩定性與微小目標測距精度,證明其能產出符合物理邏輯的連續幾何狀態。

    最後,在系統整合層,為支援連續決策所需的任務適應性,本研究進一步定義一套標準化的空間狀態閉環執行協定。該協定將感測、推論與決策封裝為可交換的狀態契約,並實現以任務需求為核心的供需調節迴圈。在高動態高爾夫揮桿生物力學分析的實務驗證中,本系統成功將連續影像資料轉換為具備生物力學意義的可操作指標,並於區分初學者與專家動作特徵時展現工業級的分析精度。

    綜合而言,本論文建立一套從理論模型延伸至產業落地的完整方法論。透過整合邊緣語意治理、時空一致性推論與標準化閉環協定,本研究證實 AIoT 感知系統得以昇華為穩定、可解釋且具實務生存力的空間狀態服務,並為世界模型於工業自動化、精準農業與人機互動等應用場域中的標準化部署,奠定關鍵且可行的工程基石。

    Enabling intelligent systems to continuously interact with and autonomously make decisions in real-world dynamic environments requires the construction of interactive world models endowed with physical understanding. Beyond signal-level feature recognition, a world model must address spatiotemporal physical consistency to ensure that perceptual states faithfully reflect the underlying laws of motion and causal structures of the physical world. Only through such consistency can discrete, instantaneous interpretations be elevated into accurate representations of continuous spatiotemporal dynamics. However, existing perception systems predominantly focus on static recognition and single-frame ranging optimization, lacking systematic modeling of spatiotemporal consistency and semantic–geometric cooperation. As a result, perception outputs often remain fragmented, discontinuous, and unstable, preventing their direct adoption as reliable spatial state streams for continuous decision-making in world models. This gap between theoretical modeling and practical deployment constitutes a critical bottleneck for industrial-scale world model realization.

    To bridge this gap, this dissertation proposes an industrial-grade cross-layer integration framework that systematically connects edge perception with world model adoption through a standardized spatial state abstraction. Moving beyond isolated algorithmic optimization, the proposed framework establishes a unified spatial state specification encompassing edge resource governance, physically consistent geometric inference, and task-driven supply–demand regulation. The framework aims to transform heterogeneous and distributed perceptual outputs into continuous, interpretable, and sustainably adoptable spatial state services. Its feasibility and practical value are validated through three complementary technical contributions and real-world deployments.

    At the edge device layer, this work addresses the long-term operational resilience required for continuous decision-making by proposing a resource-adaptive edge–cloud collaborative architecture. Through semantic-oriented source governance and the ECB-Net framework, edge devices actively filter and retain decision-relevant semantic features under constrained computational and bandwidth conditions, ensuring that only high-value information is transmitted upstream. Long-term deployments in smart livestock farming environments demonstrate that the proposed architecture significantly reduces transmission load while maintaining stable and continuous monitoring services, effectively enhancing the survivability of edge perception systems under harsh conditions.
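
    As a rough illustration of this source-governance idea, the sketch below shows how an edge node might keep only frames whose semantic-relevance score clears a bandwidth-aware threshold. It is a minimal, assumption-laden sketch rather than the dissertation's implementation: the names (EdgeGovernor, edge_filter, select) and the thresholding rule are hypothetical stand-ins for the filtering role that ECB-Net plays in Chapter 4.

    from dataclasses import dataclass
    from typing import Callable, Sequence

    @dataclass
    class EdgeGovernor:
        """Hypothetical edge-side filter: upload only decision-relevant frames."""
        edge_filter: Callable[[bytes], float]  # lightweight on-device scorer (ECB-Net-like role)
        base_threshold: float = 0.5            # minimum semantic-relevance score to upload

        def threshold(self, bandwidth_budget: float) -> float:
            # Tighten the filter as the available uplink budget (0..1) shrinks.
            return self.base_threshold + (1.0 - bandwidth_budget) * (1.0 - self.base_threshold)

        def select(self, frames: Sequence[bytes], bandwidth_budget: float) -> list:
            th = self.threshold(bandwidth_budget)
            return [f for f in frames if self.edge_filter(f) >= th]

    # Example: with a 30% uplink budget, only the highest-scoring frames pass.
    # governor = EdgeGovernor(edge_filter=my_scoring_model)
    # to_upload = governor.select(captured_frames, bandwidth_budget=0.3)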

    At the algorithmic layer, this dissertation introduces the Consistent and Accurate Ranging Estimation (CARE) framework to ensure physical consistency for continuous decision-making. Unlike conventional single-frame approaches, CARE formulates ranging as a deep spatiotemporal learning problem. By jointly modeling semantic and geometric representations under spatiotemporal constraints, CARE suppresses geometric illusions and scale drift in dynamic scenes. Simulation studies on micro-vehicle pollination tasks demonstrate that CARE achieves superior cross-time stability and high-precision ranging for small dynamic targets, validating its capability to produce physically plausible continuous geometric states.
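
    One way to read the spatiotemporal constraint described above is as a training objective that penalizes range predictions whose frame-to-frame change is physically impossible for the tracked target. The sketch below is an illustrative stand-in only, assuming a known maximum relative speed v_max and frame interval dt; it is not CARE's actual loss formulation.

    import torch
    import torch.nn.functional as F

    def temporal_consistency_penalty(pred_ranges: torch.Tensor, v_max: float, dt: float) -> torch.Tensor:
        # pred_ranges: shape (T,), predicted target distances over T consecutive frames.
        # Any frame-to-frame jump larger than v_max * dt cannot be explained by the
        # target's motion, so the excess is treated as drift and penalized.
        delta = (pred_ranges[1:] - pred_ranges[:-1]).abs()
        excess = torch.clamp(delta - v_max * dt, min=0.0)
        return excess.mean()

    def ranging_loss(pred_ranges: torch.Tensor, gt_ranges: torch.Tensor,
                     v_max: float = 2.0, dt: float = 1.0 / 30.0, lam: float = 0.1) -> torch.Tensor:
        accuracy = F.l1_loss(pred_ranges, gt_ranges)                         # per-frame ranging error
        consistency = temporal_consistency_penalty(pred_ranges, v_max, dt)   # cross-frame plausibility
        return accuracy + lam * consistency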

    At the system integration layer, this work defines a standardized closed-loop spatial state execution protocol to support task adaptivity in continuous decision-making. The protocol encapsulates sensing, inference, and decision-making into exchangeable state contracts and enables a task-driven supply–demand regulation loop. In a highly dynamic golf swing biomechanical analysis scenario, the system converts continuous visual observations into biomechanically meaningful actionable indicators, achieving industrial-grade accuracy in distinguishing motion patterns between novices and experts.
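
    To make the notion of an exchangeable state contract concrete, the sketch below shows one plausible shape for such a message, assuming JSON transport; the field set and names are illustrative guesses rather than the protocol actually defined in Chapter 6.

    import json
    from dataclasses import dataclass, asdict

    @dataclass
    class SpatialStateContract:
        """Hypothetical per-observation state contract exchanged between producers and consumers."""
        source_id: str           # edge sensor / inference node that produced the state
        timestamp_ms: int        # acquisition time of the underlying observation
        target_id: str           # tracked entity, e.g. a club head or a body joint
        range_m: float           # estimated distance in metres
        confidence: float        # 0..1 quality estimate reported by the producer
        valid_for_ms: int        # how long a consumer may treat the state as fresh
        demanded_rate_hz: float  # update rate the consuming task currently requests

        def to_message(self) -> str:
            return json.dumps(asdict(self))

    # A consumer renegotiates demanded_rate_hz when its task changes, closing the
    # supply-demand loop without either side depending on the other's internals.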

    In summary, this dissertation establishes a comprehensive methodology bridging theoretical world model concepts and industrial deployment. By integrating edge semantic governance, spatiotemporally consistent inference, and standardized closed-loop protocols, this work demonstrates that AIoT perception networks can be elevated into stable, interpretable, and practically viable spatial state services. The proposed framework lays a solid engineering foundation for the standardized deployment of world models in industrial automation, precision agriculture, and human–machine interaction applications.

    摘要
    Abstract
    誌謝
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1. Introduction
      1.1. Applications of World Models in Real-World Understanding and Industrial Demands
        1.1.1. From Visual Realism to Physical Consistency: World Models as the Core Foundation of Physical AI
      1.2. The Role of Accurate Ranging in World Model-Based Spatial Interaction for Smart Systems
      1.3. Evolution of Ranging Technologies: From Traditional Methods to Deep Learning
        1.3.1. Limitations of Traditional Physics-based Ranging
        1.3.2. Paradigm Shift to Deep Learning-based Perception
      1.4. Key Challenges in Real-World Deployment: The Gap between Static Models and Dynamic Interaction
      1.5. Research Objectives: Building Physically Consistent and Long-Term Robust Perception and Spatial States for World Models
      1.6. Research Contributions and Impact on Spatial State Modeling and Deployment for World Models
    Chapter 2. Related Works
      2.1. Classical Ranging Techniques and Limitations
        2.1.1. Geometric and Time-Domain Ranging Principles
        2.1.2. Bottlenecks of Traditional Methods in Dynamic Scenes and Semantic Understanding
      2.2. Deep Learning-based Ranging and Perception Technologies
        2.2.1. From Monocular Depth to Foundation Models
        2.2.2. Temporally Consistent Monocular Video Depth Estimation
        2.2.3. Multimodal Fusion for Robust Ranging
        2.2.4. Generalization, Bias, and Robustness Analysis
        2.2.5. Domain-Specific Applications: MUVs and Smart Agriculture
      2.3. Research on Perceptual Consistency and Spatio-Temporal Stability
        2.3.1. Time-Series Consistency and Visual Stabilization Techniques
        2.3.2. Geometric Consistency in Dynamic Scenes
      2.4. Research on AIoT Perception and Edge-Cloud Collaborative Architectures
        2.4.1. AIoT Perception Systems and Semantic-Oriented Surveillance Architectures
        2.4.2. Efficiency Design for Edge Lightweighting and Cloud Collaboration
      2.5. Research on Lightweight Deep Learning Networks
        2.5.1. Summary and Research Positioning
    Chapter 3. Key Requirements for World Model-Centric Spatial Perception
      3.1. Key Requirements at the Device Level (AIoT/Edge)
      3.2. Key Requirements at the Method Level (Deep Learning Ranging)
      3.3. Key Requirements at the Application Level (System and Environment)
      3.4. Chapter Summary and Dissertation Roadmap
    Chapter 4. Proposed Method I: AIoT-Cloud-Integrated Smart Livestock Surveillance via Assembling Deep Networks with Considering Robustness and Semantics Availability
      4.1. Introduction
      4.2. The Proposed Smart Livestock Surveillance Scheme
        4.2.1. Deep Network Assembly for Semantic Surveillance Services
        4.2.2. Proposed Architecture
        4.2.3. Architectural Overview of Cloud-Side Deep Networks
      4.3. Core Architectural Mechanisms for Smart Surveillance
        4.3.1. The Expandable-Convolutional-Block Neural Network (ECB-Net) Architecture
        4.3.2. Automated Mechanism for ECB-Net Model Generation
      4.4. Performance Study
        4.4.1. System Deployment and Experimental Configuration
        4.4.2. Experiment 1: Evaluation of the Filtering Efficacy of the ECB-Net
        4.4.3. Experiment 2: Analysis of the Adaptability of the ECB-Net
        4.4.4. Experiment 3: Assessment of the Quality of the Generated ECB-Net Models
        4.4.5. Experiment 4: Comparative Evaluation against Representative Lightweight Neural Networks
      4.5. Summary
    Chapter 5. Proposed Method II: Consistent and Accurate Ranging Estimation of Tiny Objects for Mobile Uncrewed Vehicles
      5.1. Introduction
      5.2. Proposed Method
        5.2.1. Design of the Proposed CARE Framework
        5.2.2. 2D-Sense Feature Recipe
        5.2.3. Consistent Ranging Net
      5.3. Case Study
        5.3.1. Experimental Setup and Evaluation Metrics
        5.3.2. Analysis of Accuracy and Runtime Efficiency
        5.3.3. Consistency Evaluation
        5.3.4. Ablation Analysis
        5.3.5. Qualitative Comparisons
      5.4. Summary
    Chapter 6. An Industrial Cross-Layer Integration Framework for Ranging-Based Spatial States in World Models
      6.1. Cross-Layer Integration of Ranging-Based Spatial States for World Model-Centric Supply-Demand Loop
        6.1.1. End-to-End Cross-Layer Architecture for Spatial State Service
        6.1.2. Spatial State Contract and Standardized Closed-Loop Protocol
      6.2. System Instantiation and End-to-End Integration of Ranging-Based Spatial States for High-Dynamics Biomechanics Analysis
        6.2.1. Instantiating the Cross-Layer Geometric State Pipeline for Golf Swing Analysis in Edge-Cloud Architecture
        6.2.2. Validation of Kinematic State Stability: Novice vs. Expert Comparative Analysis
      6.3. Summary
    Chapter 7. Conclusions and Future Work
    Bibliography
