
Graduate Student: 李貴田 (LEE, KUEI-TIEN)
Thesis Title: 深度強化學習於群集無人機合作策略開發
Deep Reinforcement Learning in the Development of Strategies for Cluster Drones
Advisor: 陳介力 (Chen, Chieh-Li)
Degree: Master
Department: College of Engineering - Department of Aeronautics & Astronautics
Year of Publication: 2025
Graduation Academic Year: 113 (ROC calendar)
Language: Chinese
Number of Pages: 80
Chinese Keywords: 強化學習 (Reinforcement Learning), 近端策略優化 (Proximal Policy Optimization), 多智能體 (Multi-Agent)
Foreign Keywords: Reinforcement Learning, PPO, MARL
In modern military operations, quadrotor UAVs offer strong maneuverability, low cost, the ability to carry out precision-strike missions with zero friendly casualties, and high adaptability to battlefield environments. Human-operated UAVs, however, still carry considerable uncertainty and a risk of misfire, and the operator must make the final decision under psychological and physical stress, so agents capable of autonomous decision-making have become a goal of future development. This study builds a simulation and verification platform with OpenAI Gymnasium, uses the PyBullet package to define the physical model files of the quadrotor UAV, and introduces reinforcement learning for decision-making. The Proximal Policy Optimization (PPO) algorithm is selected for its faster convergence and better stability in continuous action spaces, and the internal network and reward function design are tuned to accomplish the designed tasks.
The research is divided into two stages. The first stage verifies the training environment and tests its stability, establishing the experimental workflow from the velocity-controller feedback response, through single-agent training tests, to a dynamic hyperparameter design method. The second stage designs multi-drone missions in which the agents cooperate to track task waypoints and avoid obstacles, with the objective of discovering the most efficient strategy. The algorithm adopts a centralized-training, decentralized-execution (CTDE) multi-agent cooperative architecture: the agents share reward functions and environment parameters through a central agent and compute their next decisions synchronously, which prevents divergent results or convergence to local optima, and obstacles are added to the training environment to simulate complex terrain. The results show that, even in complex environments, reinforcement learning can still discover high-reward optimal decision paths through exploration, giving it strong development potential for future defense applications.
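As a rough illustration of the toolchain described above (OpenAI Gymnasium, PyBullet, and PPO), the sketch below trains a single-agent hover policy with Stable-Baselines3. The HoverAviary environment and its import path come from the open-source gym-pybullet-drones package rather than from the thesis, and every hyperparameter value shown is an illustrative assumption, not a setting reported in this work.

```python
# Minimal sketch, assuming the HoverAviary environment from the open-source
# gym-pybullet-drones package; the import path and constructor arguments
# differ between package versions, and the hyperparameters are illustrative.
from stable_baselines3 import PPO
from gym_pybullet_drones.envs.HoverAviary import HoverAviary  # assumed path

# Single-quadrotor hover task simulated by the PyBullet physics engine.
env = HoverAviary(gui=False)

# PPO actor-critic with a small MLP; values below are generic defaults,
# not the network sizes or reward weights tuned in the thesis.
model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    ent_coef=0.01,
    policy_kwargs=dict(net_arch=[64, 64]),
    verbose=1,
)

model.learn(total_timesteps=1_000_000)
model.save("ppo_hover_single_agent")
```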

Quad-rotor UAVs have the advantages of high maneuverability, low cost, and the ability to complete precision-strike missions with zero casualties in modern military affairs. However, human-operated UAVs are still highly uncertain, carry the possibility of accidental engagement, and place psychological and physical pressure on the final decision. Therefore, intelligent agents that can perform autonomous decision-making will become the goal of future development. This study uses OpenAI Gymnasium to establish a simulation and verification environment platform, the PyBullet package to define the quadcopter UAV physical model, and introduces Reinforcement Learning (RL) methods for decision-making. It adopts the PPO algorithm, which converges faster and is more stable in continuous action spaces, and adjusts the internal network architecture and reward function design to achieve the design tasks.
This research is divided into two stages. The first stage is verification of the training environment, establishing the experimental process from the speed-controller feedback response, through single-agent training, to dynamic hyperparameter design. The second stage is mission oriented: the agents cooperate to complete waypoint tracking and obstacle-avoidance strategies, with the goal of finding the most efficient approach. The algorithm uses a CTDE multi-agent structure in which the agents share reward functions and environment parameters through a central agent and synchronously calculate the next decision, avoiding divergent results or getting stuck in local optima. The training environment adds obstacles to simulate complex terrain. The results show that RL agents can find optimal, high-reward decision paths through exploration in complex environments, and the approach has high potential for further applications.
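The "dynamic hyperparameter design" mentioned in the abstract can be pictured as scheduling the learning rate and the entropy bonus while PPO trains. A minimal sketch of that idea with Stable-Baselines3 follows; the linear decay shape, the start and end values, and the callback itself are assumptions made for illustration, not the schedules used in the thesis.

```python
# Minimal sketch of dynamic hyperparameter scheduling for PPO with
# Stable-Baselines3; decay shapes and values are illustrative assumptions.
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback


def linear_lr_schedule(initial_lr: float):
    """Stable-Baselines3 schedule: progress_remaining runs from 1.0 to 0.0."""
    def schedule(progress_remaining: float) -> float:
        return progress_remaining * initial_lr
    return schedule


class EntropyAnnealCallback(BaseCallback):
    """Linearly anneal PPO's entropy coefficient so exploration is strong
    early in training and policy updates stay stable late in training."""

    def __init__(self, start: float = 0.02, end: float = 0.001,
                 total_steps: int = 1_000_000):
        super().__init__()
        self.start, self.end, self.total_steps = start, end, total_steps

    def _on_step(self) -> bool:
        frac = min(self.num_timesteps / self.total_steps, 1.0)
        # PPO reads self.ent_coef at every gradient update, so mutating the
        # attribute here changes the exploration bonus on the fly.
        self.model.ent_coef = self.start + frac * (self.end - self.start)
        return True


# Usage with any Gymnasium-compatible drone environment `env`:
# model = PPO("MlpPolicy", env, learning_rate=linear_lr_schedule(3e-4), ent_coef=0.02)
# model.learn(total_timesteps=1_000_000, callback=EntropyAnnealCallback())
```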

Abstract (Chinese) i
ABSTRACT ii
Acknowledgements ix
Table of Contents x
List of Figures xii
List of Tables xv
List of Symbols xvi
Chapter 1 Introduction 1
1.1 Preface 1
1.2 Research Background 2
1.3 Literature Review 3
1.4 Thesis Organization 4
Chapter 2 Reinforcement Learning and Multi-Agent Systems 6
2.1 Reinforcement Learning Theory 6
2.1.1 Reinforcement Learning Model 8
2.2 Proximal Policy Optimization 10
2.2.1 Actor-Critic Network 13
2.3 Multi-Agent Theoretical Foundations 14
2.3.1 Multi-Agent Reinforcement Learning Architecture 18
2.3.2 Multi-Agent Reinforcement Learning Training Process 19
Chapter 3 Reinforcement Learning for Cluster Drones 21
3.1 Environment Setup with OpenAI Gym and PyBullet 21
3.2 Static Parameters of the Drone 23
3.2.1 Rotational Torque and Power Output 24
3.2.2 Collision Model 25
3.3 Dynamic Control of the Drone 25
3.3.1 Simulation Verification 29
3.4 Reinforcement Learning Verification 32
3.4.1 Single-Agent Verification 34
3.4.2 Multi-Agent Verification 35
Chapter 4 Learning Strategies for Cluster Drones 37
4.1 Multi-Agent Stability Design 37
4.1.1 Learning-Rate Adjustment Strategy 39
4.1.2 Entropy Adjustment Strategy 41
4.1.3 Network Adjustment Strategy 41
4.2 Multi-Drone Tracking of Fixed Target Points 42
4.2.1 Agent Trailing and Obstacle-Avoidance Task 43
4.2.2 Waypoint-Swapping Task 45
4.2.3 Two-Drone Obstacle Task 47
4.3 Multi-Drone Tracking of Moving Target Points 50
Chapter 5 Conclusions and Future Work 58
5.1 Conclusions 58
5.2 Future Work 59
References 60

