
Graduate Student: Shao, Ching-An (邵靖安)
Thesis Title: 在強化學習架構下生產最佳化之決策
Decision-making for Production Optimization under the Structure of Reinforcement Learning
Advisor: Chuang, Ya-Tang (莊雅棠)
Degree: Master
Department: Department of Industrial and Information Management, College of Management
Year of Publication: 2024
Graduation Academic Year: 112
Language: Chinese
Number of Pages: 57
Keywords: production planning, Markov decision process, reinforcement learning
Abstract (Chinese, translated):
This study examines a production planning problem: the production decisions of a single manufacturer operating in an uncertain environment. We assume the supply chain consists only of customers and a single manufacturer. In practice, the manufacturer cannot observe the actual state of the environment in advance; it can only roughly estimate market fluctuations, and the profit earned by decisions made under different economic conditions, from past production and sales experience. We therefore propose a reinforcement learning model for production planning that balances the returns from production decisions made while exploring the unknown environment against those made once the environment is known. The goal is for the manufacturer to learn a policy whose profit approximates the expected optimal profit attainable when the environment is known. We use the Q-Learning algorithm, which replaces the original profit function with a Q-Table when making decisions, using Q-Values to measure the value of decisions made under different environments. After solving the model numerically, we assess whether Q-Learning can match the best production decisions achievable when information is known, compare it with other algorithms, and run simulations under different scenarios.
Through these comparisons and simulations, we find that Q-Learning maintains a reasonable profit level even when the number of periods is limited, and its learning efficiency improves as the number of periods grows, so it performs especially well in the model we construct. The UCB algorithm learns poorly when the number of periods is small; its profit also improves as the number of periods increases, but it still falls short of Q-Learning. The manufacturer might therefore modify the original UCB strategy during the early learning phase by following the Q-Learning policy instead.

Abstract (English):
This study investigates a production planning problem, analyzing the production decisions of a single manufacturer in an environment with uncertainty. We assume that the supply chain consists only of the customer and a single manufacturer. In real life, the manufacturer cannot observe the actual values of the environment in advance; it can only roughly estimate market fluctuations based on past production and sales experience and make decisions accordingly to earn profits under different economic conditions. Therefore, we propose a reinforcement learning model for production planning, which primarily balances the returns of production decisions made under unknown and known environments. The objective of this study is to enable the manufacturer to learn a strategy that approximates the expected optimal profit in a known environment.
This study applies the Q-Learning algorithm, which replaces the original profit function with a Q-Table to make optimal decisions. The main goal is to use Q-Values to measure the value of decisions made under different environments. After solving the model numerically, we assess whether the Q-Learning algorithm can make the best production decisions achievable when information is known, compare it with other algorithms, and conduct simulation analyses under different situations.
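The thesis's model details are not reproduced in this record, but the following is a minimal sketch of the kind of tabular Q-Learning update described above, applied to a toy production problem. Everything in it, including the state definition (market condition plus inventory level), the profit function, and the learning parameters, is an illustrative assumption rather than the thesis's actual formulation.

    import random
    from collections import defaultdict

    # Toy setting (hypothetical, for illustration only):
    # state  = (market condition, inventory level); action = production quantity.
    MARKETS = ["low", "high"]
    ACTIONS = [0, 1, 2, 3]
    ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1   # learning rate, discount, exploration rate

    def step(state, action):
        """One period: realize demand, collect profit, move to the next state."""
        market, inventory = state
        demand = random.randint(0, 2) if market == "low" else random.randint(1, 3)
        sold = min(inventory + action, demand)
        leftover = inventory + action - sold
        profit = 5 * sold - 2 * action - leftover      # revenue - production - holding cost
        next_market = random.choice(MARKETS)           # stylized market fluctuation
        return (next_market, min(leftover, 3)), profit

    Q = defaultdict(float)    # the Q-Table that stands in for the profit function

    def choose(state):
        """Epsilon-greedy: occasionally explore, otherwise exploit the Q-Values."""
        if random.random() < EPSILON:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(state, a)])

    state = ("low", 0)
    for period in range(10000):
        action = choose(state)
        next_state, reward = step(state, action)
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        # Standard one-step Q-Learning update toward the observed reward plus
        # the discounted value of the best next action.
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state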
Through comparisons and simulations under different conditions, we find that Q-Learning can maintain a certain profit level even over a limited number of periods, and its learning efficiency improves as the number of periods increases.
Therefore, the Q-Learning algorithm performs exceptionally well in our constructed model. In contrast, the UCB algorithm performs poorly when learning over few periods; as the number of periods increases, the profit obtained by the UCB algorithm also improves, though not as much as that of the Q-Learning algorithm. Thus, the manufacturer might modify the original UCB strategy in the initial learning phase by adopting the strategy of the Q-Learning algorithm.
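For contrast, here is an equally stylized sketch of a UCB-style decision rule of the kind the comparison refers to, treating each production quantity as an arm of a multi-armed bandit; the reward model and all names are again hypothetical assumptions, not the thesis's formulation. Under this framing, the early-phase modification suggested above would amount to following the learned Q-Learning policy for the first periods before switching to the UCB index.

    import math
    import random

    ACTIONS = [0, 1, 2, 3]                    # hypothetical production quantities
    counts = {a: 0 for a in ACTIONS}          # times each action has been tried
    totals = {a: 0.0 for a in ACTIONS}        # cumulative profit of each action

    def ucb_choose(t):
        """Pick the action with the highest upper confidence bound (UCB1 index)."""
        for a in ACTIONS:                     # play every action once before using the index
            if counts[a] == 0:
                return a
        return max(ACTIONS,
                   key=lambda a: totals[a] / counts[a]
                   + math.sqrt(2 * math.log(t) / counts[a]))

    def profit(action):
        """Stylized noisy one-period profit standing in for the market response."""
        demand = random.randint(0, 3)
        return 5 * min(action, demand) - 2 * action

    for t in range(1, 5001):
        a = ucb_choose(t)
        counts[a] += 1
        totals[a] += profit(a)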

Table of Contents:
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1  Introduction
  1.1  Research Background and Motivation
  1.2  Research Objectives
  1.3  Thesis Structure
Chapter 2  Literature Review
  2.1  Production Planning
  2.2  Demand Uncertainty
  2.3  Reinforcement Learning
  2.4  Q-Learning
Chapter 3  Model Construction
  3.1  Known-Information Model
  3.2  Problem Definition and Assumptions
  3.3  Solution Method
  3.4  Numerical Analysis
    3.4.1  Numerical Analysis under Different Inventory Costs
    3.4.2  Comparison with the Myopic Algorithm
Chapter 4  Unknown-Information Model
  4.1  Unknown-Information Model
  4.2  Solution Method
    4.2.1  Solution Steps of the Q-Learning Model
  4.3  Numerical Analysis
    4.3.1  Performance Comparison between Unknown and Known Information
    4.3.2  Numerical Analysis under Different Learning Rates
  4.4  Comparison with Other Algorithms
    4.4.1  UCB Algorithm
    4.4.2  Myopic Algorithm
    4.4.3  Performance Analysis of Different Algorithms
Chapter 5  Conclusions and Future Research Directions
  5.1  Conclusions
  5.2  Future Research Directions
References

