| Graduate student: | 邵靖安 Shao, Ching-An |
|---|---|
| Thesis title: | 在強化學習架構下生產最佳化之決策 Decision-making for Production Optimization under The Structure of Reinforcement Learning |
| Advisor: | 莊雅棠 Chuang, Ya-Tang |
| Degree: | Master |
| Department: | College of Management - Department of Industrial and Information Management |
| Year of publication: | 2024 |
| Academic year of graduation: | 112 (ROC calendar) |
| Language: | Chinese |
| Number of pages: | 57 |
| Keywords: | production planning, Markov decision process, reinforcement learning |
This study investigates a production planning problem: the production decisions of a single manufacturer in an uncertain environment. We assume a supply chain consisting only of customers and a single manufacturer. In practice, the manufacturer cannot observe the actual values of the environment in advance; it can only roughly estimate market fluctuations, and the profits earned by decisions made under different economic conditions, from past production and sales experience. We therefore propose a reinforcement learning model for production planning whose core is to balance the returns from production decisions made while exploring the unknown environment against those made by exploiting what has already been learned. The objective of this study is for the manufacturer to learn a policy whose profit approximates the expected optimal profit attainable when the environment is known.
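Since the setting is modeled as a Markov decision process, a minimal sketch of that formulation may help fix ideas. All notation below (states s_t for the market condition, actions a_t for the production decision, one-period profit r, discount factor γ, transition probabilities P) is illustrative and not taken from the thesis itself.

```latex
% Illustrative MDP formulation; the notation is assumed, not the thesis's own.
% s_t : market condition,  a_t : production decision,  r(s_t,a_t) : one-period profit.
\begin{align*}
  \max_{\pi}\ \mathbb{E}\!\left[\sum_{t=0}^{T-1} \gamma^{t}\, r\bigl(s_t, \pi(s_t)\bigr)\right],
  \qquad
  Q^{*}(s,a) \;=\; r(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s,a)\,\max_{a'} Q^{*}(s',a') .
\end{align*}
```

The learned policy is then measured against the optimal expected profit computed as if the transition probabilities P were known.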
This study applies the Q-Learning algorithm, in which a Q-Table replaces the original profit function when making decisions; the Q-Values measure the value of the decisions made under different environmental conditions. After solving the model numerically, we assess whether the Q-Learning algorithm can recover the best production decisions obtainable when the information is known, compare it with other algorithms, and conduct simulation analyses under different scenarios.
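To make the role of the Q-Table concrete, the following is a minimal tabular Q-Learning sketch in Python. The environment interface (`env.reset()`, `env.step()`), the state and action encodings, the ε-greedy exploration rule, and all parameter values are assumptions for illustration rather than the thesis's actual model.

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_periods,
               alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Minimal tabular Q-Learning sketch: the Q-Table stands in for the
    unknown profit function when choosing a production decision."""
    rng = np.random.default_rng(seed)
    # One Q-Value per (market state, production decision) pair.
    q_table = np.zeros((n_states, n_actions))

    state = env.reset()                          # assumed interface, not the thesis's model
    for _ in range(n_periods):
        # Epsilon-greedy: occasionally explore, otherwise exploit the current best estimate.
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(q_table[state]))

        next_state, profit = env.step(action)    # realized one-period profit

        # Move the Q-Value toward the realized profit plus the best continuation value.
        td_target = profit + gamma * np.max(q_table[next_state])
        q_table[state, action] += alpha * (td_target - q_table[state, action])
        state = next_state

    return q_table
```

Acting greedily with respect to the returned `q_table` (i.e., `np.argmax(q_table[state])`) then yields the production policy whose profit is compared against the known-environment optimum.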
Through these comparisons and simulations, we find that Q-Learning maintains a reasonable profit level even when the number of periods is limited, and that its learning efficiency improves as the number of periods increases, so the Q-Learning algorithm performs particularly well in the model we construct. In contrast, the UCB algorithm learns poorly when the number of periods is small; its profit improves as the number of periods grows, but it still falls short of Q-Learning. The manufacturer might therefore modify the original UCB strategy in the early learning phase by adopting the Q-Learning policy.
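For reference, here is a minimal sketch of a UCB-style selection rule of the kind compared above, treating each production decision as an arm of a multi-armed bandit; the specific confidence-bound form, the constant `c`, and the function name `ucb_select` are assumptions, not the thesis's exact implementation.

```python
import math

def ucb_select(counts, mean_profits, t, c=2.0):
    """UCB-style rule: choose the production decision with the highest
    estimated mean profit plus an exploration bonus that shrinks as the
    decision accumulates observations."""
    # Try every decision at least once before applying the bound.
    for action, n in enumerate(counts):
        if n == 0:
            return action
    bounds = [mean_profits[a] + math.sqrt(c * math.log(t) / counts[a])
              for a in range(len(counts))]
    return max(range(len(bounds)), key=lambda a: bounds[a])
```

One way to read the suggestion above is to follow the greedy Q-Table policy during the early periods, when the exploration bonus is still dominated by noise, and only switch to `ucb_select` once enough observations have accumulated.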