| 研究生: |
林志南 Lin, Chih-Nan |
|---|---|
| 論文名稱: |
以資源限制ETL排程提升資料倉儲效能之精確與啟發式演算法效能比較研究 Enhancing Data Warehouse Performance Through Resource-Constrained ETL Scheduling: A Comparative Study of Exact and Heuristic Algorithms |
| 指導教授: |
王逸琳
Wang, I-Lin |
| 學位類別: |
碩士 Master |
| 系所名稱: |
管理學院 - 工業與資訊管理學系 Department of Industrial and Information Management |
| 論文出版年: | 2025 |
| 畢業學年度: | 113 |
| 語文別: | 中文 |
| 論文頁數: | 62 |
| 中文關鍵詞: | ETL 、資料倉儲 、具資源限制之專案排程問題 、整數規劃 、基因演算法 |
| 外文關鍵詞: | ETL, Data Warehouse, RCPSP, Integer Programming, Genetic Algorithm |
| 相關次數: | 點閱:66 下載:0 |
| 分享至: |
| 查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
隨著企業資料規模擴增,ETL(Extract-Transform-Load)在資料倉儲系統中扮 演關鍵角色。然而,當ETL任務數量急遽增加,傳統的排程方法往往忽略系統資源 使用狀況,導致資源競爭與效能下降。本研究首次將具資源限制之專案排程問題 (Resource-Constrained Project Scheduling Problem,RCPSP)應用於ETL任務排程最 佳化,並提出創新的解決方案。
本研究主要貢獻有三:首先,建立ETL_CSS(Candidate Statistics Set)統計框 架,系統化收集並分析ETL任務執行期間的資源參數,包括CPU可使用率配額(譬 如最多70%)、記憶體可使用率配額(譬如最多50%)等;其次,建構整數規劃數學 模式,將ETL任務間的相依性與資源限制轉化為嚴謹的數學表示,以最小化專案完 工時間(Makespan)為目標;第三,開發並比較三種演算法:(1)使用Gurobi求解器 的精確解法(簡稱GB),(2)結合序列法的貪婪演算法(簡稱GR),以及(3)改良式基 因演算法(簡稱GA)。
實驗結果分析如下:在小規模問題(20個任務以下),GB能在36秒內求得最佳 解;在中等規模問題(21-50個任務)中,以50個任務為例,GB的求解時間大幅增 加至3353秒才能得到最佳解;而GR雖能在42秒內快速求得解,但該解之排程比最 佳解再晚15.4%的時間才完工;GA則透過改良式編碼與客製化交配運算子,可在59 秒內求出較最佳解僅再晚0.9%時間的排程,展現出優異的效能平衡。在大規模問題 (50個任務以上)中,GB已無法在6小時內求得可行解,而GA仍能維持穩定的求解 效能,於85秒內得到可行解。
本研究不僅首次提出將RCPSP應用於ETL任務排程的完整解決方案,更透過嚴 謹的實驗設計與數據分析,量化驗證三種演算法在不同規模問題下的效能表現。研 究成果除了為企業在選擇ETL排程策略時提供具體的效能評估準則,也為資料倉儲 系統效能最佳化建立了理論基礎。
In modern enterprises, Extract-Transform-Load (ETL) plays a crucial role in data warehouse systems. As data volumes grow, traditional ETL scheduling methods often overlook resource constraints, leading to performance degradation. This research pioneers the application of the Resource-Constrained Project Scheduling Problem (RCPSP) to optimize ETL task scheduling.
We develop a comprehensive framework integrating three key components: a statistical collection system for monitoring resource parameters, an integer programming model for minimizing project makespan under resource constraints, and three solution approaches - Gurobi solver (GB), greedy algorithm (GR), and genetic algorithm (GA). Experimental results demonstrate distinct performance characteristics across different problem scales. For small instances (up to 20 tasks), GB achieves optimal solutions within 36 seconds. For medium-scale problems (50 tasks), while GB requires 3,353 seconds to obtain optimal solutions, GA achieves near-optimal results (0.9% gap) in 59 seconds, significantly outperforming GR's solutions which deviate by 15.4% from optimal. For large instances (55 or more tasks), GB fails to converge within 6 hours, while GA maintains consistent performance, generating feasible solutions in 85 seconds.
This research not only establishes a theoretical foundation for resource-aware ETL scheduling but also provides quantitative evidence for selecting appropriate scheduling strategies in practical data warehouse implementations.
中文文獻
許展維 (2012) 以啟發式演算法求解資源受限條件下 資源分派與排程問題。國立中正大學。
英文文獻
Alcaraz, J., Maroto, C., & Ruiz, R. (2003). Solving the multi-mode resource-constrained project scheduling problem with genetic algorithms. Journal of the Operational Research Society, 54(6), 614-626.
Berliantara, A. Y., Wicaksono, S. A., & Pinandito, A. (2017). Scheduling optimization for extract, transform, load (ETL) process on data warehouse using round robin method (case study: University Of XYZ). JITeCS (Journal of Information Technology and Computer Science), 2(2).
Berro, A., Megdiche, I., & Teste, O. (2015). Holistic statistical open data integration based on integer linear programming. IEEE 9th International Conference on Research Challenges in Information Science (RCIS) (pp. 468-479). IEEE.
Biswas, N., Sarkar, A., & Mondal, K. C. (2020). Efficient incremental loading in ETL processing for real-time data integration. Innovations in Systems and Software Engineering, 16(1), 53-61.
Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.
Chowdhury, M., Stoica, I. (2015). Efficient coflow scheduling without prior knowledge. In:Proceedings of the ACM Conference on Special Interest Group on Data Communication. SIGCOMM (pp. 393–406)
Coffman, E.G., Bruno, J.L. (1976). Computer and job-shop scheduling theory. Wiley Publishing, Toronto.
Colak, S., Agarwal, A., & Erenguc, S. (2013). Multi-mode resource-constrained project-scheduling problem with renewable resources: new solution approaches. Journal of Business & Economics Research (Online), 11(11), 455.
Damay, J., Quilliot, A., & Sanlaville, E. (2007). Linear programming based algorithms for preemptive and non-preemptive RCPSP. European Journal of Operational Research, 182(3), 1012-1022.
de Melo, L. V., & de Queiroz, T. A. (2021). Integer linear programming formulations for the RCPSP considering multi-skill, multi-mode, and minimum and maximum time lags. IEEE Latin America Transactions, 19(01), 5-16.
Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. A. M. T. (2002). A fast and elitist multi-objective genetic algorithm: NSGA-II. IEEE transactions on evolutionary computation, 6(2), 182-197.
Eberhart, R., & Kennedy, J. (1995, October). A new optimizer using particle swarm theory. In MHS'95. Proceedings of the Sixth International Symposium on Micro Machine and Human Science (pp. 39-43). IEEE.
Elshaer, R., Shawky, M., Elawady, H., & Nawara, G. (2017). Solving resource-constrained project scheduling problem using genetic algorithm. Journal of Al-Azhar University Engineering Sector, 12(42), 187-198.
Elzeki, O. M., Rashad, M. Z., & Elsoud, M. A. (2012). Overview of scheduling tasks in distributed computing systems. International Journal of Soft Computing and Engineering, 2(3), 470-475.
Garey, M. R., Johnson, D. S., & Sethi, R. (1976). The complexity of flowshop and jobshop scheduling. Mathematics of operations research, 1(2), 117-129.
Grandl, R., Chowdhury, M., Akella, A., & Ananthanarayanan, G. (2016). Altruistic scheduling in multi-resource clusters. In 12th {USENIX} symposium on operating systems design and implementation (pp. 65-80).
Halasipuram, R., Deshpande, P. M., & Padmanabhan, S. (2014). Determining essential statistics for cost based optimization of an ETL workflow. In EDBT (pp. 307-318).
Hartmann, S., & Briskorn, D. (2010). A survey of variants and extensions of the resource-constrained project scheduling problem. European Journal of operational research, 207(1), 1-14.
Karagiannis, A., Vassiliadis, P., & Simitsis, A. (2013). Scheduling strategies for efficient ETL execution. Information Systems, 38(6), 927-945.
Kelley Jr, J. E. (1962). Critical-path planning and scheduling: Mathematical basis. Operations research, 10, 912-915.
Kelley, J. E. (1963). The critical-path method: resource planning and scheduling. Industrial scheduling.
Klimek, M. (2017). Priority algorithms for the problem of financial optimization of a multi-stage project. Applied Computer Science, 13(4).
Li, L. (2018). One scheduling method for ETL task cluster. Computer Technology and Development. 28(11), 41–44
Rivera, J. C., Moreno V, L. F., Díaz S, F. J., & Peña Z, G. E. (2013). A hybrid heuristic algorithm for solving the resource constrained project scheduling problem (RCPSP). Revista EIA, (20), 87-100.
Silberschatz, A., Galvin, P., Gagne, G. (2005). Operating system concepts. Wiley Publishing, Toronto.
Sharma, S., & Jain, R. (2014). Modeling ETL process for data warehouse: an exploratory study. In 2014 Fourth International Conference on Advanced Computing & Communication Technologies (pp. 271-276). IEEE.
Sprecher, A., & Drexl, A. (1998). Multi-mode resource-constrained project scheduling by a simple, general and powerful sequencing algorithm. European Journal of Operational Research, 107(2), 431-450.
Tang, L., Li, H., Chen, M., Dai, Z., & Zhu, M. (2018, September). Enhancing concurrent ETL task schedule with altruistic strategy. In Proceedings of the Computational Methods in Systems and Software (pp. 201-212). Springer, Cham.
Xu, Z., Li, H., Chen, M., Dai, Z., Li, H., & Zhu, M. (2020, July). Multi-objective scheduling optimization for ETL tasks in cluster environment. In Computer Science On-line Conference (pp. 165-177). Springer, Cham.
Wrembel, R. (2019). Still open issues in ETL design and optimization. http://www.cs.put.poznan.pl/rwrembel/ETL-open-issues.pdf. Res. seminar, BarcelonaTech.
Yang, D., & Xu, C. (2019). A distributed scheduling framework of service based ETL process. In CCF Conference on Big Data (pp. 16-32). Springer, Singapore.
Zhang, H., Tam, C. M., & Li, H. (2006). Multimode project scheduling based on particle swarm optimization. Computer‐Aided Civil and Infrastructure Engineering, 21(2), 93-103.
校內:2030-02-08公開