
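Graduate student: 羅基堯 (Lo, Chi-Yao)
Thesis title: 智慧型計算機效能評估器 (Intelligent Computer Performance Evaluator)
Advisors: 楊浩青 (Yang, Haw-Ching); 鄭芳田 (Cheng, Fan-Tien)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Institute of Manufacturing Engineering
Year of publication: 2007
Graduation academic year: 95 (ROC calendar)
Language: Chinese
Number of pages: 96
Keywords (Chinese): resource evaluator, failure detection, time-to-failure, failure symptom, fail-over, Near-Zero-Down-Time, key resource
Keywords (English): Failure Detection, Fail-Over Scheme (FOS), Time-to-Failure (TTF), Counter, Sick Pattern, Near-Zero-Down-Time, Intelligent Performance Evaluator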
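  • For a manufacturing business, a production environment with high availability (High-Availability) is essential: an unexpected shutdown on the production line causes substantial cost losses. To ensure the availability of each piece of equipment on the line, the typical method is failure detection that uses a heartbeat to identify abnormal nodes. This kind of method, however, has a high false alarm rate for failures of the equipment's applications, makes the causes hard to trace, and can hardly detect an application failure in advance.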

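    This study proposes an Intelligent Performance Evaluator (IPE) that assesses the availability of an equipment's computer resources and provides both failure detection and time-to-failure prediction. For failure detection, IPE monitors the computer's key resources, summarizes their state with dynamic principal component analysis, measures performance through response time, and uses a sick-pattern analogy method to classify the likely failure symptoms. For failure prediction, IPE evaluates how far resources have been consumed and uses linear regression to predict the time-to-failure (TTF), that is, how much time remains before a key resource is exhausted. Finally, for availability, IPE alerts the fail-over mechanism before resources run out, so that the fail-over mechanism can switch to another computer and reach Near-Zero-Down-Time.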
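This record does not give the thesis's actual dynamic PCA formulation. As a rough illustration only, the sketch below shows one common way to make PCA "dynamic": each sample of resource counters is stacked with its preceding samples before the leading principal component is extracted. The function names, the single lag, the counter values, and the use of power iteration are all assumptions made for this example.

```python
import math

def lagged_rows(samples, lags):
    """Stack each sample with its `lags` predecessors (the 'dynamic' part)."""
    rows = []
    for t in range(lags, len(samples)):
        row = []
        for k in range(lags + 1):
            row.extend(samples[t - k])
        rows.append(row)
    return rows

def standardize(rows):
    """Scale every column to zero mean and unit sample variance."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    stds = []
    for j in range(m):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)
        stds.append(math.sqrt(var) or 1.0)  # constant column: avoid dividing by 0
    return [[(r[j] - means[j]) / stds[j] for j in range(m)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, m = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(m)]
            for i in range(m)]

def first_component(cov, iterations=200):
    """Leading eigenvector by power iteration.

    Assumes the uniform start vector is not orthogonal to that eigenvector.
    """
    m = len(cov)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

# Hypothetical counters: (CPU %, memory MB) sampled at regular intervals.
samples = [(float(i), 2.0 * i + 1.0) for i in range(10)]
rows = standardize(lagged_rows(samples, lags=1))
loadings = first_component(covariance(rows))
print(loadings)
```

Because the two made-up counters rise together, all four lagged columns load equally on the first component. Real counters would give unequal loadings, and a production version would more likely rely on a numerical library than on hand-written power iteration.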

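    In the results, IPE's dynamic principal component analysis and sick-pattern analogy method not only reduce the application-failure false alarm rate from 43% to nearly zero but also indicate the possible causes of computer resource failures. The prediction of the shortest time-to-failure can serve as a reference for maintenance staff or as input to other systems (such as a fail-over system). Overall application availability improves from 99.86% with Heartbeat (703.60 minutes of downtime per year) to 99.91% with Response Time (456.12 minutes of downtime per year). Based on the similarity of the performance trend over the leading 90%, and with the aid of early time-to-failure prediction, it improves further to 99.99% (27.6 minutes of downtime per year).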
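The relation between yearly downtime and availability quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a 365-day year of continuous operation; the record does not state which time base the thesis used, and the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year

def availability_from_downtime(downtime_minutes):
    """Fraction of the year the application is up."""
    return 1.0 - downtime_minutes / MINUTES_PER_YEAR

def downtime_from_availability(availability):
    """Yearly downtime in minutes implied by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR

# Downtime figures reported in the abstract (minutes per year).
reported = [
    ("Heartbeat", 703.60),
    ("Response Time", 456.12),
    ("IPE with TTF prediction", 27.6),
]
for label, minutes in reported:
    print(f"{label}: {availability_from_downtime(minutes) * 100:.4f}%")
```

Under this assumption the three downtimes map to roughly 99.866%, 99.913%, and 99.995%, which agrees with the reported 99.86%, 99.91%, and 99.99% if those figures were truncated to two decimals.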

    For manufacturers, a manufacturing environment with high availability is essential. An unexpected shutdown during the manufacturing process causes large manufacturing losses. To ensure the availability of manufacturing equipment, a heartbeat strategy is typically used to identify failed equipment nodes. However, this strategy has three major limitations: a high false alarm rate for application failures, difficulty in tracing root causes, and an inability to predict application failures.
    This thesis proposes an Intelligent Performance Evaluator (IPE), which consists of two kernel modules: failure detection and time-to-failure (TTF) prediction. In the failure detection module, IPE monitors various node counters and applies Principal Component Analysis to generalize failure patterns and response time in order to identify failure causes. In the TTF prediction module, IPE evaluates the node's resource utilization via linear regression analysis to estimate how much time remains before the resources are exhausted. IPE can then trigger a failover scheme that backs up the original node and switches to another node. In this way, IPE can reach the near-zero-down-time level.
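The TTF prediction idea described above can be sketched as follows: fit a straight line to recent resource-usage samples and extrapolate to the point where usage reaches the resource's limit. This is only an illustration under stated assumptions. The function names, the sample values, and the 200 MB limit are hypothetical, and the record does not say how the thesis selects samples or limits.

```python
def fit_line(times, values):
    """Ordinary least-squares fit of values = slope * time + intercept."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return slope, intercept

def time_to_failure(times, values, limit):
    """Time left after the last sample until the fitted line reaches `limit`.

    Returns None when usage is flat or falling, because no exhaustion
    can be predicted from the fitted trend.
    """
    slope, intercept = fit_line(times, values)
    if slope <= 0:
        return None
    exhaustion_time = (limit - intercept) / slope
    return max(0.0, exhaustion_time - times[-1])

# Hypothetical data: memory in use (MB), sampled once per minute.
minutes = [0, 1, 2, 3, 4]
memory_mb = [100, 110, 120, 130, 140]
ttf = time_to_failure(minutes, memory_mb, limit=200)
print(ttf)
```

With these made-up samples, memory grows by 10 MB per minute from 100 MB, so the fitted line reaches the 200 MB limit at minute 10, six minutes after the last sample. A monitor could raise its fail-over alert once this value drops below the time a switch-over needs.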
    As a result, the failure detection module not only reduced the application false alarm rate from 43% (with the heartbeat strategy) to nearly 0% (with IPE), but also provided the possible root causes of failures and shortened the repair time for engineers. Overall, the TTF prediction module improved application availability from 99.86% to 99.99%, which means the equipment downtime was shortened from 703.60 minutes to 27.6 minutes per year.

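    Chinese Abstract
    English Abstract
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Research Background
      1.2 Research Motivation and Objectives
      1.3 Research Scope and Limitations
      1.4 Research Process
      1.5 Thesis Organization
    Chapter 2 Literature Review and Theoretical Foundations
      2.1 Review of Related Literature
        2.1.1 Prediction System Architecture for Computer Cluster Service Mechanisms
        2.1.2 Computer Performance Evaluation
      2.2 Related Theoretical Foundations
        2.2.1 Principal Component Analysis
        2.2.2 Cluster Analysis
        2.2.3 Linear Regression Analysis
      2.3 FMECA
        2.3.1 Failure Analysis
        2.3.2 Monitored Variable Analysis
    Chapter 3 Computer Performance Evaluator
      3.1 IPE Design Mechanism
      3.2 IPE Design
        3.2.1 Requirements Analysis and Use Cases
        3.2.2 Process Flow
      3.3 Sick Pattern Management Issues
      3.4 Designing Failure Cases and Training Sick Patterns
    Chapter 4 Implementation and Comparison of Experimental Results
      4.1 Implementation Environment
      4.2 IPE Architecture
        4.2.1 Hardware Architecture
        4.2.2 Software Architecture
      4.3 Case Study
        4.3.1 Designing Failure Factors
        4.3.2 Training Sick Patterns
        4.3.3 Tuning Accuracy
        4.3.4 Results of Practical Application
        4.3.5 Application Results for Other Cases
      4.4 Integrating FOS
      4.5 Failure Detection Experimental Results and Comparison
      4.6 Time-to-Failure Prediction Experimental Results
      4.7 Experimental Results with Fail-Over
      4.8 Downtime and Availability Calculation
    Chapter 5 Conclusion
      5.1 Thesis Summary
      5.2 Future Research Directions
    References
    Appendix A. Sequence Diagrams of the Object-Oriented Design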


    On-campus release: 2010-08-20
    Off-campus release: 2012-08-20