| 研究生: |
陳韋翰 Chen, Wei-Han |
|---|---|
| 論文名稱: |
應用在訊息傳遞系統中的複合式檢查點機制 A Hybrid Checkpointing Scheme in Message Passing Systems |
| 指導教授: |
徐立群
Shu, LihChyun |
| 學位類別: |
碩士 Master |
| 系所名稱: |
管理學院 - 會計學系 Department of Accountancy |
| 論文出版年: | 2007 |
| 畢業學年度: | 95 |
| 語文別: | 中文 |
| 論文頁數: | 39 |
| 中文關鍵詞: | 骨牌效應 、檢查點 、容錯 |
| 外文關鍵詞: | checkpoint, domino effect, fault tolerance |
| 相關次數: | 點閱:118 下載:3 |
| 分享至: |
| 查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
當採用檢查點基礎式機制以達成容錯時,除了設定檢查點之外,確保系統狀態可從錯誤中回復至一致性是該領域重要的課題。因此,額外的協同運作負擔經常無法避免,甚而降低了程式應有的效能。近年來,許多針對降低協同運作負擔檢查點的研究正廣泛地展開,結合靜態程式分析與事先在原始碼中插入檢查點的想法為其中之一。我們在此篇論文提出複合式檢查點機制,希望能兼具靜態程式分析與即時檢查點法的優勢,利用Find Orphan-free Coupling nodes 演算法在extended control flow graph 中決定one consistent cut of checkpoints 並應用在許多常見的inter-process interacting pattern。我們用tightly coupling strategy 解決trouble path 問題,亦即checkpoint X happened before 其他checkpoint Y(X 與Y 分屬不同process)。此外,如果靜態程式分析發現application 在執行迴圈時可能產生trouble path,tightly coupling strategy 不須將checkpoint statements 移出迴圈就可解決該問題。在我們的容錯方法下,錯誤復原可控制在一個checkpoint interval 內完成,避免骨牌效應產生。
If we apply checkpoint-based protocols to achieve fault-tolerance, besides taking checkpoints, it is a significant issue to ensure that consistent global states can be
recovered when failures occur. Additional failure-free coordination overheads are ineluctable so that reduce the performance. Recently, many intensive researches have been studied to eliminate such overheads including by analyzing distributed programs and statically inserting checkpoint statements at the proper places in the source code. In this thesis, we propose a hybrid checkpoint scheme to leverage the advantages of both static analysis and online checkpointing. An algorithm to find orphan-free coupling nodes in extended control flow graph is shown and we apply it to several commonly used inter-process interacting paradigms. Tightly coupling strategy is to avoid any trouble path that checkpoint X happened before checkpoint Y from different processes in the CFG. However,if the application being analyzed may have trouble paths while executing operations in loops, it is unnecessary for tightly coupling strategy to move the checkpoint statement outside the loop to avoid trouble paths. Under our hybrid checkpoint scheme, the extent of recovery from failures can be bounded to at most one checkpoint interval such that domino effect will never appear.
[AERHM99] L. Alvisi, E.N. Elnozahy, S. Rao, S.A. Husain, and
A.D. Mel. “An Analysis of Communication-Induced Checkpointing
,” in Digest of Papers, FTCS-29, The 29th Annual International
Symposium on Fault-Tolerant Computing, 1999.
[AFF03] A. Agbaria, A. Freund, and R. Friedman. “Evaluating
Distributed Checkpointing Protocols,” in Proceedings of the 23rd
International Conference on Distributed Computing Systems, 2003.
[ALRL04] A. Avizienis, J.-C. Laprie, B. Randell, and C.Landwehr.
Basic Concepts and Taxonomy of Dependable and Secure Computing, ”
IEEE Trans. on Dependable and Secure Computing, 1(1): 11-33,2004.
[Andrews99] Gregory R. Andrews. “Foundations of Multithreaded,
Parallel, and Distributed Programming,”Addison-Wesley, 2000.
[AS05] Adnan Agbaria and William H. Sanders.“Application-Driven Coordination-Free Distributed Checkpointing,”in Proceedings of the 25th IEEE International Conference on Distributed Computing Systems, 2005.
[BHMR95] R. Baldoni, J. M. Helary, A. Mostefaoui, M. Raynal,
“On Modeling Consistent Checkpoints and the Domino Effect in Distributed Systems,” Tech. Rep.RR-2564, IRISA, July 1995.
[BL88] B. Bhargava and S. R. Lian. “ Independent checkpointing and concurrent rollback for recovery—An optimistic approach, “ In Proceedings of the Sixth International Conference on Data Engineering, pp 182-189 ,1988.
[CL85] K.M. Chandy and L. Lamport. “Distributed Snapshots: Determining Global States of Distributed Systems, " ACM Trans. on Computer Systems,3(1):63-75, Feb. 1985.
[CR72] K.M. Chandy and C.V. Ramamoorthy. “Rollback and Recovery Strategies for Computer Programs,"IEEE Trans. on Computers, 21(6):546-556, June 1972.
[EAWJ02] E.N. Elnozahy, L. Alvisi, Y.-M. Wang and D.B. Johnson.“A Survey of Rollback-Recovery Protocols in Message-Passing System, " ACM Computing Surveys, 34(3): 375-408, Sept. 2002.
[LWK03] C.Y. Lin, S.C. Wang and S.Y. Kuo. “An Efficient Time-Based Checkpointing Protocol for Mobile Computing Systems over Mobile IP,” Mobile Network Applications, 8:687-697, 2003.
[Neumann] Peter G. Neumann. “Illustrative Risks to the Public in the Use of Computer Systems and Related Technology,”http://www.csl.sri.com/users/neumann/illustrative.html
[Plank93] J.S. Plank. “Efficient Checkpointing on MIMD Architectures,” Ph.D. thesis, Princeton University,1993.
[PT01] J. S. Planck and M. G. Thomason, “Processor allocation and checkpoint interval selection in cluster computing systems,” Journal of Parallel and distributed Computing, 2001.
[Randell75] B. Randell. “System structure for software fault tolerance,” IEEE Trans. on Software. Engineering.1(2):220-232, 1975.
[ROC] The Berkeley/Stanford Recovery-Oriented Computing Project. http://roc.cs.berkeley.edu/
[Tsai05] Jichiang Tsai. “An Efficient Index-Based Checkpointing Protocol with Constant-Size Control Information on Messages,”IEEE Trans. on Dependable and Secure Computing, 2(4): 287-296,Oct.-Dec. 2005.
[Wang93] Y.-M Wang. “Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems,” Ph.D. Thesis, University of Illinois, Department of Computer Science, 1993.
[Wang97] Y.-M Wang. “Consistent Global Checkpoints that Contain A Set of Local Checkpoints,” IEEE Trans. on Computers, 46(4):456-468, 1997.
[WF93] K. F. Wong and M. A. Franklin, “Distributed computing systems and checkpointing,” in 2nd Int. Symp. On High Performance Distributed Computing (HPDC’93). IEEE CS Press, pp. 224–23, July 1993.