A Non-checkpoint/restart, Non-algorithm-specific Approach to Fault-tolerance

Speaker:  Dorian C Arnold – Orange County, CA, United States
Topic(s):  Computational Theory, Algorithms and Mathematics

Abstract

Hierarchical or tree-based overlay networks (TBONs) are often used to execute data aggregation operations in a scalable, piecewise fashion. We present state compensation, a scalable failure recovery model for high-bandwidth, low-latency TBON computations. By leveraging inherently redundant state information found in many TBON computations, state compensation avoids explicit state replication (for example, process checkpoints and message logging) and incurs no overhead in the absence of failures. Further, when failures do occur, state compensation uses a weak data consistency model and localized protocols that allow processes to recover from failures independently and responsively. We describe the fundamental state compensation concepts and a prototype implementation integrated into the MRNet TBON infrastructure. Our experiments with this framework suggest that for TBONs supporting up to millions of application processes, state compensation can yield millisecond recovery latencies and inconsequential application perturbation.
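
As a rough illustration of the piecewise aggregation described above, the following toy sketch models a TBON as an in-memory tree whose leaves hold application values and whose internal nodes only combine their children's results. It is not the MRNet API and it does not implement the state compensation protocol; the names Node, aggregate, reparent_orphans and build_tree are hypothetical. The sketch only demonstrates one well-known property that makes recovery without checkpoints plausible: for an associative and commutative reduction such as sum or max, attaching a failed internal node's children to its parent leaves the final result unchanged.

```python
from functools import reduce
import operator


class Node:
    """A tree node: leaves carry a value, internal nodes carry children."""

    def __init__(self, name, value=None, children=None):
        self.name = name
        self.value = value
        self.children = children or []


def aggregate(node, op):
    """Reduce the leaf values below `node`, one subtree at a time."""
    if not node.children:
        return node.value
    return reduce(op, (aggregate(child, op) for child in node.children))


def reparent_orphans(root, failed_name):
    """Drop the internal node named `failed_name` and attach its children
    to its parent (hypothetical recovery step, for illustration only)."""
    for i, child in enumerate(root.children):
        if child.name == failed_name:
            root.children[i:i + 1] = child.children
            return True
        if reparent_orphans(child, failed_name):
            return True
    return False


def build_tree():
    """Two internal nodes, each with two leaves (assumed example topology)."""
    a = Node("a", children=[Node("l1", 3), Node("l2", 5)])
    b = Node("b", children=[Node("l3", 7), Node("l4", 11)])
    return Node("root", children=[a, b])


if __name__ == "__main__":
    tree = build_tree()
    before = aggregate(tree, operator.add)
    reparent_orphans(tree, "a")  # simulate failure of internal node "a"
    after = aggregate(tree, operator.add)
    print(before, after)
```

The sketch covers only a stateless, one-shot reduction. The computations addressed in the lecture may keep state at internal nodes across many messages, which is where the redundancy between a node's state and its neighbors' state would come into play.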

About this Lecture

Number of Slides:  45
Duration:  50 minutes
Languages Available:  English
Last Updated: 
