System Resilience: Amplify Failures, Detect, or Both?

Speaker:  Ganesh Lalitha Gopalakrishnan – Salt Lake City, UT, United States
Topic(s):  Hardware, Power and Energy

Abstract

As we cram billions of transistors onto a chip and build computers from thousands of such chips, the probability that system state bits are transiently corrupted by system noise and high-energy particle strikes goes up. The factors behind such "soft errors" are exacerbated by manufacturing variability, which is higher in smaller lithographies.

Many types of software-based error detectors have been proposed to detect these soft errors and trigger recomputation from state checkpoints.  Unfortunately, most of these detection schemes introduce unacceptable computational overheads and also have unacceptably high false positive rates.

In one line of work, we have ameliorated this situation by focusing on applications such as stencils. In this domain, we guarantee near-100% detection through rigorous floating-point error analysis built on affine arithmetic. We also reduce overheads by covering multiple steps of the stencil application per detector deployment.

This error analysis tells us how many mantissa bits of the floating-point results we should expect to be preserved with respect to the semantics of real arithmetic. If we do not find that many bits preserved at runtime (determined by comparing the detector output with the stencil output), we can attribute the discrepancy to soft errors.
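The bit-preservation check can be illustrated with a small Python sketch. It is not the detector from this work: a double-precision run stands in for the detector output, the stencil is a hypothetical 1D three-point average, and EXPECTED_BITS is an assumed constant where the real bound would come from the affine-arithmetic analysis. All function names are invented for illustration.

```python
import math
import struct

F32_BITS = 24       # significand bits in IEEE-754 single precision
EXPECTED_BITS = 16  # assumed bound; a real one would come from the error analysis


def to_f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def flip_bit_f32(x, bit):
    """Flip one bit in the single-precision encoding of x (simulated soft error)."""
    (n,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", n ^ (1 << bit)))[0]


def stencil_step(u, rnd):
    """One step of a three-point averaging stencil; rnd rounds each operation."""
    out = list(u)  # boundary values stay fixed
    for i in range(1, len(u) - 1):
        s = rnd(u[i - 1] + u[i])
        s = rnd(s + u[i + 1])
        out[i] = rnd(s / 3.0)
    return out


def agreeing_bits(x, ref):
    """Approximate count of leading significand bits on which x matches ref."""
    if x == ref:
        return F32_BITS
    if not math.isfinite(x) or ref == 0.0:
        return 0
    rel = abs(x - ref) / abs(ref)
    return max(0, min(F32_BITS, int(-math.log2(rel))))


def run(u0, steps, fault=None):
    """Run stencil and reference for several steps, then check once.

    fault is None or a (step, index, bit) tuple describing an injected bit flip.
    Returns (flagged, worst_case_agreeing_bits).
    """
    low = [to_f32(v) for v in u0]   # stencil output, single precision
    ref = list(low)                 # stand-in detector, double precision
    for k in range(steps):
        low = stencil_step(low, to_f32)
        ref = stencil_step(ref, float)
        if fault is not None and fault[0] == k:
            low[fault[1]] = flip_bit_f32(low[fault[1]], fault[2])
    bits = min(agreeing_bits(a, b) for a, b in zip(low, ref))
    return bits < EXPECTED_BITS, bits


u0 = [1.0 + 0.1 * i for i in range(8)]
clean_flagged, clean_bits = run(u0, steps=4)
fault_flagged, fault_bits = run(u0, steps=4, fault=(1, 3, 20))
print("clean run:", clean_flagged, clean_bits)
print("faulty run:", fault_flagged, fault_bits)
```

Note that the comparison happens once after four steps rather than after every step, mirroring the idea of covering multiple stencil steps per detector deployment.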

Another contribution we have made to soft error detection is to protect only the address calculation steps involved in indexing arrays and structs. This approach rewrites the LLVM representation of the original program into a new LLVM representation that exaggerates (amplifies) the first failure into a cascading failure that manifests more readily. It incurs low overheads, can exploit instructions found in ARM, and offers 100% detection of address generation unit (AGU) faults.
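One way to picture failure amplification is the Python sketch below. It rests on an assumption not stated in this abstract: that amplification can be modeled by deriving each address from the previous one instead of recomputing it from the base, so a single corrupted address taints every later one and one check at loop exit suffices. The function names and the fault model (an XOR mask applied to one address) are hypothetical, and the real transformation operates on LLVM IR rather than on Python integers.

```python
def independent_addresses(base, stride, n, fault_at=None, fault_mask=0):
    """Each address is recomputed from the base, so a fault stays local."""
    addrs = []
    for i in range(n):
        a = base + i * stride
        if i == fault_at:
            a ^= fault_mask  # simulated bit flip in the address computation
        addrs.append(a)
    return addrs


def chained_addresses(base, stride, n, fault_at=None, fault_mask=0):
    """Each address is derived from the previous one, so a fault cascades."""
    addrs = []
    a = base
    for i in range(n):
        if i > 0:
            a = a + stride
        if i == fault_at:
            a ^= fault_mask  # the corrupted value feeds all later addresses
        addrs.append(a)
    return addrs


def end_check(addrs, base, stride):
    """Single check after the loop: is the final address where it should be?"""
    return addrs[-1] == base + (len(addrs) - 1) * stride


def wrong_count(addrs, base, stride):
    """Number of addresses that differ from the fault-free value."""
    return sum(1 for i, a in enumerate(addrs) if a != base + i * stride)


BASE, STRIDE, N = 0x1000, 8, 8
plain = independent_addresses(BASE, STRIDE, N, fault_at=3, fault_mask=0x40)
chain = chained_addresses(BASE, STRIDE, N, fault_at=3, fault_mask=0x40)
print("independent:", wrong_count(plain, BASE, STRIDE), end_check(plain, BASE, STRIDE))
print("chained:    ", wrong_count(chain, BASE, STRIDE), end_check(chain, BASE, STRIDE))
```

In the independent version the single bad access passes the exit check unnoticed, while in the chained version the same fault corrupts every subsequent address and the exit check fails.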

The main takeaway is that system resilience solutions developed with attention to higher accuracy and lower overheads may prove to be the inevitable safety net on which designers rely as they attempt to reduce energy consumption while Moore's law comes to an end.

About this Lecture

Number of Slides:  60
Duration:  50 minutes
Languages Available:  English
