The SMURFS Project: Simulation and Modeling for Understanding Resilience and Faults at Scale

Speaker:  Dorian C Arnold – Orange County, CA, United States
Topic(s):  Applied Computing

Abstract

Current HPC research targets computer systems with exaflop capabilities (10^18, or a quintillion, floating point operations per second). Such computational power will enable important new discoveries across all basic science domains. Application resilience is a major challenge to realizing extreme scale computing systems. The SMURFS Project addresses this challenge by developing methods to improve our predictive understanding of the complex interactions among a given application, a given real or hypothetical hardware and software system environment, and a given fault-tolerance strategy at extreme scale. Specifically, SMURFS explores: (1) advanced simulation and modeling capabilities for studying application resilience at scale; (2) comprehensive, comparative studies of existing and new fault-tolerance strategies; (3) a detailed understanding of how application features interplay with different fault-tolerance strategies and hardware technologies; and (4) effective prescriptions to guide application developers, hardware architects and system designers in realizing efficient, resilient extreme scale capabilities. (This project is a collaboration among Emory University, the University of Tennessee and Sandia National Laboratories. It is funded in part by the National Science Foundation.)

About this Lecture

Number of Slides:  45
Duration:  50 minutes
Languages Available:  English
Last Updated: 
