Can Formal Methods (alone) Rescue HPC Debugging?

Speaker:  Ganesh Lalitha Gopalakrishnan – Salt Lake City, UT, United States
Topic(s):  Software Engineering and Programming


High Performance Computing systems are extremely complex ensembles of CPUs and GPUs, powering supercomputers that help understand and manage the earth. The same compute power permeates autonomous systems that one day may drive all of us around. In fact, ideas developed in the field of HPC have successfully transplanted into many areas of machine learning as well as embedded systems.

Due to the extreme degree of concurrency and heterogeneity, HPC systems are very difficult to design and debug. Also, one has to develop deep trust of the numbers these systems spew out.  Formal methods continue to be important in properly describing as well as debugging safety-critical concurrent and parallel systems -- HPC or autonomous.

This talk will cover our formal methods inspired research covering multiple concurrency models, applied to the area of High Performance Computing.  We will describe many of the facets of correctness in HPC software, including issues such as system resilience, floating-point error analysis and tuning, data race checking, and result-reproducibility of everyday tools such as compilers.  While some of these facets impinge on the reliability of the computed results, others are central to extracting the maximum amount of performance from the codes.  Performance enhancement is in fact the central driving force behind the complexity escalation we see in HPC systems.  It is therefore possible to view correctness tools as safety nets for carrying out performance enhancements.

In HPC debugging, however, more than textbook formal methods are needed.  We recently developed an OpenMP data race checker based on collaborations with real users, but more importantly, a mix of many ideas all helping to rein in complexity. Some of these ideas included the use of efficient event tracing from multicore CPUs, data compression to limit memory bandwidth, on-the-fly synthesis of Integer Linear Programs to summarize the traces emitted by loops, and optimized Red-Black Trees to store race descriptions.  We also wrote a Structural Operational Semantic description to guide the construction of our race checker.  The resulting race checker has negligible memory overheads for any program size, while detecting a significant proportion of the lurking races without generating any false positives.

The main take-away message from this talk is that formal methods are central to design trustworthiness all the way to performance improvements. The talk will also emphasize ways in which HPC software correctness differs from that of other types of software.  It will also show through examples that practically relevant formal methods cannot be developed in a vacuum: collaborations with practitioners is essential. These ideas are well-known and this talk provides additional examples from HPC where there is a dearth of such examples.

About this Lecture

Number of Slides:  60
Duration:  50 minutes
Languages Available:  English
Last Updated: 

Request this Lecture

To request this particular lecture, please complete this online form.

Request a Tour

To request a tour with this speaker, please complete this online form.

All requests will be sent to ACM headquarters for review.