Ganesh L. Gopalakrishnan is a Professor of Computer Science in the School of Computing at the University of Utah. He is serving as the Director of Graduate Studies and Director of the Center for Parallel Computing at Utah ("CPU").
His recent work covers many aspects of the correctness of parallel /concurrent programs that are used to program today's supercomputers. These aspects include data race checking, system resilience, and floating-point reasoning. See Research Highlights below.
He has mentored 38 undergraduate REU students, 25 PhD students (21 finished PhDs), and 4 postdoctoral researchers. He has written two textbooks: "Computation Engineering: Applied Automata Theory and Logic" (2006), and "Automata and Computability: A Programmer's Perspective" (2019). The latter book is accompanied by a software system "Jove" written in Python to interactively explore central concepts in automata and formal verification using Jupyter notebooks.
Prof. Gopalakrishnan is an ACM Distinguished Scientist, a co-founder of the conference series Advanced Research on Asynchronous Circuits and Systems (1994), co-organizer of the Formal Methods in Computer Aided Design conference (1998), co-organizer of the Computer Aided Verification conference (2011), and area vice-chair of the Programming Languages track of Supercomputing (2016). In 2018, he gave an invited keynote at the Second International Workshop on Software Correctness for HPC Applications, and also taught a one-week international course (sponsored by the Government of India) on High Performance Computing basics.
Research Highlights Summary:
Data Race Detection: Shared memory threads created using languages such as OpenMP running on multicore CPUs can suffer from data races due to unsynchronized memory conflicts. Such races may remain lurking until revealed by compiler optimizations, and cause data corruptions. The Archer tool developed jointly with LLNL and RWTH Aachen is within production use within LLNL for detecting data races within OpenMP programs that are central to the nation's scientific research. Sword, another race checker developed by our group, requires only a fixed and small (about 3 MB) memory buffer per thread during trace collection for race checking. Thus, it can be applied to any OpenMP program without fear of exceeding available memory. It effectively shifts race checking to novel offline analysis algorithms.
System Resilience: As we cram billion of transistors into a chip, and build computers with thousands of such chips, the probability of system state bits transiently getting corrupted due to system noise and high energy particle strikes goes up. In one line of work, we have developed an efficient and accurate race checker for stencil programs called FPDetect. Another research effort is focused on making the software fail even faster when hit with a bit-flip, thus manifesting the error (i.e. not hiding and letting it fester). The "fail fast" method essentially is an LLVM compiler transformation that rewrites the given program into one that manifests address generation faults with low overheads and guaranteed detection.
Floating-point Reasoning: His group's work includes rigorous methods for floating-point round-off analysis to assist with precision reduction that helps save energy. Another line of work supported by a tool called FLiT helps in the process of porting critically important codes---such as climate codes---reliably across machines, without changing answers.
To request a single lecture/event, click on the desired lecture and complete the Request Lecture Form.
Automata and Computability using Jupyter and Formal Methods
Automata and computability form central theoretical pillars of computer science, and as such are included in almost all undergraduate curricula. In the study of these topics, it is...
- Can Formal Methods (alone) Rescue HPC Debugging?High Performance Computing systems are extremely complex ensembles of CPUs and GPUs, powering supercomputers that help understand and manage the earth. The same compute power permeates...
- Keeping Science on Keel when Software MovesSignificant investments are made into the creation and maintenance of high-performance computing software, involving dozens of computer scientists and domain scientists working hard over...
- System Resilience: Amplify Failures, Detect, or Both?As we cram billion of transistors into a chip, and build computers with thousands of such chips, the probability of system state bits transiently getting corrupted due to system noise and...
To request a tour with this speaker, please complete this online form.
If you are not requesting a tour, click on the desired lecture and complete the Request this Lecture form.
All requests will be sent to ACM headquarters for review.
- Can Formal Methods (alone) Rescue HPC Debugging?