Catalog: Characterizing Faults, Errors, and Failures in Extreme-Scale Systems

US Department of Energy (DOE) leadership computing facilities are in the process of deploying extreme-scale high-performance computing (HPC) systems, with the long-range goal of building exascale systems that perform more than a quintillion (a billion billion) operations per second. More powerful computers mean researchers can simulate biological, chemical, and other physical interactions with unprecedented realism. However, as HPC systems become more complex, system integrators, component manufacturers, and computing facilities must prepare for unique computing challenges. Of particular concern are occurrences of unfamiliar or more frequent faults in both hardware technologies and software applications that can lead to computational errors or system failures.

This project will help DOE computing facilities protect extreme-scale systems by characterizing potential faults and creating models that predict their propagation and impact. The Collaboration of Oak Ridge, Argonne, and Lawrence Livermore National Laboratories (CORAL) is a public/private partnership that will stand up three extreme-scale systems in 2017/2018, each operating at about 150 to 200 petaflops: roughly five to seven times the performance of the 27-petaflop Titan at Oak Ridge National Laboratory (currently the fastest system in the United States) and about a tenth of exascale performance.

By monitoring hardware and software performance on current DOE systems, such as Titan, and applying the data to fault analysis and vulnerability studies, this effort will capture observed and inferred fault conditions and extrapolate this knowledge to CORAL and other extreme-scale systems. Using these analyses, the project team will create assessment tools, including a fault taxonomy and catalog as well as fault models, to provide computing facilities with a clear picture of the fault characteristics in DOE computing environments and to inform technical and operational decisions that improve resilience. The catalog, models, and software resulting from this project will be made publicly available.
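
As a rough illustration of what such fault models can look like, the Python sketch below computes the mean time between failures (MTBF) and fits a two-parameter Weibull distribution to failure inter-arrival times. The timestamps are made up for illustration, and the Weibull choice is one common assumption for hardware failure modeling; the project's actual catalog and models are derived from real system logs and are considerably more detailed.

```python
# Illustrative sketch only: summarize failure inter-arrival times and
# fit a simple distributional model. The timestamps below are made up;
# in practice they would be mined from scheduler or console logs.
import numpy as np
from scipy import stats

# Hypothetical failure timestamps in seconds since an arbitrary origin.
failure_times = np.array([0.0, 3.1e4, 9.8e4, 1.5e5, 2.9e5, 3.3e5, 5.0e5])

# Inter-arrival times between consecutive failures.
interarrivals = np.diff(failure_times)

# Mean time between failures (MTBF) as a first-order summary.
print(f"MTBF: {interarrivals.mean() / 3600:.1f} hours")

# Fit a two-parameter Weibull distribution (location fixed at 0), a
# common model for time-between-failures data. A shape parameter < 1
# suggests a decreasing hazard rate (infant mortality); > 1 suggests
# wear-out behavior.
shape, loc, scale = stats.weibull_min.fit(interarrivals, floc=0)
print(f"Weibull shape = {shape:.2f}, scale = {scale / 3600:.1f} hours")
```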

This research is funded by the Office of Advanced Scientific Computing Research, Office of Science,
U.S. Department of Energy – Resilience for Extreme Scale Supercomputing Systems Program.

In the News

2018-11-19: HPCwire: What’s New in HPC Research: Thrill for Big Data, Scaling Resilience and More
2018-09-18: HPCwire: What’s New in HPC Research: September (Part 1)
2018-08-05: insideHPC: Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems

Accomplishments

The project's accomplishments to date are:

  • Developed a common taxonomy of faults, errors and failures in HPC systems
  • Created a catalog and models of faults, errors and failures of:
    • Five systems at the DOE’s Oak Ridge Leadership Computing Facility (OLCF): Jaguar XT4, Jaguar XT5, Jaguar XK6, Titan, and Eos
    • Network errors and failures in OLCF's Titan supercomputer
    • GPGPU soft errors and the interplay between temperature, power and GPGPU soft errors in OLCF's Titan supercomputer
    • The Mira system at the DOE’s Argonne Leadership Computing Facility (ALCF)
    • The spatial correlation of DRAM errors in Tri-Lab Linux clusters at the DOE’s Livermore Computing facility
  • Developed offline analysis software tools to study past fault, error and failure events using system logs:
    • ORNL’s RAVEN tool uses an interactive graphical user interface to visualize and analyze events in logs from OLCF’s Titan system
    • ANL’s LogAider tool mines correlations between events in logs from ALCF’s Mira system
    • ANL’s La VALSE tool uses an interactive graphical user interface to visualize and automatically analyze events in logs from ALCF’s Mira system
    • ORNL's multi-user Big Data analytics framework for supercomputer log data
  • Implemented LLNL’s REFINE fault injection framework to study the error masking and propagation properties of scientific applications running on DOE’s HPC systems under realistic fault scenarios (a minimal bit-flip injection sketch follows this list)
  • Created error and failure propagation and containment models and performed corresponding studies:
    • Modeled the propagation of errors in different code regions of HPC codes using LLNL’s REFINE fault injection framework
    • Developed a model to reason about resilience code patterns, i.e., sequences or groups of operations that help to explain why certain codes are naturally resilient to errors
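
To make the fault injection approach concrete, the sketch below flips a single bit in one partial product of a dot product and classifies the outcome against a fault-free run. It is an illustrative stand-in written for this page, not REFINE itself: REFINE injects faults into compiled application code, while this sketch only demonstrates the underlying idea of transient bit-flip injection and the distinction between masked errors and silent data corruption (SDC).

```python
# Minimal illustration of single-bit-flip fault injection and error
# masking/propagation, in the spirit of (but much simpler than) REFINE.
import random
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of a 64-bit IEEE 754 float."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]

def dot(a, b, inject_at=None, bit=0):
    """Dot product; optionally corrupt one partial product."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        p = x * y
        if i == inject_at:
            p = flip_bit(p, bit)  # inject a transient soft error
        total += p
    return total

random.seed(42)
a = [random.random() for _ in range(1000)]
b = [random.random() for _ in range(1000)]
golden = dot(a, b)  # fault-free reference result

# Injection campaign: flip a random bit in a random partial product
# and classify the outcome relative to the fault-free "golden" run.
for trial in range(5):
    i, bit = random.randrange(len(a)), random.randrange(64)
    faulty = dot(a, b, inject_at=i, bit=bit)
    rel_err = abs(faulty - golden) / abs(golden)
    outcome = "masked" if rel_err < 1e-12 else f"SDC (rel. error {rel_err:.2e})"
    print(f"trial {trial}: element {i}, bit {bit} -> {outcome}")
```

In a real injection campaign, many thousands of such trials across different code regions produce the kind of per-region masking and propagation statistics used in the project's error propagation models.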

Impact

The research results of this project provide a clear picture of the fault characteristics in DOE computing environments. They improve resilience by enabling reliable fault detection at an early stage and by providing actionable information for efficient mitigation during system design, software development, and runtime. Ultimately, the results enable parallel application correctness and execution efficiency despite frequent faults, errors, and failures in extreme-scale systems, substantially advancing scientific discovery through the efficient use of DOE's extreme-scale HPC systems.

Project Personnel

Role                        | Name                    | Institution                            | E-mail Address      | Phone Number
Principal Investigator (PI) | Christian Engelmann     | Oak Ridge National Laboratory          | engelmannc@ornl.gov | (865) 574-3132
Institutional Co-PI         | Ignacio Laguna          | Lawrence Livermore National Laboratory |                     | (925) 422-7308
Institutional Co-PI         | Franck Cappello         | Argonne National Laboratory            | cappello@anl.gov    | (630) 252-7198
Post-doctoral Researcher    | Rizwan Ashraf           | Oak Ridge National Laboratory          | ashrafra@ornl.gov   | (865) 576-6897
Additional Investigator     | Swen Boehm              | Oak Ridge National Laboratory          | boehms@ornl.gov     | (865) 576-6125
Additional Investigator     | Sheng Di                | Argonne National Laboratory            | sdi1@anl.gov        | (630) 252-1520
Post-doctoral Researcher    | Hanqi Guo               | Argonne National Laboratory            | hguo@anl.gov        | (630) 252-7225
Additional Investigator     | Rinku Gupta             | Argonne National Laboratory            |                     | (630) 252-6266
Post-Master's Researcher    | Yawei Hui               | Oak Ridge National Laboratory          | huiy@ornl.gov       |
Additional Investigator     | Byung-Hoon (Hoony) Park | Oak Ridge National Laboratory          |                     | (865) 576-3365
Additional Investigator     | Devesh Tiwari           | Northeastern University                |                     | (617) 373-8999

Project Wiki

The project wiki is located at https://ornlwiki.atlassian.net/wiki/display/CFEFIES. Access to most content requires authentication and is intended primarily for internal use.

Getting Access

For internal-use access, e-mail Christian Engelmann with your full name and desired username, and request to be added to the "CFEFIES users" group. When you receive the e-mail from the wiki (ornlwiki.atlassian.net), complete the setup process.

Forgot your password?

Follow the "Forgot your password?" link on the wiki login page.