Extreme-scale, high-performance computing (HPC) significantly advances discovery in fundamental scientific processes by enabling multiscale simulations that range from the very small, on quantum and atomic scales, to the very large, on planetary and cosmological scales. Computing at scales of hundreds of petaflops, of exaflops (quintillions, or billion billions, of operations per second), and beyond will also lend a competitive advantage to the US energy and industrial sectors by providing the computing power for rapid design and prototyping and for big data analysis.

To build and effectively operate extreme-scale HPC systems, the US Department of Energy cites several key challenges. One of them is resilience: efficient and correct operation despite faults or defects in system components that can cause errors. These innovative systems require equally innovative components designed to communicate and compute at unprecedented rates, scales, and levels of complexity, which increases the probability of hardware and software faults.

This research project offers a structured hardware and software design approach for improving resilience in extreme-scale HPC systems so that scientific applications running on these systems generate accurate solutions in a timely and efficient manner. Frequently used in computer engineering, design patterns identify problems and provide generalized solutions through reusable templates.

Using a novel resilience design pattern concept, this project identifies and evaluates repeatedly occurring resilience problems and coordinates solutions throughout hardware and software components in HPC systems. This effort will create comprehensive methods and metrics by which system vendors and computing centers can establish mechanisms and interfaces to coordinate flexible fault management across hardware and software components and optimize the cost-benefit trade-offs among performance, resilience, and power consumption. Reusable programming templates of these patterns will offer resilience portability across different HPC system architectures and permit design space exploration and adaptation to different design trade-offs.
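To illustrate what a reusable programming template for a resilience pattern might look like, here is a minimal sketch of a generic checkpoint and rollback wrapper. It is not taken from the project's design pattern specification; the names `run_with_rollback`, `DetectedError`, `step`, and `is_valid` are hypothetical, and the in-memory checkpoint is an assumption made to keep the example self-contained.

```python
import copy
from typing import Callable, TypeVar

State = TypeVar("State")


class DetectedError(Exception):
    """Signals that an error was detected in the application state."""


def run_with_rollback(
    state: State,
    step: Callable[[State], State],
    is_valid: Callable[[State], bool],
    n_steps: int,
    checkpoint_interval: int = 2,
    max_rollbacks: int = 10,
) -> State:
    """Run `step` n_steps times, rolling back to the last checkpoint on error.

    `step` and `is_valid` are supplied by the application; the template only
    coordinates detection, checkpointing, and recovery.
    """
    checkpoint = copy.deepcopy(state)  # hypothetical in-memory checkpoint
    checkpoint_step = 0
    completed = 0
    rollbacks = 0
    while completed < n_steps:
        try:
            state = step(state)
            if not is_valid(state):
                raise DetectedError(f"invalid state after step {completed}")
            completed += 1
            if completed % checkpoint_interval == 0:
                checkpoint = copy.deepcopy(state)
                checkpoint_step = completed
        except DetectedError:
            rollbacks += 1
            if rollbacks > max_rollbacks:
                raise  # give up: the fault appears to be persistent
            state = copy.deepcopy(checkpoint)
            completed = checkpoint_step
    return state


def demo() -> int:
    """Count to 5 with one transient fault injected on the fourth call."""
    calls = {"count": 0}

    def flaky_increment(x: int) -> int:
        calls["count"] += 1
        if calls["count"] == 4:
            return -1  # simulated data corruption
        return x + 1

    return run_with_rollback(0, flaky_increment, lambda x: x >= 0, n_steps=5)
```

Because the detection check and the recovery action are passed in as parameters, the same template could in principle be paired with different detectors or checkpoint stores, which is the kind of adaptation to different design trade-offs the text describes.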

This research is funded by the Office of Advanced Scientific Computing Research,
Office of Science, U.S. Department of Energy – Early Career Research Program.

In the News

2015-07-15: ASCR Discovery – Mounting a charge. Early-career awardees attack exascale computing on two fronts: power and resilience.
2015-07-15: HPC Wire – Tackling Power and Resilience at Exascale.

Accomplishments

The accomplishments of this project to date are:

The resilience design patterns in production HPC systems and recent resilience technologies have been identified and incorporated into a design pattern specification.
These patterns were evaluated, and their efficiency and reliability were modeled. The results were incorporated into the design pattern specification.
A fault-tolerant generalized minimal residual method (FT-GMRES) linear solver with portable resilience was developed that is capable of dealing with Message Passing Interface (MPI) process failures and data corruption.
The resilience design pattern implementations and resilience-performance trade-offs of this multi-resilient FT-GMRES solver have been evaluated using fault injection.
New, outcome-based metrics for HPC resilience were created to better measure the impact of an HPC resilience solution on correctness and time-to-solution.
The design pattern specification was documented in the form of a technical report:
- Saurabh Hukerikar and Christian Engelmann. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale (Version 1.2). Technical Report ORNL/TM-2017/745, Oak Ridge National Laboratory, Oak Ridge, TN, USA, August 2017. DOI: 10.2172/1436045.
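The fault injection evaluation and outcome-based metrics mentioned above can be illustrated with a toy experiment. The sketch below is not the project's FT-GMRES solver or its metric definitions: it swaps in a plain Jacobi iteration on a small, hypothetical 2x2 system, flips one bit of the iterate to simulate data corruption, and reports two outcomes, whether the solver converged to the correct solution and how many iterations it needed (used here as a stand-in for time-to-solution). All function and parameter names are assumptions for illustration.

```python
import struct

# Hypothetical diagonally dominant test system; exact solution is (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
B = [1.0, 2.0]


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of a float's IEEE 754 binary64 encoding."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def jacobi(a, b, tol=1e-10, max_iters=1000, inject_at=None, bit=51):
    """Jacobi iteration; optionally corrupt x[0] at iteration `inject_at`.

    Returns (solution, iterations, converged).
    """
    n = len(b)
    x = [0.0] * n
    for iteration in range(1, max_iters + 1):
        x = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
        if iteration == inject_at:
            x[0] = flip_bit(x[0], bit)  # injected fault: silent data corruption
        residual = max(
            abs(b[i] - sum(a[i][j] * x[j] for j in range(n))) for i in range(n)
        )
        if residual < tol:  # a NaN residual never passes this check
            return x, iteration, True
    return x, max_iters, False


def run_experiment():
    """Compare a fault-free run with a run that has one injected bit flip."""
    clean_x, clean_iters, clean_ok = jacobi(A, B)
    faulty_x, faulty_iters, faulty_ok = jacobi(A, B, inject_at=10, bit=51)
    return {
        "clean": (clean_x, clean_iters, clean_ok),
        "faulty": (faulty_x, faulty_iters, faulty_ok),
        "extra_iterations": faulty_iters - clean_iters,
    }
```

In this toy setting the iteration is self-correcting, so a mantissa bit flip costs extra iterations rather than correctness. A flip in an exponent bit can instead produce a non-finite value, which is one reason explicit detection and recovery mechanisms are evaluated alongside the solver.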

Impact

This project enables the systematic improvement of resilience in extreme-scale HPC systems. It identifies and evaluates repeatedly occurring resilience problems and coordinates solutions throughout hardware and software components. Ultimately, the results enable parallel application correctness and execution efficiency despite frequent faults, errors and failures in extreme-scale systems. This substantially contributes to advances in the race for scientific discovery through computation with the efficient use of DOE’s extreme-scale HPC systems.

Project Personnel

Role	Name	Institution	E-mail Address	Phone Number
Principal Investigator (PI)	Christian Engelmann	Oak Ridge National Laboratory	engelmannc@ornl.gov	(865) 574-3132
Post-doctoral Research Associate	Rizwan Ashraf	Oak Ridge National Laboratory	ashrafra@ornl.gov	(865) 576-6897
Post-doctoral Research Associate	Piyush Sao	Oak Ridge National Laboratory	saopk@ornl.gov

Project Wiki

The project wiki is located at https://ornlwiki.atlassian.net/wiki/display/RDP. Most of its content requires authentication and is intended primarily for internal use.

Getting Access

For internal-use access, e-mail Christian Engelmann with your name and desired username, and ask to be added to the "resilience design patterns users" group. When you receive the e-mail from the wiki (ornlwiki.atlassian.net), complete the setup process.

Forgot your password?

Follow the "Forgot your password?" link on the wiki login page.
