Software

La Valse: Visual Analysis Tool for Fault Characterization of Supercomputers

The La Valse tool visualizes and analyzes large-scale, heterogeneous supercomputer logs to characterize faults, errors, and failures. Currently, the tool provides a user interface for exploring logs from the Mira supercomputer, an IBM Blue Gene/Q system at Argonne National Laboratory. Three types of logs are involved:

  • RAS (Reliability, Availability, Serviceability) logs
  • Cobalt resource manager and backend job logs
  • Darshan I/O monitoring logs (not yet supported)

Source Code

Bitbucket: https://bitbucket.org/hanqiguo/lavalse

LogAider

The LogAider tool mines supercomputer log data to identify correlations between events such as faults, errors, and failures. It supports several analysis algorithms, including across-field correlation analysis, temporal correlation analysis, and spatial correlation analysis (k-means). The tool is primarily designed to analyze the logs of the Mira supercomputer, an IBM Blue Gene/Q system at Argonne National Laboratory.
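
To give a feel for what temporal correlation analysis of log events can look like, here is a minimal Python sketch. It is not LogAider's actual algorithm or API: the function name `follow_rate`, the event kinds, the window size, and the sample records are all hypothetical. It measures how often an event of one kind is followed by an event of another kind within a fixed time window.

```python
from bisect import bisect_right


def follow_rate(events, first, second, window):
    """Fraction of `first` events followed by a `second` event within `window` seconds."""
    first_times = sorted(t for t, kind in events if kind == first)
    second_times = sorted(t for t, kind in events if kind == second)
    if not first_times:
        return 0.0
    hits = 0
    for t in first_times:
        # Look for `second` events in the half-open interval (t, t + window].
        lo = bisect_right(second_times, t)
        hi = bisect_right(second_times, t + window)
        if hi > lo:
            hits += 1
    return hits / len(first_times)


# Hypothetical (timestamp_seconds, event_kind) records, not real Mira data.
events = [
    (0, "WARN"), (5, "FATAL"),
    (100, "WARN"), (400, "FATAL"),
    (500, "WARN"), (510, "FATAL"),
    (900, "WARN"),
]

rate = follow_rate(events, "WARN", "FATAL", window=60)
print(rate)
```

A high rate suggests that the first kind of event may be a useful precursor of the second. A real analysis would also need to account for how often the second kind occurs by chance.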

Source Code

GitHub: https://github.com/disheng222/LogAider

REFINE: REalistic Fault INjEction using compiler-based instrumentation

Compiler-based fault injection (FI) has become a popular technique in resilience studies for understanding the impact of soft errors on supercomputing systems. Compiler-based FI frameworks inject faults at a high-level intermediate representation (IR). However, they are less accurate than binary-level FI, which operates on machine code, because they lack access to all dynamic instructions and therefore fail to mimic certain fault manifestations. REFINE is a novel framework that addresses these limitations by performing FI in the compiler backend. This approach provides the portability and efficiency of compiler-based FI while keeping accuracy comparable to that of binary-level FI methods.
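
For readers unfamiliar with fault injection, the Python sketch below illustrates the single-bit-flip fault model that is commonly used to emulate soft errors. It is only an illustration of that fault model, assumed here for the example. It does not reproduce REFINE's compiler-backend instrumentation, and the helper name `flip_bit` is hypothetical.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 double encoding inverted."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 64)")
    # Reinterpret the double as a 64-bit unsigned integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert the chosen bit and reinterpret the result as a double.
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


original = 1.0
sign_flipped = flip_bit(original, 63)  # bit 63 is the sign bit
low_flipped = flip_bit(original, 0)    # bit 0 is the least significant mantissa bit

print(sign_flipped, low_flipped)
```

The same flip can change a value drastically or almost imperceptibly depending on which bit is hit, which is one reason resilience studies inject many faults at many locations.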

Source Code

GitHub: https://github.com/ggeorgakoudis/REFINE