Software
La Valse: Visual Analysis Tool for Fault Characterization of Supercomputers
The La Valse tool is designed to visualize and analyze large-scale heterogeneous supercomputer logs in order to characterize faults, errors, and failures. Currently, the tool provides a user interface for exploring logs from the Mira supercomputer, an IBM Blue Gene/Q system at Argonne National Laboratory. Three types of supercomputer logs are involved:
- RAS (Reliability, Availability, Serviceability) logs
- Cobalt resource manager and backend job logs
- Darshan I/O monitoring logs (not yet supported)
Source Code
Bitbucket: https://bitbucket.org/hanqiguo/lavalse
LogAider
The LogAider tool mines log data from supercomputers to identify correlations between events such as faults, errors, and failures. It supports several analysis algorithms, including across-field correlation analysis, temporal correlation analysis, and spatial correlation analysis (k-means). The tool is primarily designed to analyze the logs of the Mira supercomputer, an IBM Blue Gene/Q system at Argonne National Laboratory.
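To give a rough feel for what relating two fields of a log record can look like, the sketch below estimates the conditional probability of one field's value given another's. It is only an illustration under stated assumptions: the field names (`component`, `severity`), the sample records, and the function `conditional_probabilities` are hypothetical, and this is not necessarily the method LogAider itself implements.

```python
from collections import Counter, defaultdict


def conditional_probabilities(records, given, target):
    """Estimate P(target value | given value) from a list of dict records."""
    # Count how often each (given, target) pair and each given value occurs.
    pair_counts = Counter((r[given], r[target]) for r in records)
    given_counts = Counter(r[given] for r in records)

    table = defaultdict(dict)
    for (g, t), n in pair_counts.items():
        table[g][t] = n / given_counts[g]
    return dict(table)


# Hypothetical records; real RAS log fields and values will differ.
events = [
    {"component": "MEMORY", "severity": "FATAL"},
    {"component": "MEMORY", "severity": "WARN"},
    {"component": "MEMORY", "severity": "FATAL"},
    {"component": "NETWORK", "severity": "WARN"},
]

probs = conditional_probabilities(events, "component", "severity")
for component, dist in probs.items():
    print(component, dist)
```

A table like this highlights field values that tend to appear together; a real analysis would also need enough samples per value for the estimates to be meaningful.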
Related Publications
- Sheng Di, Rinku Gupta, Marc Snir, Eric Pershey, and Franck Cappello. LOGAIDER: A tool for mining potential correlations of HPC log events. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) 2017, pages 442-451, Madrid, Spain, May 14-17, 2017. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5090-6610-0. (Paper)
Source Code
GitHub: https://github.com/disheng222/LogAider
REFINE: REalistic Fault INjEction using compiler-based instrumentation
Compiler-based fault injection (FI) has become a popular technique in resilience studies for understanding the impact of soft errors in supercomputing systems. Compiler-based FI frameworks inject faults at a high intermediate-representation level. However, they are less accurate than machine-code, binary-level FI because they lack access to all dynamic instructions and therefore fail to mimic certain fault manifestations. REFINE is a novel framework that addresses these limitations by performing FI in the compiler backend. This approach provides the portability and efficiency of compiler-based FI while keeping accuracy comparable to binary-level FI methods.
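As a minimal illustration of what a soft error can do to a value, the sketch below flips a single bit in the IEEE 754 representation of a double. It assumes a single-bit-flip fault model, which the description above does not specify, and the helper name `flip_bit` is hypothetical. It works on a plain Python value and does not reflect how REFINE instruments code in the compiler backend.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit (0 = least significant) of its 64-bit encoding flipped."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as an unsigned 64-bit integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    raw ^= 1 << bit  # inject the fault
    # Reinterpret the corrupted integer as a double again.
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw))
    return corrupted


sign_flipped = flip_bit(1.0, 63)      # sign bit
exponent_flipped = flip_bit(1.0, 52)  # lowest exponent bit
mantissa_flipped = flip_bit(1.0, 0)   # lowest mantissa bit
print(sign_flipped, exponent_flipped, mantissa_flipped)
```

The same single flip can change the sign, halve the value, or perturb it by the smallest representable step, depending on the bit position, which is one reason the choice of injection site matters for accuracy.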
Related Publications
- Giorgis Georgakoudis, Ignacio Laguna, Dimitrios S. Nikolopoulos, and Martin Schulz. REFINE: Realistic Fault Injection via Compiler-based Instrumentation for Accuracy, Portability and Speed. In Proceedings of the 30th IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC) 2017, Denver, CO, USA, November 12-17, 2017. IEEE Computer Society, Los Alamitos, CA, USA. Acceptance rate 18.7% (61/327). (Paper)
Source Code
GitHub: https://github.com/ggeorgakoudis/REFINE