Journal Papers
- Omer Subasi, Sheng Di, Leonardo Bautista-Gomez, Prasanna Balaprakash, Osman Unsal, Jesus Labarta, Adrian Cristal, Sriram Krishnamoorthy, and Franck Cappello. Exploring The Capabilities of Support Vector Machines in Detecting Silent Data Corruptions. Journal of Sustainable Computing: Informatics and Systems (SUSCOM), 2018. ISSN 2210-5379. DOI 10.1016/j.suscom.2018.01.004
- Sheng Di, Hanqi Guo, Rinku Gupta, Eric R. Pershey, Marc Snir, and Franck Cappello. Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System. IEEE Transactions on Parallel and Distributed Systems (TPDS), 2018. To appear.
Conference Papers
- Luanzheng Guo, Dong Li, Ignacio Laguna, and Martin Schulz. FlipTracker: Understanding Natural Error Resilience in HPC Applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) 2018, Dallas, TX, November 11–16, 2018. IEEE Computer Society, Los Alamitos, CA, USA. To appear.
- Wenbin He, Hanqi Guo, Tom Peterka, Sheng Di, Franck Cappello, and Han-Wei Shen. Parallel Partial Reduction for Large-Scale Data Analysis and Visualization. In Proceedings of the IEEE Symposium on Large Data Analysis and Visualization (LDAV) 2018, Berlin, Germany, October 21, 2018. IEEE Computer Society, Los Alamitos, CA, USA. To appear.
- Mohit Kumar, Saurabh Gupta, Tirthak Patel, Michael Wilder, Weisong Shi, Song Fu, Christian Engelmann, and Devesh Tiwari. Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System. In Proceedings of the 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018, pages 107-114, Luxembourg City, Luxembourg, June 25-28, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-5596-2. ISSN 2158-3927. DOI 10.1109/DSN.2018.00023. Acceptance rate 27.2% (62/228). (Paper)
- Bin Nie, Ji Xue, Saurabh Gupta, Tirthak Patel, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. Machine Learning Models for GPU Error Prediction in a Large Scale HPC System. In Proceedings of the 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018, pages 95-106, Luxembourg City, Luxembourg, June 25-28, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-5596-2. ISSN 2158-3927. DOI 10.1109/DSN.2018.00022. Acceptance rate 27.2% (62/228). (Paper)
- Hanqi Guo, Sheng Di, Rinku Gupta, Tom Peterka, and Franck Cappello. La VALSE: Scalable Log Visualization for Fault Characterization in Supercomputers. In Proceedings of EuroGraphics Symposium on Parallel Graphics and Visualization (EGPGV) 2018, pages 91-100, Brno, Czech Republic, June 4, 2018. ISBN 978-3-03868-054-3. ISSN 1727-348X. DOI 10.2312/pgv.20181099.
- Saurabh Gupta, Tirthak Patel, Christian Engelmann, and Devesh Tiwari. Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications. In Proceedings of the 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017, pages 44:1-44:12, Denver, CO, USA, November 12-17, 2017. ACM Press, New York, NY, USA. ISBN 978-1-4503-5114-0. DOI 10.1145/3126908.3126937. Acceptance rate 18.7% (61/327). (Paper | Presentation)
- Giorgis Georgakoudis, Ignacio Laguna, Dimitrios S. Nikolopoulos, and Martin Schulz. REFINE: Realistic Fault Injection via Compiler-based Instrumentation for Accuracy, Portability and Speed. In Proceedings of the 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017, pages 29:1-29:14, Denver, CO, USA, November 12-17, 2017. ACM Press, New York, NY, USA. ISBN 978-1-4503-5114-0. DOI 10.1145/3126908.3126972. Acceptance rate 18.7% (61/327). (Paper)
- Bin Nie, Ji Xue, Saurabh Gupta, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities. In Proceedings of the 25th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS) 2017, pages 22-31, Banff, AB, Canada, September 20-22, 2017. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-2764-8. ISSN 2375-0227. DOI 10.1109/MASCOTS.2017.12. Acceptance rate 30.95% (26/84). (Paper)
- Sheng Di, Rinku Gupta, Marc Snir, Eric Pershey, and Franck Cappello. LOGAIDER: A tool for mining potential correlations of HPC log events. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) 2017, pages 442-451, Madrid, Spain, May 14-17, 2017. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5090-6610-0. (Paper)
- Kun Tang, Devesh Tiwari, Saurabh Gupta, Ping Huang, QiQi Lu, Christian Engelmann, and Xubin He. Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy. In Proceedings of the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2016, pages 311-322, Toulouse, France, June 28 - July 1, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 2158-3927. DOI 10.1109/DSN.2016.36. Acceptance rate 22.4% (58/259). (Paper)
- Leonardo Bautista-Gomez, Ana Gainaru, Swann Perarnau, Devesh Tiwari, Saurabh Gupta, Franck Cappello, Christian Engelmann, and Marc Snir. Reducing Waste in Extreme Scale Systems Through Introspective Analysis. In Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016, pages 212-221, Chicago, IL, USA, May 23-27, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1530-2075. DOI 10.1109/IPDPS.2016.100. Acceptance rate 23.0% (114/496). (Paper | Presentation)
Workshop Papers
- Yawei Hui, Byung Hoon (Hoony) Park, and Christian Engelmann. A Comprehensive Informative Metric for Analyzing HPC System Status using the LogSCAN Platform. In Proceedings of the 31st International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2018: 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2018, Dallas, TX, USA, November 16, 2018. IEEE Computer Society, Los Alamitos, CA, USA. Acceptance rate 45.0% (9/20).
- Rizwan Ashraf and Christian Engelmann. Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer. In Proceedings of the 31st International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2018: 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2018, Dallas, TX, USA, November 16, 2018. IEEE Computer Society, Los Alamitos, CA, USA. Acceptance rate 45.0% (9/20).
...
- Byung Hoon (Hoony) Park, Yawei Hui, Swen Boehm, Rizwan Ashraf, Christian Engelmann, and Christopher Layton. A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log. In Proceedings of the 19th IEEE International Conference on Cluster Computing (Cluster) 2018: 5th Workshop on Monitoring and Analysis for High Performance Systems Plus Applications (HPCMASPA) 2018, Belfast, UK, September 10, 2018. IEEE Computer Society, Los Alamitos, CA, USA. To appear.
- Byung Hoon (Hoony) Park, Saurabh Hukerikar, Christian Engelmann, and Ryan Adamson. Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale. In Proceedings of the 18th IEEE International Conference on Cluster Computing (Cluster) 2017: 4th Workshop on Monitoring and Analysis for High Performance Systems Plus Applications (HPCMASPA) 2017, pages 758-765, Honolulu, HI, USA, September 5, 2017. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-2327-5. ISSN 2168-9253. DOI 10.1109/CLUSTER.2017.113. (Paper | Presentation)
- Franck Cappello, Rinku Gupta, Sheng Di, Emil Constantinescu, Thomas Peterka, and Stefan M. Wild. Understanding and improving the trust in results of numerical simulations and scientific data analytics. In Lecture Notes in Computer Science: Proceedings of the 23rd European Conference on Parallel and Distributed Computing (Euro-Par) 2017 Workshops: 10th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 557-568, Santiago de Compostela, Spain, August 29, 2017. Springer Verlag, Berlin, Germany. ISBN 978-3-319-75177-1. DOI 10.1007/978-3-319-75178-8_44. Acceptance rate 66.7% (4/6).
- Ayush Patwari, Ignacio Laguna, Martin Schulz, and Saurabh Bagchi. Understanding the Spatial Characteristics of DRAM Memory Errors in HPC Clusters. In Proceedings of the 26th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) 2017: 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2017, pages 17-22, Washington, D.C., June 26-30, 2017. ACM Press, New York, NY, USA. ISBN 978-1-4503-5001-3. DOI 10.1145/3086157.3086164. (Paper)
White Papers
- Devesh Tiwari, Saurabh Gupta, and Christian Engelmann. Lightweight, Actionable Analytical Tools Based on Statistical Learning for Efficient System Operations. White paper submitted to the U.S. Department of Energy's Workshop on Modeling & Simulation of Systems & Applications (ModSim) 2016, August 10-12, 2016. (Paper)
Talks
- Christian Engelmann. Characterizing Faults, Errors, and Failures in Extreme-Scale Systems. Invited talk at the Platform for Advanced Scientific Computing (PASC) Conference 2018, Basel, Switzerland, July 2-4, 2018. (Presentation)
- Christian Engelmann. Characterizing Faults, Errors, and Failures in Extreme-Scale Systems. Invited talk at the 6th Accelerated Data Analytics and Computing (ADAC) Institute Workshop, Zurich, Switzerland, June 20-21, 2018. (Presentation)
- Christian Engelmann. A Catalog of Faults, Errors, and Failures in Extreme-Scale Systems. Invited talk at the SIAM Annual Meeting (AM) 2017, Pittsburgh, PA, USA, July, 2017. (Presentation)
- Christian Engelmann. Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems. Invited talk at the International Supercomputing Conference (ISC) 2017, Frankfurt am Main, Germany, June 16-22, 2017. (Presentation)
- Christian Engelmann. A Catalog of Faults, Errors, and Failures in Extreme-Scale Systems. Invited talk at the 12th Scheduling for Large Scale Systems Workshop (SLSSW) 2017, Knoxville, TN, USA, May 24-26, 2017. (Presentation)
- Christian Engelmann. The Missing High-Performance Computing Fault Model. Invited talk at the 17th SIAM Conference on Parallel Processing for Scientific Computing (PP) 2016, Paris, France, April 12-15, 2016. (Presentation)
- Martin Schulz. Characterizing Faults on Production Systems. Invited talk at the 17th SIAM Conference on Parallel Processing for Scientific Computing (PP) 2016, Paris, France, April 12-15, 2016.
- Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the United States Naval Academy, Annapolis, MD, USA, February 18, 2016. (Presentation)
- Ignacio Laguna. Compiler-level Techniques to Improve the Reliability of High-performance Computing Applications. Invited talk at the School of Engineering, University of California, Merced, USA, December 4, 2015.
- Christian Engelmann. Toward A Fault Model And Resilience Design Patterns For Extreme Scale Systems. Keynote talk at the 8th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, held in conjunction with the 21st European Conference on Parallel and Distributed Computing (Euro-Par) 2015, Vienna, Austria, August 24-28, 2015. (Presentation)