Journal Papers

  1. Omer Subasi, Sheng Di, Leonardo Bautista-Gomez, Prasanna Balaprakash, Osman Unsal, Jesus Labarta, Adrian Cristal, Sriram Krishnamoorthy, and Franck Cappello. Exploring The Capabilities of Support Vector Machines in Detecting Silent Data Corruptions. Journal of Sustainable Computing: Informatics and Systems (SUSCOM), 2018. ISSN 2210-5379. DOI 10.1016/j.suscom.2018.01.004.

  2. Sheng Di, Hanqi Guo, Rinku Gupta, Eric R. Pershey, Marc Snir, and Franck Cappello. Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System. IEEE Transactions on Parallel and Distributed Systems (TPDS), 2018. ISSN 1045-9219. DOI 10.1109/TPDS.2018.2864184.

Conference Papers

  1. Sheng Di, Hanqi Guo, Eric Pershey, Marc Snir, and Franck Cappello. Characterizing and Understanding HPC Job Failures over The 2K-day Life of IBM BlueGene/Q System. In Proceedings of the 49th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2019, Portland, OR, USA, June 24-27, 2019. IEEE Computer Society, Los Alamitos, CA, USA.
  2. Luanzheng Guo, Dong Li, Ignacio Laguna, and Martin Schulz. FlipTracker: Understanding Natural Error Resilience in HPC Applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) 2018, pages 8:1-8:14, Dallas, TX, November 11–16, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-8384-2. (Paper)
  3. Wenbin He, Hanqi Guo, Tom Peterka, Sheng Di, Franck Cappello, and Han-Wei Shen. Parallel Partial Reduction for Large-Scale Data Analysis and Visualization. In Proceedings of the IEEE Symposium on Large Data Analysis and Visualization (LDAV) 2018, Berlin, Germany, October 21, 2018. IEEE Computer Society, Los Alamitos, CA, USA. To appear.
  4. Mohit Kumar, Saurabh Gupta, Tirthak Patel, Michael Wilder, Weisong Shi, Song Fu, Christian Engelmann, and Devesh Tiwari. Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System. In Proceedings of the 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018, pages 107-114, Luxembourg City, Luxembourg, June 25-28, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-5596-2. ISSN 2158-3927. DOI 10.1109/DSN.2018.00023. Acceptance rate 27.2% (62/228). (Paper)
  5. Bin Nie, Ji Xue, Saurabh Gupta, Tirthak Patel, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. Machine Learning Models for GPU Error Prediction in a Large Scale HPC System. In Proceedings of the 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018, pages 95-106, Luxembourg City, Luxembourg, June 25-28, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-5596-2. ISSN 2158-3927. DOI 10.1109/DSN.2018.00022. Acceptance rate 27.2% (62/228). (Paper)
  6. Hanqi Guo, Sheng Di, Rinku Gupta, Tom Peterka, and Franck Cappello. La VALSE: Scalable Log Visualization for Fault Characterization in Supercomputers. In Proceedings of EuroGraphics Symposium on Parallel Graphics and Visualization (EGPGV) 2018, pages 91-100, Brno, Czech Republic, June 4, 2018. ISBN 978-3-03868-054-3. ISSN 1727-348X. DOI 10.2312/pgv.20181099.
  7. Saurabh Gupta, Tirthak Patel, Christian Engelmann, and Devesh Tiwari. Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications. In Proceedings of the 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017, pages 44:1-44:12, Denver, CO, USA, November 12-17, 2017. ACM Press, New York, NY, USA. ISBN 978-1-4503-5114-0. DOI 10.1145/3126908.3126937. Acceptance rate 18.7% (61/327). (Paper | Presentation)
  8. Giorgis Georgakoudis, Ignacio Laguna, Dimitrios S. Nikolopoulos, and Martin Schulz. REFINE: Realistic Fault Injection via Compiler-based Instrumentation for Accuracy, Portability and Speed. In Proceedings of the 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017, pages 29:1-29:14, Denver, CO, USA, November 12-17, 2017. ACM Press, New York, NY, USA. ISBN 978-1-4503-5114-0. DOI 10.1145/3126908.3126972. Acceptance rate 18.7% (61/327). (Paper)
  9. Bin Nie, Ji Xue, Saurabh Gupta, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities. In Proceedings of the 25th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS) 2017, pages 22-31, Banff, AB, Canada, September 20-22, 2017. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-2764-8. ISSN 2375-0227. DOI 10.1109/MASCOTS.2017.12. Acceptance rate 30.95% (26/84). (Paper)
  10. Sheng Di, Rinku Gupta, Marc Snir, Eric Pershey, and Franck Cappello. LOGAIDER: A tool for mining potential correlations of HPC log events. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) 2017, pages 442-451, Madrid, Spain, May 14-17, 2017. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5090-6610-0. DOI 10.1109/CCGRID.2017.18. (Paper)
  11. Kun Tang, Devesh Tiwari, Saurabh Gupta, Ping Huang, QiQi Lu, Christian Engelmann, and Xubin He. Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy. In Proceedings of the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2016, pages 311-322, Toulouse, France, June 28 - July 1, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 2158-3927. DOI 10.1109/DSN.2016.36. Acceptance rate 22.4% (58/259). (Paper)
  12. Leonardo Bautista-Gomez, Ana Gainaru, Swann Perarnau, Devesh Tiwari, Saurabh Gupta, Franck Cappello, Christian Engelmann, and Marc Snir. Reducing Waste in Extreme Scale Systems Through Introspective Analysis. In Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016, pages 212-221, Chicago, IL, USA, May 23-27, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1530-2075. DOI 10.1109/IPDPS.2016.100. Acceptance rate 23.0% (114/496). (Paper | Presentation)

...