Envelope

The ENVELOPE project covers a set of topics related to proactive failure prediction and failure tolerance in HPC clusters. Its goals are to analyze monitoring data from HPC centers, develop automated methods for failure prediction, and build systems that proactively handle predicted failures using job migration techniques at the system and user level.

The ZDV will design and conduct a survey of German HPC centers to gather information about their monitoring infrastructures and methods. The survey will serve as the foundation for collecting and analyzing monitoring data from these centers. Machine learning methods will then be used to predict component failures in them.

Project Partners

Funding Period

01/2017 - 12/2019

External Links

Project Website

Contact

Publications

2021

  • Alvaro Frank, Manuel Baumgartner, Reza Salkhordeh, and André Brinkmann. 2021. Improving checkpointing intervals by considering individual job failure probabilities. In 35th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 209–309. DOI

2019

  • Alvaro Frank, Dai Yang, Tim Süß, Martin Schulz, and André Brinkmann. 2019. Reducing False Node Failure Predictions in HPC. In 26th IEEE International Conference on High Performance Computing, Data and Analytics (HiPC). DOI
  • Alvaro Frank, Tim Süß, and André Brinkmann. 2019. Effects and benefits of node sharing strategies in HPC batch systems. In Proceedings of the 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS), 43–53. DOI

2018

  • Ramy Gad, Simon Pickartz, Tim Süß, Lars Nagel, Stefan Lankes, Antonello Monti, and André Brinkmann. 2018. Zeroing memory deallocator to reduce checkpoint sizes in virtualized HPC environments. The Journal of Supercomputing 74: 6236–6257. DOI