The ENVELOPE project covers a set of topics related to proactive failure prediction and tolerance in HPC clusters. Its goals include analyzing monitoring data in HPC centers, developing automated methods for failure prediction, and creating systems that proactively handle predicted failures using job migration techniques at system and user level.
The ZDV (Zentrum für Datenverarbeitung) will design and conduct a survey of German HPC centers to gather information about their monitoring infrastructures and methods. The survey will serve as the foundation for collecting and analyzing monitoring data from these centers. Machine learning methods will then be used to predict component failures.
- Chair for Computer Architecture and Parallel Processing, Karlsruhe Institute of Technology
- Rechnertechnik und Rechnerorganisation, TU München
- Institute for Automation of Complex Power Systems, RWTH Aachen
- Johannes Gutenberg University Mainz, Zentrum für Datenverarbeitung
01/2017 - 12/2019
- Alvaro Frank, Manuel Baumgartner, Reza Salkhordeh, and André Brinkmann. 2021. Improving checkpointing intervals by considering individual job failure probabilities. In 35th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 209–309.
- Ramy Gad, Simon Pickartz, Tim Süß, Lars Nagel, Stefan Lankes, Antonello Monti, and André Brinkmann. 2018. Zeroing memory deallocator to reduce checkpoint sizes in virtualized HPC environments. The Journal of Supercomputing 74: 6236–6257.
- Alvaro Frank, Dai Yang, Tim Süß, Martin Schulz, and André Brinkmann. 2019. Reducing false node failure predictions in HPC. In 26th IEEE International Conference on High Performance Computing, Data and Analytics (HiPC).
- Alvaro Frank, Tim Süß, and André Brinkmann. 2019. Effects and benefits of node sharing strategies in HPC batch systems. In Proceedings of the 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS), 43–53.