The ENVELOPE project covers a set of topics related to proactive failure prediction and tolerance in HPC clusters. Its goals are to analyze monitoring data in HPC centers, develop automated methods for failure prediction, and create systems that proactively handle predicted failures using job migration techniques at system and user level.
The ZDV (Zentrum für Datenverarbeitung) will design and conduct a survey of German HPC centers to gather information about their available monitoring infrastructures and methods. The survey will serve as the basis for collecting and analyzing monitoring data from these centers. Machine learning methods will then be used to predict component failures in them.
- Chair for Computer Architecture and Parallel Processing, Karlsruhe Institute of Technology
- Rechnertechnik und Rechnerorganisation, TU München
- Institute for Automation of Complex Power Systems, RWTH Aachen
- Johannes Gutenberg University Mainz, Zentrum für Datenverarbeitung
01/2017 - 12/2019
- Frank, Alvaro ; Süß, Tim ; Brinkmann, André:
Effects and benefits of node sharing strategies in HPC batch systems
Proceedings of the 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS). 2019. (Conference paper)
- Gad, Ramy ; Pickartz, Simon ; Süß, Tim ; Nagel, Lars ; Lankes, Stefan ; Monti, Antonello ; Brinkmann, André:
Zeroing memory deallocator to reduce checkpoint sizes in virtualized HPC environments
The Journal of Supercomputing. Vol. 74. 2018. P. 6236 - 6257