HPC applications run on systems comprising thousands of compute nodes. The growing complexity and node count of such systems increase the probability that at least one compute node fails during an application's execution. Predicting when a failure might occur allows preemptive mitigation techniques, such as checkpointing or migration, to salvage computations. Such failure prediction systems need past data from which to predict failures, as well as data that defines when a failure has occurred.
We used this SLURM and sensor data to investigate how to reduce false positive predictions in failure prediction methods for HPC nodes. The traces provided here contain failure events defined with the help of hardware sensors, Linux counters and batch system events of failed user jobs. Expert knowledge was used to determine the thresholds at which readings are deemed intolerable for a production system and tend to require hardware maintenance.
Data Set Description
The failure traces provided here were collected from the end-of-life Mogon I HPC system at the Johannes Gutenberg University Mainz over a period of 6 months between August 2018 and January 2019. The Mogon system, introduced in 2012, consisted of 555 sampled nodes. Each node was equipped with four AMD CPUs with 16 cores each, resulting in a total of 35,520 cores. The system had 444 nodes with 128 GiB RAM, 96 nodes with 256 GiB RAM and 15 nodes with 512 GiB RAM. Each node also included 1.5 TB of local hard drive space.
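The node and core counts above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch in Python. It assumes that the 16 cores are per CPU, which is the reading that reproduces the stated total of 35,520 cores. The variable names are illustrative and not part of the data set.

```python
# Node counts by memory configuration (GiB of RAM -> number of nodes),
# taken from the system description.
nodes_by_ram_gib = {128: 444, 256: 96, 512: 15}

cpus_per_node = 4
cores_per_cpu = 16  # assumption: 16 cores per CPU, not per node

total_nodes = sum(nodes_by_ram_gib.values())
total_cores = total_nodes * cpus_per_node * cores_per_cpu
total_ram_gib = sum(ram * count for ram, count in nodes_by_ram_gib.items())

print(f"nodes: {total_nodes}, cores: {total_cores}, RAM: {total_ram_gib} GiB")
```

The three memory configurations sum to the 555 sampled nodes, and 555 nodes with 64 cores each give the stated core count.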
The data was collected from the system's SLURM batch system and hardware sensor databases. The data contains two lists of events collected from SLURM (failures and non-failures), each giving the node name and the timestamp at which the event occurred. For each listed event, the archive includes a separate file containing sensor data from the corresponding node. Each sensor data file contains readings collected at 15-second intervals up to the timestamp of the failure.
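To illustrate how an event relates to its sensor file, the sketch below models an event as a small record and computes the sample times one would expect for a 15-second series ending at a failure timestamp. It is only a sketch under assumptions: the `Event` class, its field names, the node name, the example timestamp and the helper `expected_sample_times` are hypothetical, and it assumes the last sample coincides with the event timestamp. The actual file names and column layout must be taken from the archive itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Sampling interval stated in the data set description.
SAMPLE_INTERVAL = timedelta(seconds=15)


@dataclass(frozen=True)
class Event:
    """One entry of an event list (field names are hypothetical)."""
    node: str            # node name
    timestamp: datetime  # time at which the event occurred
    is_failure: bool     # True for the failure list, False for the non-failure list


def expected_sample_times(event: Event, n_samples: int) -> list[datetime]:
    """Return n_samples timestamps spaced 15 s apart, ending at the event time.

    Assumes the final sample falls exactly on the event timestamp.
    """
    return [event.timestamp - i * SAMPLE_INTERVAL
            for i in reversed(range(n_samples))]


# Made-up example event, for illustration only.
example = Event(node="node0001",
                timestamp=datetime(2018, 9, 1, 12, 0, 0),
                is_failure=True)
times = expected_sample_times(example, 4)
for t in times:
    print(t.isoformat())
```

A grid like this can help detect missing samples when a sensor file is aligned against its event, provided the real timestamps follow the assumed spacing.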
Please cite the following paper if you use our traces in your research:
- Frank, Alvaro ; Yang, Dai ; Süß, Tim ; Schulz, Martin ; Brinkmann, André:
Reducing False Node Failure Predictions in HPC
26th IEEE International Conference on High Performance Computing, Data and Analytics (HiPC). 2019. pp. 323-332 (conference paper)
The metadata of the collection can be downloaded using the iRODS link https://irods-web.zdv.uni-mainz.de/irods-rest/rest/dataObject/zdv/project/zdvresearch/mogon1_failures/failures_mogon1.zip//metadata?ticket=Bc79MdMPptvjuiz
The data set can be downloaded from the iRODS archive at the JGU Mainz using the link https://irods-web.zdv.uni-mainz.de/irods-rest/rest/fileContents/zdv/project/zdvresearch/mogon1_failures/failures_mogon1.zip?ticket=Bc79MdMPptvjuiz
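For scripted access, the download could look roughly like the following Python sketch, which uses only the standard library. The function names `default_filename` and `download_dataset` are hypothetical helpers. The sketch assumes that the link and its access ticket are still valid and that the server returns the zip file directly.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

# Download link as given in the data set description.
DATASET_URL = (
    "https://irods-web.zdv.uni-mainz.de/irods-rest/rest/fileContents/"
    "zdv/project/zdvresearch/mogon1_failures/failures_mogon1.zip"
    "?ticket=Bc79MdMPptvjuiz"
)


def default_filename(url: str) -> str:
    """Derive a local file name from the last path component of the URL."""
    return Path(urlparse(url).path).name


def download_dataset(url: str = DATASET_URL, target_dir: str = ".") -> Path:
    """Fetch the archive into target_dir and return the local path.

    Assumes the server answers the request with the zip file itself.
    """
    target = Path(target_dir) / default_filename(url)
    urlretrieve(url, target)
    return target


# Usage (requires network access):
#     archive_path = download_dataset(target_dir="data")
```

Deriving the file name from the URL path keeps the ticket query string out of the local file name.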