Soft errors detection and automatic recovery based on replication combined with different levels of checkpointing

Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent undetected errors will occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR, which is a methodology that improves system reliability against tr...

Bibliographic Details
Main authors: Montezanti, Diego Miguel; Rucci, Enzo; De Giusti, Armando Eduardo; Naiouf, Marcelo; Rexachs del Rosario, Dolores; Luque Fadón, Emilio
Format: Article, Preprint
Language: English
Published: 2020
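Subjects: Computer Science; Soft error detection; Automatic recovery; System-level checkpoint; User-level checkpoint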
Online access: http://sedici.unlp.edu.ar/handle/10915/124463
id I19-R120-10915-124463
record_format dspace
institution Universidad Nacional de La Plata
institution_str I-19
repository_str R-120
collection SEDICI (UNLP)
language English
topic Computer Science
Soft error detection
Automatic recovery
System-level checkpoint
User-level checkpoint
description Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent undetected errors will occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR, which is a methodology that improves system reliability against transient faults when running parallel message-passing applications. Our approach, based on process replication for detection, combined with different levels of checkpointing for automatic recovery, has the goal of helping users of scientific applications to obtain executions with correct results. SEDAR is structured in three levels: (1) only detection and safe-stop with notification; (2) recovery based on multiple system-level checkpoints; and (3) recovery based on a single valid user-level checkpoint. As each of these variants supplies a particular coverage but involves limitations and implementation costs, SEDAR can be adapted to the needs of the system. In this work, a description of the methodology is presented and the temporal behavior of employing each SEDAR strategy is mathematically described, both in the absence and presence of faults. A model that considers all the fault scenarios on a test application is introduced to show the validity of the detection and recovery mechanisms. An overhead evaluation of each variant is performed with applications involving different communication patterns; this is also used to extract guidelines about when it is beneficial to employ each SEDAR protection level. As a result, we show its efficacy and viability to tolerate transient faults in target HPC environments.
format Article
Preprint
author Montezanti, Diego Miguel
Rucci, Enzo
De Giusti, Armando Eduardo
Naiouf, Marcelo
Rexachs del Rosario, Dolores
Luque Fadón, Emilio
author_sort Montezanti, Diego Miguel
title Soft errors detection and automatic recovery based on replication combined with different levels of checkpointing
publishDate 2020
url http://sedici.unlp.edu.ar/handle/10915/124463
bdutipo_str Repositories
_version_ 1764820450428321793