Soft errors detection and automatic recovery based on replication combined with different levels of checkpointing
Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent undetected errors will occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR, which is a methodology that improves system reliability against tr...
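Main authors: | Montezanti, Diego Miguel; Rucci, Enzo; De Giusti, Armando Eduardo; Naiouf, Marcelo; Rexachs del Rosario, Dolores; Luque Fadón, Emilio |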
---|---|
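Format: | Article Preprint |
Language: | English |
Published: | 2020 |
Subjects: | Computer Science; Soft error detection; Automatic recovery; System-level checkpoint; User-level checkpoint |
Online access: | http://sedici.unlp.edu.ar/handle/10915/124463 |
Contributed by: | |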
id | I19-R120-10915-124463 |
---|---|
record_format | dspace |
institution | Universidad Nacional de La Plata |
institution_str | I-19 |
repository_str | R-120 |
collection | SEDICI (UNLP) |
language | English |
topic | Computer Science; Soft error detection; Automatic recovery; System-level checkpoint; User-level checkpoint |
topic_facet | Computer Science; Soft error detection; Automatic recovery; System-level checkpoint; User-level checkpoint |
description | Handling faults is a growing concern in HPC. In future exascale systems, silent undetected errors are projected to occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR, a methodology that improves system reliability against transient faults when running parallel message-passing applications. Our approach, based on process replication for detection combined with different levels of checkpointing for automatic recovery, aims to help users of scientific applications obtain executions with correct results. SEDAR is structured in three levels: (1) only detection and safe-stop with notification; (2) recovery based on multiple system-level checkpoints; and (3) recovery based on a single valid user-level checkpoint. As each of these variants supplies a particular coverage but involves limitations and implementation costs, SEDAR can be adapted to the needs of the system. In this work, the methodology is described and the temporal behavior of each SEDAR strategy is modeled mathematically, both in the absence and in the presence of faults. A model that considers all the fault scenarios on a test application is introduced to show the validity of the detection and recovery mechanisms. An overhead evaluation of each variant is performed with applications involving different communication patterns; this is also used to extract guidelines about when it is beneficial to employ each SEDAR protection level. As a result, we show its efficacy and viability to tolerate transient faults in target HPC environments. |
format | Article Preprint |
author | Montezanti, Diego Miguel; Rucci, Enzo; De Giusti, Armando Eduardo; Naiouf, Marcelo; Rexachs del Rosario, Dolores; Luque Fadón, Emilio |
author_facet | Montezanti, Diego Miguel; Rucci, Enzo; De Giusti, Armando Eduardo; Naiouf, Marcelo; Rexachs del Rosario, Dolores; Luque Fadón, Emilio |
author_sort | Montezanti, Diego Miguel |
title | Soft errors detection and automatic recovery based on replication combined with different levels of checkpointing |
publishDate | 2020 |
url | http://sedici.unlp.edu.ar/handle/10915/124463 |
bdutipo_str | Repositories |
_version_ | 1764820450428321793 |
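
The description above combines replication for detection with checkpointing for recovery. The following is a minimal single-process sketch of that general idea, not the authors' SEDAR implementation: each step is computed twice, a mismatch between the two replicas signals a transient fault, and execution rolls back to the last checkpoint taken from a validated state. The names `step`, `flip_bit`, and `run`, the accumulate-squares workload, and the fault-injection parameter `fault_at` are all hypothetical and chosen only for illustration.

```python
def step(state, i):
    # Hypothetical deterministic workload: accumulate squares.
    return state + i * i


def flip_bit(value, bit=3):
    # Simulate a transient fault by flipping one bit of an integer result.
    return value ^ (1 << bit)


def run(n_steps, checkpoint_every, fault_at=None):
    """Run n_steps with duplicated computation and rollback on mismatch.

    fault_at: step index at which one replica's result is corrupted once
    (assumption made for this sketch; None means a fault-free run).
    """
    state = 0
    checkpoint = (0, state)  # (next step index, validated state)
    rollbacks = 0
    i = 0
    while i < n_steps:
        replica_a = step(state, i)
        replica_b = step(state, i)
        if fault_at == i:
            replica_b = flip_bit(replica_b)
            fault_at = None  # transient: the fault does not recur after rollback
        if replica_a != replica_b:
            # Detection: the replicas disagree, so restore the last checkpoint.
            i, state = checkpoint
            rollbacks += 1
            continue
        state = replica_a
        i += 1
        if i % checkpoint_every == 0:
            # Only states that passed the comparison are checkpointed.
            checkpoint = (i, state)
    return state, rollbacks


clean, clean_rollbacks = run(10, 4)
recovered, rollbacks = run(10, 4, fault_at=6)
print(clean, recovered, rollbacks)
```

Because a checkpoint is stored only after the replicas agree, this sketch is closest in spirit to recovering from a single valid checkpoint; keeping a chain of possibly corrupted checkpoints, as the multiple system-level checkpoint variant suggests, would need additional logic that is not shown here.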