A methodology for soft errors detection and automatic recovery
Handling faults is a growing concern in HPC; higher error rates, larger detection intervals and silent faults are expected in the future. It is projected that, in exascale systems, errors will occur several times a day, and they will propagate to generate errors that will range from process crashes...
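Main authors: | Montezanti, Diego Miguel; De Giusti, Armando Eduardo; Naiouf, Marcelo; Villamayor, Jorge; Rexachs del Rosario, Dolores; Luque Fadón, Emilio |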
---|---|
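Format: | Conference object |
Language: | English |
Published: | 2017 |
Subjects: | Computer Science; Soft error detection; Automatic recovery; System-level checkpoint; User-level checkpoint |
Online access: | http://sedici.unlp.edu.ar/handle/10915/129169 |
Contributed by: |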
id |
I19-R120-10915-129169 |
---|---|
record_format |
dspace |
institution |
Universidad Nacional de La Plata |
institution_str |
I-19 |
repository_str |
R-120 |
collection |
SEDICI (UNLP) |
language |
English |
topic |
Computer Science; Soft error detection; Automatic recovery; System-level checkpoint; User-level checkpoint |
spellingShingle |
Computer Science; Soft error detection; Automatic recovery; System-level checkpoint; User-level checkpoint; Montezanti, Diego Miguel; De Giusti, Armando Eduardo; Naiouf, Marcelo; Villamayor, Jorge; Rexachs del Rosario, Dolores; Luque Fadón, Emilio; A methodology for soft errors detection and automatic recovery |
topic_facet |
Computer Science; Soft error detection; Automatic recovery; System-level checkpoint; User-level checkpoint |
description |
Handling faults is a growing concern in HPC; higher error rates, larger detection intervals and silent faults are expected in the future. It is projected that, in exascale systems, errors will occur several times a day, and they will propagate to generate errors that will range from process crashes to corrupted results because of undetected errors. In this article, we propose a methodology that improves system reliability against transient faults, when running parallel message-passing applications. The proposed solution, based on process replication, has the goal of helping programmers and users of parallel scientific applications to achieve reliable executions with correct results. This work presents a characterization of the strategy, defining its behavior in the presence of faults and modeling the temporal costs of employing it. As a result, we show its efficacy and viability to tolerate transient faults in HPC systems. |
format |
Conference object |
author |
Montezanti, Diego Miguel; De Giusti, Armando Eduardo; Naiouf, Marcelo; Villamayor, Jorge; Rexachs del Rosario, Dolores; Luque Fadón, Emilio |
author_facet |
Montezanti, Diego Miguel; De Giusti, Armando Eduardo; Naiouf, Marcelo; Villamayor, Jorge; Rexachs del Rosario, Dolores; Luque Fadón, Emilio |
author_sort |
Montezanti, Diego Miguel |
title |
A methodology for soft errors detection and automatic recovery |
title_short |
A methodology for soft errors detection and automatic recovery |
title_full |
A methodology for soft errors detection and automatic recovery |
title_fullStr |
A methodology for soft errors detection and automatic recovery |
title_full_unstemmed |
A methodology for soft errors detection and automatic recovery |
title_sort |
methodology for soft errors detection and automatic recovery |
publishDate |
2017 |
url |
http://sedici.unlp.edu.ar/handle/10915/129169 |
work_keys_str_mv |
AT montezantidiegomiguel amethodologyforsofterrorsdetectionandautomaticrecovery AT degiustiarmandoeduardo amethodologyforsofterrorsdetectionandautomaticrecovery AT naioufmarcelo amethodologyforsofterrorsdetectionandautomaticrecovery AT villamayorjorge amethodologyforsofterrorsdetectionandautomaticrecovery AT rexachsdelrosariodolores amethodologyforsofterrorsdetectionandautomaticrecovery AT luquefadonemilio amethodologyforsofterrorsdetectionandautomaticrecovery AT montezantidiegomiguel methodologyforsofterrorsdetectionandautomaticrecovery AT degiustiarmandoeduardo methodologyforsofterrorsdetectionandautomaticrecovery AT naioufmarcelo methodologyforsofterrorsdetectionandautomaticrecovery AT villamayorjorge methodologyforsofterrorsdetectionandautomaticrecovery AT rexachsdelrosariodolores methodologyforsofterrorsdetectionandautomaticrecovery AT luquefadonemilio methodologyforsofterrorsdetectionandautomaticrecovery |
bdutipo_str |
Repositories |
_version_ |
1764820452073537537 |