A methodology for soft errors detection and automatic recovery

Bibliographic Details
Main authors: Montezanti, Diego Miguel; De Giusti, Armando Eduardo; Naiouf, Marcelo; Villamayor, Jorge; Rexachs del Rosario, Dolores; Luque Fadón, Emilio
Format: Conference object
Language: English
Published: 2017
Online access: http://sedici.unlp.edu.ar/handle/10915/129169
id I19-R120-10915-129169
record_format dspace
institution Universidad Nacional de La Plata
institution_str I-19
repository_str R-120
collection SEDICI (UNLP)
language English
topic Computer Science
Soft error detection
Automatic recovery
System-level checkpoint
User-level checkpoint
description Handling faults is a growing concern in HPC: higher error rates, longer detection intervals, and silent faults are expected in the future. It is projected that, in exascale systems, errors will occur several times a day and will propagate, with effects ranging from process crashes to results corrupted by undetected errors. In this article, we propose a methodology that improves system reliability against transient faults when running parallel message-passing applications. The proposed solution, based on process replication, aims to help programmers and users of parallel scientific applications achieve reliable executions with correct results. This work presents a characterization of the strategy, defining its behavior in the presence of faults and modeling the temporal costs of employing it. As a result, we show its efficacy and viability for tolerating transient faults in HPC systems.
format Conference object
author Montezanti, Diego Miguel
De Giusti, Armando Eduardo
Naiouf, Marcelo
Villamayor, Jorge
Rexachs del Rosario, Dolores
Luque Fadón, Emilio
title A methodology for soft errors detection and automatic recovery
publishDate 2017
url http://sedici.unlp.edu.ar/handle/10915/129169