SMCV: a Methodology for Detecting Transient Faults in Multicore Clusters

The challenge of improving the performance of current processors is achieved by increasing the integration scale. This carries a growing vulnerability to transient faults, which increase their impact on multicore clusters running large scientific parallel applications. The requirement for enhancing...


Bibliographic Details
Main authors: Montezanti, Diego Miguel, Frati, Fernando Emmanuel, Rexachs del Rosario, Dolores, Luquet, Emilio, Naiouf, Marcelo, De Giusti, Armando Eduardo
Format: Article
Language: English
Published: 2012
Subjects:
Online access: http://sedici.unlp.edu.ar/handle/10915/96550
https://ri.conicet.gov.ar/11336/66780
http://www2.clei.org/cleiej/paper.php?id=250
Contributed by:
id I19-R120-10915-96550
record_format dspace
institution Universidad Nacional de La Plata
institution_str I-19
repository_str R-120
collection SEDICI (UNLP)
language English
topic Computer Engineering
Computer Science
Parallel scientific application
Multicore cluster
Transient fault
Soft error detection
description The challenge of improving the performance of current processors is achieved by increasing the integration scale. This carries a growing vulnerability to transient faults, which increase their impact on multicore clusters running large scientific parallel applications. The requirement for enhancing the reliability of these systems, coupled with the high cost of rerunning the application from the beginning, creates the motivation for having specific software strategies for the target systems. This paper introduces SMCV, a fully distributed technique that provides fault detection for message-passing parallel applications by validating the contents of the messages to be sent, preventing the transmission of errors to other processes and leveraging the intrinsic hardware redundancy of the multicore. SMCV achieves a wide robustness against transient faults with a reduced overhead, and accomplishes a trade-off between moderate detection latency and low additional workload.
format Article
author Montezanti, Diego Miguel
Frati, Fernando Emmanuel
Rexachs del Rosario, Dolores
Luquet, Emilio
Naiouf, Marcelo
De Giusti, Armando Eduardo
title SMCV: a Methodology for Detecting Transient Faults in Multicore Clusters
publishDate 2012
url http://sedici.unlp.edu.ar/handle/10915/96550
https://ri.conicet.gov.ar/11336/66780
http://www2.clei.org/cleiej/paper.php?id=250
bdutipo_str Repositories
_version_ 1764820492241338373