A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters

Transient faults are becoming a critical concern in the design of current general-purpose multiprocessors. Because they can corrupt program outputs, their impact gains importance for long-running parallel scientific applications, due to the high cost of re-launc...


Bibliographic Details
Main authors: Montezanti, Diego Miguel, Rucci, Enzo, Rexachs del Rosario, Dolores, Luque Fadón, Emilio, Naiouf, Marcelo, De Giusti, Armando Eduardo
Format: Article
Language: English
Published: 2014
Subjects:
Online access: http://sedici.unlp.edu.ar/handle/10915/34544
http://journal.info.unlp.edu.ar/wp-content/uploads/JCST-Apr14-5.pdf
Contributed by:
id I19-R120-10915-34544
record_format dspace
institution Universidad Nacional de La Plata
institution_str I-19
repository_str R-120
collection SEDICI (UNLP)
language Inglés
topic Ciencias Informáticas
transient fault
parallel scientific application
soft error detection tool
message content validation
spellingShingle Ciencias Informáticas
transient fault
parallel scientific application
soft error detection tool
message content validation
Montezanti, Diego Miguel
Rucci, Enzo
Rexachs del Rosario, Dolores
Luque Fadón, Emilio
Naiouf, Marcelo
De Giusti, Armando Eduardo
A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters
topic_facet Ciencias Informáticas
transient fault
parallel scientific application
soft error detection tool
message content validation
description Transient faults are becoming a critical concern in the design of current general-purpose multiprocessors. Because they can corrupt program outputs, their impact is greatest for long-running parallel scientific applications, where incorrect results force a costly re-launch of the execution from the beginning. This paper introduces the SMCV tool, which improves reliability for high-performance systems. SMCV replicates application processes and validates the contents of messages before they are sent, preventing errors from propagating to other processes and limiting detection and notification latency. To assess its utility, the overhead of SMCV is evaluated with three computationally intensive, representative parallel scientific applications. The results demonstrate the efficiency of SMCV in detecting transient fault occurrences.
format Articulo
Articulo
author Montezanti, Diego Miguel
Rucci, Enzo
Rexachs del Rosario, Dolores
Luque Fadón, Emilio
Naiouf, Marcelo
De Giusti, Armando Eduardo
author_facet Montezanti, Diego Miguel
Rucci, Enzo
Rexachs del Rosario, Dolores
Luque Fadón, Emilio
Naiouf, Marcelo
De Giusti, Armando Eduardo
author_sort Montezanti, Diego Miguel
title A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters
title_short A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters
title_full A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters
title_fullStr A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters
title_full_unstemmed A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters
title_sort tool for detecting transient faults in execution of parallel scientific applications on multicore clusters
publishDate 2014
url http://sedici.unlp.edu.ar/handle/10915/34544
http://journal.info.unlp.edu.ar/wp-content/uploads/JCST-Apr14-5.pdf
work_keys_str_mv AT montezantidiegomiguel atoolfordetectingtransientfaultsinexecutionofparallelscientificapplicationsonmulticoreclusters
AT ruccienzo atoolfordetectingtransientfaultsinexecutionofparallelscientificapplicationsonmulticoreclusters
AT rexachsdelrosariodolores atoolfordetectingtransientfaultsinexecutionofparallelscientificapplicationsonmulticoreclusters
AT luquefadonemilio atoolfordetectingtransientfaultsinexecutionofparallelscientificapplicationsonmulticoreclusters
AT naioufmarcelo atoolfordetectingtransientfaultsinexecutionofparallelscientificapplicationsonmulticoreclusters
AT degiustiarmandoeduardo atoolfordetectingtransientfaultsinexecutionofparallelscientificapplicationsonmulticoreclusters
AT montezantidiegomiguel toolfordetectingtransientfaultsinexecutionofparallelscientificapplicationsonmulticoreclusters
AT ruccienzo toolfordetectingtransientfaultsinexecutionofparallelscientificapplicationsonmulticoreclusters
AT rexachsdelrosariodolores toolfordetectingtransientfaultsinexecutionofparallelscientificapplicationsonmulticoreclusters
AT luquefadonemilio toolfordetectingtransientfaultsinexecutionofparallelscientificapplicationsonmulticoreclusters
AT naioufmarcelo toolfordetectingtransientfaultsinexecutionofparallelscientificapplicationsonmulticoreclusters
AT degiustiarmandoeduardo toolfordetectingtransientfaultsinexecutionofparallelscientificapplicationsonmulticoreclusters
bdutipo_str Repositorios
_version_ 1764820470031450113
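
The description field above says that SMCV replicates application processes and validates message contents before they are sent. The following is a minimal sketch of that general idea only, not the SMCV implementation: it assumes two replicas modeled as plain Python callables and a hypothetical `send` callback standing in for the real message-passing layer. The names `validated_send`, `TransientFaultDetected`, and `partial_sum` are illustrative and do not come from the record.

```python
import pickle
from typing import Any, Callable, List


class TransientFaultDetected(Exception):
    """Raised when the two replicas disagree on a message's contents."""


def validated_send(replica_a: Callable[[], Any],
                   replica_b: Callable[[], Any],
                   send: Callable[[bytes], None]) -> None:
    """Send a message only if both replicas produced identical contents.

    Assumption: each replica is a callable returning the value to send.
    A real tool would run the replicas concurrently on separate cores.
    """
    # Each replica independently computes and serializes the payload.
    payload_a = pickle.dumps(replica_a())
    payload_b = pickle.dumps(replica_b())
    # Compare contents before anything leaves the process, so a corrupted
    # value cannot propagate to other processes.
    if payload_a != payload_b:
        raise TransientFaultDetected("replica payloads differ; message not sent")
    send(payload_a)


def partial_sum(values: List[float]) -> float:
    """Stand-in for the computation that precedes a send."""
    return sum(values)


# Hypothetical outbox standing in for the communication layer.
outbox: List[bytes] = []
data = [1.0, 2.0, 3.0]

# Fault-free case: both replicas agree, so the message is sent.
validated_send(lambda: partial_sum(data), lambda: partial_sum(data), outbox.append)

# Simulated transient fault: one replica sees a corrupted operand.
corrupted = [1.0, 2.0, 3.0000001]
try:
    validated_send(lambda: partial_sum(data),
                   lambda: partial_sum(corrupted),
                   outbox.append)
    fault_caught = False
except TransientFaultDetected:
    fault_caught = True
```

In this sketch the comparison happens at the send boundary, so a fault that never reaches an outgoing message goes unnoticed until a later check; the choice of where else to validate (for example, on final results) is left open here.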