SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing Systems

In a context of high error rates, unreliable results and high verification costs, this thesis aims to help scientists and programmers of parallel applications obtain reliable results within a predictable time. To accomplish this goal, we have designed and developed the S...

Bibliographic Details
Main author: Montezanti, Diego Miguel
Format: Article, Review
Language: English
Published: 2020
Subjects: Computer Science; Soft Error Detection and Automatic Recovery
Online access: http://sedici.unlp.edu.ar/handle/10915/108015
Contributed by:
id I19-R120-10915-108015
record_format dspace
institution Universidad Nacional de La Plata
institution_str I-19
repository_str R-120
collection SEDICI (UNLP)
language English
topic Computer Science
Soft Error Detection and Automatic Recovery
description In a context of high error rates, unreliable results and high verification costs, this thesis aims to help scientists and programmers of parallel applications obtain reliable results within a predictable time. To that end, we have designed and developed the SEDAR (Soft Error Detection and Automatic Recovery) methodology, which provides tolerance to transient faults in systems consisting of message-passing applications that run on multicore clusters. SEDAR is based on process replication and on monitoring both the messages to be sent and the local computation, taking advantage of the intrinsic hardware redundancy of multicores. SEDAR provides three variants: detection and automatic relaunch from the beginning; automatic recovery based on the storage of multiple system-level checkpoints (periodic or synchronized with events); and automatic recovery based on a single safe application-level checkpoint. The main goal is the design of the methodology and the functional validation, using an analytical verification model, of its effectiveness in detecting transient faults and automatically recovering executions; a SEDAR prototype is also implemented. Tests carried out with this prototype characterize the temporal behavior, i.e., the overhead introduced by each variant. They also show the flexibility to dynamically choose the most convenient alternative to adapt to system requirements (such as maximum allowed overhead or completion time), indicating that SEDAR is a viable and effective methodology for tolerating transient faults in HPC. Unlike specific strategies, which provide partial resilience for certain applications at the cost of modifying them, SEDAR is essentially transparent and agnostic regarding the protected algorithm.
format Article
Review
author Montezanti, Diego Miguel
title SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing Systems
publishDate 2020
url http://sedici.unlp.edu.ar/handle/10915/108015
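
The description above outlines detection by process replication with validation of messages before they are sent, followed by recovery from a checkpoint. The following is a minimal single-process sketch of that general idea, not the SEDAR prototype itself. Everything in it is an assumption made for illustration: the names `step` and `run_protected`, the toy computation, the injected bit flip, and the simplification that only one checkpoint is kept and that it is taken right after a validated message (so it is assumed to be clean).

```python
import copy
from typing import List, Optional, Set, Tuple

State = List[int]


def step(state: State) -> Tuple[State, int]:
    """Toy local computation: update the state and produce an outgoing message."""
    new_state = [x + 1 for x in state]
    return new_state, sum(new_state)


def run_protected(
    initial: State,
    n_steps: int,
    checkpoint_every: int,
    faulty_steps: Optional[Set[int]] = None,
) -> Tuple[List[int], int]:
    """Run n_steps with two replicas and validate each message before 'sending' it.

    faulty_steps holds step indices at which a one-off bit flip is injected into
    the second replica's message (hypothetical fault injection, illustration only).
    Returns the list of validated messages and the number of rollbacks performed.
    """
    pending_faults = set(faulty_steps or ())
    replica_a = copy.deepcopy(initial)
    replica_b = copy.deepcopy(initial)
    # Checkpoint layout (assumed): (step index, saved state, messages sent so far).
    checkpoint = (0, copy.deepcopy(initial), [])
    sent: List[int] = []
    rollbacks = 0
    i = 0
    while i < n_steps:
        replica_a, msg_a = step(replica_a)
        replica_b, msg_b = step(replica_b)
        if i in pending_faults:
            pending_faults.discard(i)  # transient: does not recur on re-execution
            msg_b ^= 1 << 3            # flip one bit in the replica's message
        if msg_a != msg_b:
            # Detection: the replicas disagree, so the message is withheld
            # and both replicas are restored from the last checkpoint.
            rollbacks += 1
            i, saved_state, saved_sent = checkpoint
            replica_a = copy.deepcopy(saved_state)
            replica_b = copy.deepcopy(saved_state)
            sent = list(saved_sent)
            continue
        sent.append(msg_a)  # validated message is "sent"
        i += 1
        if i % checkpoint_every == 0:
            checkpoint = (i, copy.deepcopy(replica_a), list(sent))
    return sent, rollbacks
```

Calling `run_protected([0, 0], 4, 2, {2})` injects a fault at step 2; the mismatch is caught before the message leaves, the run rolls back to the checkpoint taken after step 1, and it yields the same messages as a fault-free run at the cost of one rollback. A real deployment along the lines the description suggests would run the replicas as separate processes on different cores and intercept actual message-passing calls, which this sketch does not attempt.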