Characterizing a Detection Strategy for Transient Faults in HPC

Bibliographic Details
Main authors: Montezanti, Diego Miguel; Rexachs del Rosario, Dolores; Rucci, Enzo; Luque Fadón, Emilio; Naiouf, Marcelo; De Giusti, Armando Eduardo; Feierherd, Guillermo Eugenio; Pesado, Patricia Mabel; Russo, Claudia Cecilia
Format: Book, Book chapter
Language: English
Published: Editorial de la Universidad Nacional de La Plata (EDULP), 2016
Subjects:
HPC
Online access: http://sedici.unlp.edu.ar/handle/10915/81217
Description
Summary: Handling faults is a growing concern in HPC; a greater variety of faults, higher error rates, longer detection intervals and silent faults are expected in the future. It is projected that, in exascale systems, errors will occur several times a day and will propagate, with effects ranging from process crashes to corrupted results, including undetected errors in applications that are still running. In this article, we analyze SMCV, a methodology for detecting transient faults in MPI applications. The methodology is based on software replication, and it assumes that data corruption becomes apparent when replicas produce different messages. SMCV makes it possible to obtain reliable executions with correct results or, at least, to lead the system to a safe stop. This work presents a complete characterization of the methodology, formally defining its behavior in the presence of faults and validating it experimentally, in order to show its efficacy and viability for detecting transient faults in HPC systems.
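
The core idea in the summary (replicate the computation, compare the messages the replicas produce, and stop safely on a mismatch) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `validated_send`, `SafeStop` and `fault_in_replica` are hypothetical, the replicas are plain function calls rather than MPI processes, and the fault is a simulated single-bit flip.

```python
import hashlib
from typing import Callable, Optional


class SafeStop(Exception):
    """Hypothetical signal: replicas disagree, so the run is halted."""


def digest(payload: bytes) -> str:
    # Compare fixed-size digests instead of full message contents.
    return hashlib.sha256(payload).hexdigest()


def validated_send(
    compute: Callable[[], bytes],
    fault_in_replica: Optional[int] = None,
) -> bytes:
    """Run the computation twice and release the message only if both copies match.

    fault_in_replica is an assumption for illustration only: it flips one bit
    in that replica's (non-empty) payload to mimic a transient fault.
    """
    payloads = []
    for replica_id in (0, 1):
        payload = bytearray(compute())
        if replica_id == fault_in_replica:
            payload[0] ^= 0x01  # simulated transient bit flip
        payloads.append(bytes(payload))

    if digest(payloads[0]) != digest(payloads[1]):
        # Corruption became visible as differing messages: stop safely.
        raise SafeStop("replicas produced different messages")
    return payloads[0]


if __name__ == "__main__":
    print(validated_send(lambda: b"partial result"))
    try:
        validated_send(lambda: b"partial result", fault_in_replica=1)
    except SafeStop as exc:
        print("safe stop:", exc)
```

In this sketch a fault is caught only if it changes an outgoing message, which mirrors the assumption stated in the summary; corruption that never reaches a message would go unnoticed here.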