Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems



Bibliographic Details
Main authors:	Morán, Marina; Balladini, Javier; Rexachs, Dolores; Rucci, Enzo
Format:	Article (accepted version)
Language:	English
Published:	Elsevier 2024
Subjects:	Energy saving; Fault tolerance methods; Checkpoint; Parallel applications; ACPI; DVFS; Computer and Information Sciences
Online access:	http://rdi.uncoma.edu.ar/handle/uncomaid/18119
Contributed by:	Repositorio Institucional UNCo, Universidad Nacional del Comahue

Description
Summary:	Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. As large-scale HPC systems require some fault-tolerant method, the opportunities to reduce energy consumption should be explored. In particular, rollback-recovery methods using uncoordinated checkpoints prevent all processes from re-executing when a failure occurs. In this context, it is possible to take actions to reduce the energy consumption of the nodes whose processes do not re-execute. This work is an extension of a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. In this work, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that become blocked through indirect communication with the failed process. The simulations show that the savings were negligible in the worst case, but in some scenarios it was possible to achieve significant ones; the maximum saving achieved was 90% in a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure.
