Some Issues to Consider in the Management of Energy Consumption in HPC Systems with Fault Tolerance

Bibliographic Details
Main authors: Morán, Marina; Balladini, Javier; Rexachs del Rosario, Dolores; Rucci, Enzo
Format: Conference object
Language: English
Published: 2022
Subjects:
HPC
Online access: http://sedici.unlp.edu.ar/handle/10915/140642
Description
Summary: Inquiring about different ways to reduce energy consumption during the execution of large-scale applications is essential to maintain and increase the enormous computing power achieved in HPC systems. Fault tolerance methods can have an impact on power consumption. In particular, rollback-recovery methods using uncoordinated checkpoints prevent all processes from re-executing in the event of a failure. In this context, it is possible to take actions on the nodes of the processes that do not re-execute to reduce energy consumption. In this work, we describe some issues to consider when we extend the application of energy-saving strategies beyond the nodes that communicate directly with the failed one.