Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems

Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. As large-scale HPC systems require some fault-tolerant method, the opportunities to reduce energy consumption should be explored. In pa...

Bibliographic Details
Main authors: Morán, Marina; Balladini, Javier; Rexachs, Dolores; Rucci, Enzo
Format: Article (accepted version)
Language: English
Published: Elsevier 2024
Subjects:
Online access: http://rdi.uncoma.edu.ar/handle/uncomaid/18119
Contributed by:
id I22-R178-uncomaid-18119
record_format dspace
spelling I22-R178-uncomaid-18119 2024-09-05T11:58:23Z
Affiliation: Morán, Marina. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina.
Affiliation: Balladini, Javier. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina.
Affiliation: Rexachs, Dolores. Universidad Autónoma de Barcelona. Departamento Arquitectura de Computadores y Sistemas Operativos; Spain.
Affiliation: Rucci, Enzo. Universidad Nacional de La Plata. Facultad de Informática; Argentina.
2024 2024-09-04T15:39:28Z 2024-09-04T15:39:28Z
http://rdi.uncoma.edu.ar/handle/uncomaid/18119
eng
https://doi.org/10.1016/j.jpdc.2023.104797
Attribution-NonCommercial-ShareAlike 2.5 Argentina https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
application/pdf pp. 1-36
Elsevier Journal of Parallel and Distributed Computing Volume 185, March 2024
institution Universidad Nacional del Comahue
institution_str I-22
repository_str R-178
collection Repositorio Institucional UNCo
language English
topic Energy saving
Fault tolerance methods
Checkpoint
Parallel applications
ACPI
DVFS
Computer and Information Sciences
description Improving the energy efficiency of high-performance computing (HPC) systems is currently one of the main drivers of scientific and technological research. Because large-scale HPC systems require some fault-tolerance method, the opportunities to reduce energy consumption in that setting should be explored. In particular, rollback-recovery methods based on uncoordinated checkpoints avoid the need for all processes to re-execute when a failure occurs. In this context, it is possible to take actions to reduce the energy consumption of the nodes whose processes do not re-execute. This work extends a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. In this work, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that become blocked through indirect communication with the failed process. The simulations show that the savings were negligible in the worst case, but in some scenarios significant savings were possible; the maximum saving achieved was 90% over a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure.
format Article
acceptedVersion
author Morán, Marina
Balladini, Javier
Rexachs, Dolores
Rucci, Enzo
title Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems
publisher Elsevier
publishDate 2024
url http://rdi.uncoma.edu.ar/handle/uncomaid/18119