Towards Management of Energy Consumption in HPC Systems with Fault Tolerance

High-performance computing continues to increase its computing power and energy efficiency. However, energy consumption continues to rise and finding ways to limit and/or decrease it is a crucial point in current research. For high-performance MPI applications, there are rollback recovery based faul...

Bibliographic Details
Main authors: Morán, Marina; Balladini, Javier; Rexachs del Rosario, Dolores; Rucci, Enzo
Format: Conference object
Language: English
Published: 2020
Subjects:
HPC
MPI
Online access: http://sedici.unlp.edu.ar/handle/10915/139146
id I19-R120-10915-139146
record_format dspace
institution Universidad Nacional de La Plata
institution_str I-19
repository_str R-120
collection SEDICI (UNLP)
language English
topic Computer science
Energy consumption
energy saving
Power management
Fault tolerance
uncoordinated checkpoint
HPC
Distributed memory
MPI
DVFS
ACPI
description High-performance computing continues to increase its computing power and energy efficiency. However, energy consumption continues to rise, and finding ways to limit or decrease it is a crucial point in current research. For high-performance MPI applications, there are rollback-recovery-based fault tolerance methods, such as uncoordinated checkpoints. These methods allow only some processes to roll back in the face of a failure, while the rest of the processes continue to run. In this article, we focus on the processes that continue execution, and propose a series of strategies to manage energy consumption when a failure occurs and uncoordinated checkpoints are used. We present an energy model to evaluate the strategies, and through simulation we analyze the behavior of an application under different configurations and failure times. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure.
format Conference object
author Morán, Marina
Balladini, Javier
Rexachs del Rosario, Dolores
Rucci, Enzo
author_sort Morán, Marina
title Towards Management of Energy Consumption in HPC Systems with Fault Tolerance
publishDate 2020
url http://sedici.unlp.edu.ar/handle/10915/139146
bdutipo_str Repositories
_version_ 1764820457136062465