Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems
Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. As large-scale HPC systems require some fault-tolerant method, the opportunities to reduce energy consumption should be explored. In pa...
| Main authors: | Morán, Marina; Balladini, Javier; Rexachs, Dolores; Rucci, Enzo |
|---|---|
| Format: | Article (accepted version) |
| Language: | English |
| Published: | arXiv, 2023 |
| Subjects: | Energy saving; Fault Tolerance Methods; Checkpoint; Parallel Applications; ACPI; DVFS |
| Online access: | https://rdi.uncoma.edu.ar/handle/uncomaid/19175 |
| Contributed by: |
| id |
I22-R178-uncomaid-19175 |
|---|---|
| record_format |
dspace |
| spelling |
I22-R178-uncomaid-19175 2025-12-23T14:10:45Z
Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems
Morán, Marina; Balladini, Javier; Rexachs, Dolores; Rucci, Enzo
Affiliation: Morán, Marina. Universidad Nacional del Comahue. Facultad de Informática; Argentina.
Affiliation: Balladini, Javier. Universidad Nacional del Comahue. Facultad de Informática; Argentina.
Affiliation: Rexachs, Dolores. Universitat Autónoma de Barcelona. Departamento de Arquitectura de Computadores y Sistemas Operativos; Spain.
Affiliation: Rucci, Enzo. Universidad Nacional de La Plata. Facultad de Informática; Argentina.
2023 2025-12-17T15:59:17Z 2025-12-17T15:59:17Z
Article (accepted version)
https://rdi.uncoma.edu.ar/handle/uncomaid/19175
eng
https://arxiv.org/abs/2311.06419
https://doi.org/10.1016/j.jpdc.2023.104797
Attribution-NonCommercial-ShareAlike 4.0 https://creativecommons.org/licenses/by-nc-sa/4.0/
application/pdf
arXiv
Journal of Parallel and Distributed Computing, October 2023 |
| institution |
Universidad Nacional del Comahue |
| institution_str |
I-22 |
| repository_str |
R-178 |
| collection |
Repositorio Institucional UNCo |
| language |
English |
| topic |
Energy saving; Fault Tolerance Methods; Checkpoint; Parallel Applications; ACPI; DVFS; Computer and Information Sciences; Articles |
| description |
Nowadays, improving the energy efficiency of high-performance computing
(HPC) systems is one of the main drivers in scientific and technological
research. As large-scale HPC systems require some fault-tolerant method,
the opportunities to reduce energy consumption should be explored. In
particular, rollback-recovery methods using uncoordinated checkpoints
avoid re-executing every process when a failure occurs. In this context,
it is possible to take actions to reduce the energy consumption of the
nodes whose processes do not re-execute. This work extends a previous
one, in which we proposed a series of strategies to manage energy
consumption at failure time. Here, we have enriched our simulator and the
experimentation by including non-blocking communications (with and
without system buffering) and a larger number of candidate processes to
be analyzed. We call the latter cascade analysis, because it includes
processes that get blocked by communicating indirectly with the failed
process. The simulations show that the savings were negligible in the
worst case, but in some scenarios significant savings were possible; the
maximum saving achieved was 90% in a time interval of 16 minutes. As a
result, we show the feasibility of improving energy efficiency in HPC
systems in the presence of a failure. |
| format |
Article (accepted version) |
| author |
Morán, Marina; Balladini, Javier; Rexachs, Dolores; Rucci, Enzo |
| author_sort |
Morán, Marina |
| title |
Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems |
| publisher |
arXiv |
| publishDate |
2023 |
| url |
https://rdi.uncoma.edu.ar/handle/uncomaid/19175 |