Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems
| Main authors: | |
|---|---|
| Format: | Article (accepted version) |
| Language: | English |
| Published: | Elsevier, 2024 |
| Subjects: | |
| Online access: | http://rdi.uncoma.edu.ar/handle/uncomaid/18119 |
| Summary: | Improving the energy efficiency of high-performance computing (HPC) systems is currently one of the main drivers of scientific and technological research. Because large-scale HPC systems require some fault-tolerance method, the opportunities to reduce energy consumption in that setting should be explored. In particular, rollback-recovery methods using uncoordinated checkpoints avoid re-executing every process when a failure occurs. In this context, it is possible to take actions to reduce the energy consumption of the nodes whose processes do not re-execute. This work extends a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. Here, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that get blocked through indirect communication with the failed process. The simulations show that the savings were negligible in the worst case, but in some scenarios significant savings were possible; the maximum saving achieved was 90% in a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure. |
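
The summary describes cascade analysis as including processes that block through indirect communication with the failed process. One way to picture that idea is as a reachability search over a "waits on" relation. The sketch below is only an illustration under that assumption, not the authors' simulator: the function name `cascade_candidates` and the `waits_on` input format are hypothetical.

```python
from collections import deque


def cascade_candidates(waits_on, failed):
    """Return the processes blocked directly or indirectly on `failed`.

    `waits_on` is a hypothetical input format: it maps each process id to
    the set of process ids it is currently blocked waiting for.
    """
    # Invert the relation so we can ask "who is waiting on this process?"
    waiters = {}
    for proc, targets in waits_on.items():
        for target in targets:
            waiters.setdefault(target, set()).add(proc)

    # Breadth-first search outward from the failed process.
    candidates = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for waiter in waiters.get(current, ()):
            if waiter != failed and waiter not in candidates:
                candidates.add(waiter)
                queue.append(waiter)
    return candidates
```

With process 0 failed, process 1 waiting on 0, process 2 waiting on 1, and process 4 waiting on 2, the search returns processes 1, 2, and 4; a process that waits on nobody is left out. In the terms of the summary, the returned set would be the candidate processes whose nodes could be considered for energy-saving actions.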