Fault Tolerance in Multicore Clusters. Techniques to Balance Performance and Dependability

In High Performance Computing (HPC), the demand for more performance is satisfied by increasing the number of components. The growing scale of HPC applications has brought an increase in the number of interruptions caused by hardware failures. The remarkable decrease of Mean Times Between...

Bibliographic Details
Main author: Meyer, Hugo
Format: Article, Review
Language: English
Published: 2016
Subjects:
Online access: http://sedici.unlp.edu.ar/handle/10915/52386
http://journal.info.unlp.edu.ar/wp-content/uploads/2015/10/JCST-42-Thesis-Overview-1.pdf
Contributed by:
id I19-R120-10915-52386
record_format dspace
institution Universidad Nacional de La Plata
institution_str I-19
repository_str R-120
collection SEDICI (UNLP)
language English
topic Computer Science
Fault tolerance
Parallel
description In High Performance Computing (HPC), the demand for more performance is satisfied by increasing the number of components. The growing scale of HPC applications has brought an increase in the number of interruptions caused by hardware failures. The remarkable decrease of Mean Times Between Failures (MTBF) in current systems encourages research into suitable Fault Tolerance (FT) solutions that make it possible to guarantee the successful completion of parallel applications. When executing applications on HPC systems, we aim to improve performance despite the failures that may affect those systems. Our research focuses on analyzing and reducing the impact of scalable FT techniques based on rollback-recovery (e.g., uncoordinated checkpointing). Because message logging is normally the main source of overhead in uncoordinated checkpoint approaches, we concentrate in particular on current pessimistic receiver-based message logging techniques. Taking into account the advent of multicore machines, our main contributions aim to make efficient use of the parallel environment by considering the interaction between application processes and fault tolerance tasks. The main contributions of this research are described below.
format Article
Review
author Meyer, Hugo
title Fault Tolerance in Multicore Clusters. Techniques to Balance Performance and Dependability
publishDate 2016
url http://sedici.unlp.edu.ar/handle/10915/52386
http://journal.info.unlp.edu.ar/wp-content/uploads/2015/10/JCST-42-Thesis-Overview-1.pdf
bdutipo_str Repositories