Providing resilient execution to energy simulations

The HPC world is facing the exascale challenge, i.e. building efficient supercomputers that will reach 10^18 floating point operations per second while consuming as little power as possible. To achieve such a goal, it is mandatory to design these new machines integrating from the very beginning all the components they are made of: hardware, software, and middleware.

Nevertheless, as the number of physical and virtual components increases, so will the probability of errors. It is therefore necessary to improve the fault tolerance capabilities and task scheduling of future exascale supercomputers. Both open issues are closely related, and a fundamental part of their solution is proposed in HPC4E: the creation of a checkpoint/restart mechanism capable of migrating the tasks composing parallel jobs within a distributed infrastructure, its integration with the latest resource managers, and its further deployment for increased fault tolerance and efficient task scheduling.
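The core idea of user-level checkpoint/restart can be sketched in a few lines (a toy illustration only; the file name and state layout are hypothetical, and production tools checkpoint a process at a much lower level): the application periodically serializes its state, so that a restarted task, possibly on a different node, resumes from the last checkpoint instead of from scratch.

```python
import os
import pickle

CKPT_FILE = "state.ckpt"  # hypothetical checkpoint location

def save_checkpoint(state, path=CKPT_FILE):
    """Atomically write the serialized task state to disk."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename: never leaves a half-written checkpoint

def load_checkpoint(path=CKPT_FILE):
    """Resume from the last checkpoint, or start fresh if none exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0, "result": 0.0}

def run(total_iters=100, ckpt_every=10):
    """Main loop of a stand-in 'simulation' that checkpoints periodically."""
    state = load_checkpoint()
    while state["iteration"] < total_iters:
        state["result"] += state["iteration"]  # stand-in for one simulation step
        state["iteration"] += 1
        if state["iteration"] % ckpt_every == 0:
            save_checkpoint(state)
    return state["result"]
```

If the process is killed and `run()` is invoked again, it picks up from the last checkpoint rather than iteration 0, which is exactly the property a resilient scheduler exploits when it restarts or migrates a failed task.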

In massively parallel infrastructures, a single failure in a multi-task parallel execution can lead to a major waste of resources across thousands of cores, so an efficient and robust checkpointing mechanism is a must. This is, however, far from trivial: major challenges regarding scalability, overhead, and flexibility still lack a proper solution, which makes this one of the major open challenges in the HPC world today.
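A back-of-the-envelope calculation shows why failures dominate at scale (the MTBF figure below is illustrative, not a measured value): if each node fails independently with a mean time between failures of a few years, a machine with tens of thousands of nodes sees a failure every few hours, and without checkpointing each failure discards all work done so far.

```python
# Illustrative number only: real node MTBFs vary widely between machines.
NODE_MTBF_HOURS = 5 * 365 * 24   # assume one failure per node every 5 years

def system_mtbf(num_nodes, node_mtbf=NODE_MTBF_HOURS):
    """With independent, exponentially distributed node failures,
    the whole system's MTBF shrinks linearly with the node count."""
    return node_mtbf / num_nodes

print(system_mtbf(10_000))  # a 10,000-node machine fails every ~4.4 hours
```

Under these assumptions, any simulation longer than a few hours on the full machine is more likely than not to hit a failure before it finishes.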

Furthermore, with a non-intrusive and flexible user-level checkpointing mechanism providing this whole new set of capabilities, a natural evolution is to employ it not only upon system failures, but also to migrate running tasks to more suitable computing nodes in terms of performance or locality.

HPC4E is working on lightweight and scalable dynamic task scheduling algorithms for HPC that allow tasks to be migrated during their execution, adapting to the computational demands and the status of the infrastructure. This is expected to lead to a more efficient usage of the available resources, achieving two different objectives: increased computational efficiency and reduced power consumption.
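As a toy illustration of such a scheduler (the node model and the migration rule are hypothetical; the actual HPC4E algorithms are far more elaborate), a greedy pass can repeatedly migrate the cheapest task from the busiest node to the idlest one until the load imbalance falls below a threshold:

```python
def rebalance(nodes, threshold=2):
    """nodes: mapping node name -> list of per-task loads (mutated in place).
    Returns the list of migrations performed as (task, source, target)."""
    migrations = []
    while True:
        load = {n: sum(tasks) for n, tasks in nodes.items()}
        hot = max(load, key=load.get)    # most loaded node
        cold = min(load, key=load.get)   # least loaded node
        if load[hot] - load[cold] <= threshold or not nodes[hot]:
            return migrations            # balanced enough: stop
        task = min(nodes[hot])           # cheapest task is cheapest to move
        if task >= load[hot] - load[cold]:
            return migrations            # moving it would not reduce the gap
        nodes[hot].remove(task)
        nodes[cold].append(task)
        migrations.append((task, hot, cold))
```

Each accepted migration strictly reduces the spread of the load distribution, so the loop always terminates; a real scheduler would additionally weigh the cost of the checkpoint/restart cycle against the expected gain.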

Here, since the energy field is so diverse, different kinds of applications will be supported: serial, parallel, shared-memory, and hybrid ones. Moreover, given the increasing presence of heterogeneous environments, the proposed solutions will be adapted to supercomputers combining traditional CPUs with accelerators such as GPUs or Xeon Phis.

To accomplish this objective, mechanisms for the transparent migration of the tasks composing parallel applications inside an HPC cluster have been created; on top of them, a homogeneous interface to migrate both serial and parallel tasks has been defined and implemented as part of the Slurm resource manager. This interface will be used by a new generation of scheduling algorithms with different objectives: maximizing efficiency, enhancing stability, and optimizing the power consumption of the infrastructure through a wise mapping of running and queued tasks onto the physical resources.
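What "homogeneous interface" means can be imagined along these lines (a pure sketch; every name below is hypothetical and does not correspond to the actual Slurm plugin API): a single migrate call built on checkpoint and restart, identical from the scheduler's point of view whether the underlying task is serial or parallel.

```python
class Task:
    """Minimal task record: the scheduler only sees its id, kind and location."""
    def __init__(self, task_id, kind, node):
        assert kind in ("serial", "parallel")
        self.task_id, self.kind, self.node = task_id, kind, node

def checkpoint(task):
    # Stand-in for the real mechanism: e.g. a user-level checkpointer for
    # serial tasks, a coordinated protocol for the ranks of parallel ones.
    return {"task_id": task.task_id, "kind": task.kind}

def restart(image, node):
    # Recreate the task from its checkpoint image on the target node.
    return Task(image["task_id"], image["kind"], node)

def migrate(task, target_node):
    """Homogeneous entry point: checkpoint, move, restart --
    the same call for both serial and parallel tasks."""
    image = checkpoint(task)
    return restart(image, target_node)

job = Task("sim-42", "parallel", "node01")
moved = migrate(job, "node07")
print(moved.node)  # the task now runs on node07
```

Hiding the serial/parallel distinction behind one call is what lets the scheduling algorithms mentioned above treat the whole workload uniformly.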

This way, energy simulations will be resilient and able to tackle more ambitious problems.

Manuel Rodríguez-Pascual, José A. Moríñigo, and Rafael Mayo-García - CIEMAT


This article also appeared in the LinkedIn Group.

Other LinkedIn articles:

Wind energy in HPC4E
Investigation of flame structure of biomass-derived gaseous fuels
Finding the "not so easy" oil
How supercomputing can help improving the energy sector
Effects of fuel composition on biogas combustion in premixed laminar flames
New generation subsurface imaging gets a boost from HPC
Novel Hybridizable Discontinuous Galerkin method paves the way to tackle realistic 3D problems in seismic imaging
Improving short-range wind intensity prediction based on multimodel meteorological ensemble forecasts and Genetic Programming
Biogas Utilisation for Sustainable Power Generation
Preparing the oil and gas industry for the Exaflop era
Dynamical and statistical high resolution downscaling approaches for the surface wind
Innovative multiscale numerical algorithms for highly heterogeneous media extended to seismic problems
Do we really need exascale computers? Geophysicists say yes, we do