The HPC world is facing the exascale challenge: building efficient supercomputers that reach 10^18 floating point operations per second while consuming as little power as possible. To achieve this goal, these new machines must be designed with all their components integrated from the very beginning: hardware, software, and middleware.
Nevertheless, as the number of physical and virtual components increases, so does the probability of errors. The fault tolerance capabilities and task scheduling of future exascale supercomputers therefore need to improve. Both open issues are closely related, and a fundamental part of their solution is proposed in HPC4E: the creation of a checkpoint/restart mechanism capable of migrating the tasks that compose parallel jobs inside a distributed infrastructure, its integration with the latest resource managers, and its deployment for increased fault tolerance and more efficient task scheduling.
In massively parallel infrastructures, a single failure in a multi-task parallel execution can waste the work of thousands of cores, so an efficient and robust checkpointing mechanism is a must. This is far from trivial, however: major challenges regarding scalability, overhead, and flexibility still lack a proper solution. Addressing them is probably one of the most pressing open problems in the HPC world today.
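To make the checkpoint/restart idea concrete, here is a minimal sketch in Python. It is an application-level illustration only, not the transparent user-level mechanism developed in HPC4E; the function names, the pickle-based file format, and the `fail_at` parameter (used to simulate a node failure) are all assumptions made for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(state, handle)
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as handle:
        return pickle.load(handle)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    fail_at is a hypothetical knob for this demo: it raises an error at
    that step to simulate a failure in the middle of the computation.
    """
    # Resume from the last checkpoint if there is one, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If a run fails, calling `run` again with the same checkpoint path repeats only the steps performed since the last checkpoint. The checkpoint interval trades overhead (more frequent writes) against lost work (more steps to redo), which is one face of the overhead challenge mentioned above.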
Furthermore, a non-intrusive and flexible user-level checkpointing mechanism provides a whole new set of capabilities. A natural evolution is to employ it not only after system failures, but also to migrate running tasks to more suitable computing nodes (in terms of performance or locality).
HPC4E is working on implementing lightweight and scalable dynamic task scheduling algorithms for HPC that allow tasks to be migrated during their execution to adapt to the computational demands and status of the infrastructure. This is expected to lead to a more efficient use of the available resources, achieving two objectives: increased computational efficiency and reduced power consumption.
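As an illustration of how task migration could serve the power objective, the following Python sketch plans a consolidation: it tries to empty lightly loaded nodes by moving their tasks onto busier ones, so that the emptied nodes could be idled. This is a hypothetical greedy heuristic written for this article, not one of the HPC4E algorithms and not a Slurm API; `Node`, `plan_consolidation`, and the core-count model are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cores: int
    tasks: dict = field(default_factory=dict)  # task id -> cores in use

    @property
    def used(self):
        return sum(self.tasks.values())

    @property
    def free(self):
        return self.cores - self.used


def plan_consolidation(nodes):
    """Return (task, source, target) migrations that empty lightly loaded nodes.

    Greedy heuristic: visit nodes from least to most loaded and drain a node
    only if every one of its tasks fits on another node that is already busy.
    """
    by_name = {n.name: n for n in nodes}
    receivers = set()  # nodes that accepted tasks are not drained afterwards
    moves = []
    for source in sorted(nodes, key=lambda n: n.used):
        if not source.tasks or source.name in receivers:
            continue
        # Free cores on candidate targets: other nodes that already run tasks.
        free = {n.name: n.free for n in nodes if n is not source and n.tasks}
        plan = []
        for task, cores in sorted(source.tasks.items(), key=lambda kv: -kv[1]):
            fitting = [name for name, room in free.items() if room >= cores]
            if not fitting:
                plan = None  # cannot empty this node, leave it untouched
                break
            # Best fit: the target with the least remaining room.
            target = min(fitting, key=lambda name: free[name])
            free[target] -= cores
            plan.append((task, target, cores))
        if plan is None:
            continue
        for task, target, cores in plan:
            by_name[target].tasks[task] = cores
            del source.tasks[task]
            receivers.add(target)
            moves.append((task, source.name, target))
    return moves
```

A real scheduler would also have to weigh the cost of each migration (checkpoint size, network transfer, restart time) against the expected savings, which this sketch ignores.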
Because the energy field is so diverse, different kinds of applications will be supported: serial, parallel, shared-memory, and hybrid ones. Moreover, given the increasing presence of heterogeneous environments, the proposed solutions will be adapted to supercomputers that combine traditional CPUs with accelerators such as GPUs or Xeon Phis.
To accomplish this objective, mechanisms for the transparent migration of the tasks that compose parallel applications inside an HPC cluster have been created. A homogeneous interface to migrate both serial and parallel tasks has then been defined and implemented as part of the Slurm resource manager. This interface will be used by a new generation of scheduling algorithms with different objectives: maximizing efficiency, enhancing stability, and optimizing the power consumption of the infrastructure by carefully mapping running and queued tasks onto the physical resources.
This way, energy simulations will be resilient and able to tackle more ambitious problems.
Manuel Rodríguez-Pascual, José A. Moríñigo, and Rafael Mayo-García - CIEMAT
This article also appeared in the LinkedIn Group. Join us!