Highlights

Design and implementation of asynchronous iterative linear solvers, the most promising of which, aPCG, delivers:

  • State-of-the-art parallel implementation using one-sided MPI
  • Testing on the EuroHPC machines LUMI and MareNostrum
  • Up to a 3.1x improvement in time-to-solution and a 1.6x improvement in cost-to-solution

Schematic execution timeline of a simplified algorithm using two-sided (top) MPI communication. The algorithm is representative of most linear algebra algorithms in that it relies on essential building blocks such as sparse matrix-vector products (SpMV) and global sums.

Challenge

Asynchronous algorithms are promising methods for making efficient use of exascale machines, since modern linear solver technology is often constrained at scale by communication costs. Designing genuinely asynchronous algorithms with lower communication requirements, in which computing cores are almost never idle, is challenging. To achieve this goal, both the computations in the iteration loop and the termination criterion have to be specified asynchronously. The course of an asynchronous simulation is non-deterministic, which makes it difficult to write down the algorithm in an unambiguous, easy-to-read (and thus shareable) description and to reason about its convergence properties. Any implementation has to reflect the asynchronicity of the designed algorithm. This can be achieved with MPI, the de facto standard message-passing library on large-scale computers. However, the corresponding features were added to the MPI standard relatively recently, and their subtle behaviour may depend on the specific MPI implementation and platform. Prototyping such methods must therefore be exercised on a variety of representative test cases, both in terms of size and physics.
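To illustrate the kind of MPI feature involved, the following minimal sketch (not the project's code) uses passive-target, one-sided communication: each rank exposes one slot per peer in an RMA window, deposits its latest local value with MPI_Put, and the target simply reads whatever has arrived, without ever posting a matching receive. The window layout, the single "right neighbour", and the final barrier are purely for the demo.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // RMA window with one double slot per possible origin rank (demo layout).
    double* slots = nullptr;
    MPI_Win win;
    MPI_Win_allocate((MPI_Aint)size * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &slots, &win);
    for (int i = 0; i < size; ++i) slots[i] = 0.0;
    MPI_Barrier(MPI_COMM_WORLD);            // all windows initialised before use

    // Passive-target epoch kept open for the whole (asynchronous) iteration loop.
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    double local_value = 1.0 + rank;        // e.g. a local residual contribution
    int right = (rank + 1) % size;          // one neighbour, for brevity
    int left  = (rank - 1 + size) % size;

    // Deposit the latest local value into the neighbour's window; the
    // neighbour never has to post a matching receive.
    MPI_Put(&local_value, 1, MPI_DOUBLE, right, /*target_disp=*/rank,
            1, MPI_DOUBLE, win);
    MPI_Win_flush(right, win);              // complete the put at the target

    // For this toy demo only: wait so the printed value is deterministic.
    // A real asynchronous solver would simply use whatever has arrived.
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_sync(win);                      // make remotely written data visible
    std::printf("rank %d sees %.1f from rank %d\n", rank, slots[left], left);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}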


Research Topic

Linear solvers are without doubt one of the key building blocks of numerical software. Their performance is a limiting factor in the quest to reduce both the time-to-solution and the energy-to-solution of many applications. Current linear solver technology is often limited at scale by the cost of communication: reading two floating-point numbers from memory costs several orders of magnitude more than adding them. It is therefore paramount to reduce the need for communication as much as possible, even at the expense of extra computations. This project demonstrates several novel asynchronous linear solvers that require neither global sums nor synchronised pairwise communication.
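For context, the sketch below shows the global sum that classical CG/PCG performs at every iteration and that the asynchronous solvers avoid: a blocking MPI_Allreduce in which all ranks must meet before any of them can proceed. The function global_dot is illustrative only and not part of the project's code.

#include <mpi.h>
#include <cstddef>
#include <vector>

// Distributed dot product as used in a classical (synchronous) CG iteration.
double global_dot(const std::vector<double>& x, const std::vector<double>& y,
                  MPI_Comm comm) {
    double local = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) local += x[i] * y[i];

    double global = 0.0;
    // Blocking collective: a synchronisation point for every rank, every iteration.
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}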


Solution

In order to achieve asynchronicity, we chose a distributed-data approach, which is natural for communication-avoiding algorithms. The conjugate gradient (CG) method is reformulated as a collection of weakly coupled local CG methods, and all the synchronous parts of the method are then relaxed. Stabilisation techniques borrowed from non-linear CG methods make the algorithm robust. To formulate the algorithms, we relied as much as possible on the well-established formalism of domain decomposition methods and added the extra notions needed to express asynchronicity. To obtain a genuinely asynchronous implementation, we used MPI one-sided communication. The work was carried out within the high-level scientific computing codes OpenFOAM and FreeFEM, which provide great flexibility in the generation of test problems.
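The following self-contained sketch shows only the local view of the "weakly coupled local CG" idea: one partition iterates its own CG on its local operator, with a placeholder for the boundary contribution that, in the actual solver, would be refreshed asynchronously through one-sided MPI. The toy tridiagonal operator apply_A, the coupling vector, and the tolerance are assumptions for illustration; the real coupling, preconditioning, and non-linear-CG-style stabilisation of aPCG are not reproduced here.

#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

// Local operator: a small SPD tridiagonal matrix standing in for one
// partition's subdomain matrix.
Vec apply_A(const Vec& x) {
    const std::size_t n = x.size();
    Vec y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = 2.0 * x[i];
        if (i > 0)     y[i] -= x[i - 1];
        if (i + 1 < n) y[i] -= x[i + 1];
    }
    return y;
}

// Purely local dot product: no global reduction is needed in the local loop.
double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

int main() {
    const std::size_t n = 8;
    Vec x(n, 0.0), b(n, 1.0);

    // Placeholder boundary contribution from neighbouring partitions; in the
    // asynchronous solver this would be read from an RMA window whenever new
    // data happens to have arrived.
    Vec coupling(n, 0.0);

    Vec r = b;                                   // r = b + coupling - A*x, with x = 0
    for (std::size_t i = 0; i < n; ++i) r[i] += coupling[i];
    Vec p = r;
    double rr = dot(r, r);

    for (int it = 0; it < 100 && std::sqrt(rr) > 1e-10; ++it) {
        Vec Ap = apply_A(p);
        double alpha = rr / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
        // <-- here the real algorithm would fold in freshly arrived interface
        //     data and apply its stabilisation before continuing
    }
    std::printf("local residual norm: %.3e\n", std::sqrt(rr));
    return 0;
}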


Main idea for “decoupling” partitions. Boxes denote partitions. Classical algorithm using a single, global Krylov space (left) and the novel approach using one Krylov space per partition (right).