Highlights

  • A suite of new implementations of the conjugate gradient (CG) solver was created, encompassing three types of communication libraries that target multi-GPU platforms based on AMD and NVIDIA GPUs
  • The new CG implementations outperform their counterparts in the state-of-the-art linear-algebra library PETSc by speedup factors between 1.04 and 3.34 when run on several GPU-based supercomputers, including LUMI, Leonardo and MareNostrum5, using selected sparse matrices from SuiteSparse as well as representative huge-scale sparse matrices that arise in computational science
  • Several of the new multi-GPU implementations demonstrated scalability on more than 1000 GPUs, owing to lower inter-GPU communication overhead and synchronization cost
  • Real-world simulators of cardiac electrophysiology and porous-media flow were chosen to demonstrate the performance benefits of the new multi-GPU implementations on large numbers of GPUs

Figure 1: Solver time (in seconds) for 100 CG iterations on LUMI using up to 1024 AMD MI250X GCDs. Times are shown for 1) PETSc, 2) aCG with GPU-aware MPI, and 3) aCG with RCCL. The linear system in this example is from a finite element method for solving Poisson’s equation on a 3D unstructured, tetrahedral mesh of a simple geometry.
Figure 2: Solver time (in seconds) for 100 CG iterations on Leonardo using up to 1024 NVIDIA A100 GPUs. Times are shown for 1) PETSc, 2) aCG with GPU-aware MPI, 3) aCG with NCCL, and 4) aCG with NVSHMEM (device-initiated communication). The linear system is the same as in Figure 1.

Challenge

Communication and synchronization latencies that are prevalent in large-scale GPU-based supercomputers constitute a major challenge for software that aims to use a large number of GPUs efficiently. The situation is particularly severe when parallelizing iterative solvers of sparse linear systems, because the arithmetic intensity of these solvers is low, making the impact of inter-GPU communication overhead more pronounced than for other types of computation. At the same time, new communication libraries that specifically target multi-GPU platforms, such as NVIDIA’s NCCL and NVSHMEM libraries, have emerged. Although these libraries were originally intended for machine learning workloads, their applicability to parallelizing iterative solvers warrants a careful investigation, with respect to both performance and programmability. This innovation study chose to investigate the CG solver, the best-known member of the class of Krylov subspace solvers. Specifically, three types of communication libraries were used to implement multi-GPU CG solvers: GPU-aware MPI, NVIDIA’s NCCL or AMD’s RCCL, and NVIDIA’s NVSHMEM.
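
To make the communication pattern concrete: each CG iteration requires at least two global dot products, each of which becomes an all-reduce across all participating GPUs, in addition to a halo exchange for the sparse matrix-vector product. The sketch below shows how such a distributed dot product is typically expressed with GPU-aware MPI; the function and variable names (distributed_dot, local_n, and so on) are illustrative and not taken from the aCG code.

    /* Sketch: distributed dot product, needed twice per CG iteration.
       Assumes a GPU-aware (CUDA-aware) MPI build, so device pointers can be
       handed to MPI_Allreduce directly, avoiding a CPU-GPU copy of the
       partial result. */
    #include <mpi.h>
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    /* d_x, d_y are device arrays holding this rank's part of the two vectors;
       d_local and d_global are single-element device buffers. */
    void distributed_dot(cublasHandle_t handle, int local_n,
                         const double *d_x, const double *d_y,
                         double *d_local, double *d_global, MPI_Comm comm)
    {
        /* Keep the partial result on the GPU. */
        cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
        cublasDdot(handle, local_n, d_x, 1, d_y, 1, d_local);
        /* Make sure the dot product has completed before MPI reads the buffer. */
        cudaDeviceSynchronize();
        /* Global reduction across all GPUs: one network latency per call, and
           the CG recurrence cannot continue until the result is available. */
        MPI_Allreduce(d_local, d_global, 1, MPI_DOUBLE, MPI_SUM, comm);
    }

Because each CG step depends on the result of the preceding reduction, these latencies are hard to hide behind computation, which is why they dominate at large GPU counts.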


Research Topic

The innovation study investigated various communication libraries that target multi-GPU clusters, in the context of implementing the conjugate gradient (CG) method for iteratively solving huge-scale, sparse systems of linear algebraic equations. The purpose is to provide several multi-GPU implementations of CG, so that domain scientists can choose the best-performing implementation for the GPU architecture, network topology, and sparse linear system at hand.


Solution

Various code optimizations were adopted in the new multi-GPU implementations of CG to ensure their parallel scalability on GPU clusters. By eliminating unnecessary CPU-GPU data transfers and reducing the number of synchronization points, our implementations using GPU-aware MPI outperformed their counterparts from the state-of-the-art linear-algebra library PETSc. With NCCL/RCCL, we reduced synchronization further, realizing performance benefits over GPU-aware MPI. For the one-sided communication provided by NVSHMEM, we developed a ‘monolithic’ implementation that directly uses GPU-initiated communication and device-side synchronization, so that the CPU host is minimally involved and host-induced overheads are reduced. We also hand-coded the various computational steps inside the monolithic code, because vendor libraries such as cuSPARSE currently cannot be combined with GPU-initiated communication.
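
To give a flavour of how the variants differ, two simplified sketches follow; they illustrate the ideas described above under stated assumptions and are not excerpts from the aCG code. With NCCL (or, identically, RCCL on AMD GPUs), the global reduction behind each dot product can be enqueued on a GPU stream, so the host does not have to block on every reduction:

    /* Sketch: stream-ordered reduction with NCCL; RCCL exposes the same API on
       AMD GPUs.  d_local and d_global are single-element device buffers; the
       names are illustrative. */
    #include <nccl.h>
    #include <cuda_runtime.h>

    void dot_allreduce(const double *d_local, double *d_global,
                       ncclComm_t comm, cudaStream_t stream)
    {
        /* The collective is ordered on 'stream' like any other kernel, so no
           host synchronization is required here; later kernels on the same
           stream can consume *d_global directly.  The host only synchronizes
           when it needs the scalar, e.g. to test the convergence criterion. */
        ncclAllReduce(d_local, d_global, 1, ncclDouble, ncclSum, comm, stream);
    }

With NVSHMEM, communication can be issued from within a GPU kernel itself. The following heavily simplified sketch shows a device-initiated push of boundary values to one neighbour; the buffer names, the single-neighbour layout and the launch details are assumptions made for illustration, whereas the actual monolithic implementation fuses all CG steps into GPU-resident code with device-side synchronization:

    /* Sketch: device-initiated communication with NVSHMEM.  d_halo_send and
       d_halo_recv are symmetric buffers allocated with nvshmem_malloc; the
       kernel is assumed to be launched collectively on all PEs. */
    #include <nvshmem.h>

    __global__ void push_halo(const double *d_halo_send, double *d_halo_recv,
                              size_t count, int neighbour_pe)
    {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            /* One thread puts this GPU's boundary values directly into the
               neighbour's receive buffer - no host involvement at all. */
            nvshmem_double_put_nbi(d_halo_recv, d_halo_send, count, neighbour_pe);
            nvshmem_quiet();   /* wait for the outstanding put to complete */
            /* A device-side barrier or signal is still needed before the
               neighbour may read d_halo_recv. */
        }
    }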