Snapshots from cardiac electrophysiology simulations on a patient-specific biventricular geometry. The simulator is based on solving the monodomain equations, and the CG solver is often the most time-consuming component of the underlying numerical method.
Time (in seconds) spent in the CG solver during a cardiac electrophysiology simulation (shown above). Up to 8 compute nodes were used, each with 2 Intel Xeon Platinum 8360Y (36-core Ice Lake) CPUs and 8 Nvidia A100 GPUs. The CPU implementation uses MPI for communication, whereas the GPU implementation may use either MPI or NCCL.

Study Results

A suite of multi-GPU implementations of the CG solver, covering both the standard algorithm and a pipelined variant, was developed as open-source software. The suite supports three types of communication libraries and runs on both Nvidia and AMD GPUs.
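To make the communication requirements concrete, the sketch below shows a minimal, sequential (single-process) version of the standard CG iteration; it is an illustration under simplifying assumptions, not code from the released suite. The comments mark the two operations that dominate communication in a distributed multi-GPU run: the sparse matrix-vector product, which needs a halo exchange of vector entries shared with neighbouring processes, and the inner products, which each need a global reduction. The pipelined CG variant rearranges the recurrences so that the reductions of an iteration can be merged and overlapped with the matrix-vector product.

```cpp
// Minimal sequential CG on a small, diagonally dominant SPD tridiagonal
// system (illustration only).  Comments mark where a distributed multi-GPU
// version needs communication.
#include <cmath>
#include <cstdio>
#include <vector>

// Stand-in sparse matrix-vector product (SpMV).  In a distributed run,
// this is where neighbouring ranks exchange halo data.
static void spmv(const std::vector<double>& x, std::vector<double>& y) {
    const int n = static_cast<int>(x.size());
    for (int i = 0; i < n; ++i) {
        double v = 4.0 * x[i];
        if (i > 0)     v -= x[i - 1];
        if (i < n - 1) v -= x[i + 1];
        y[i] = v;
    }
}

// Local dot product.  A distributed run follows this with a global
// reduction (e.g. an allreduce) over all processes.
static double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

int main() {
    const int n = 1000;
    std::vector<double> x(n, 0.0), b(n, 1.0), r = b, p = r, Ap(n);

    double rr = dot(r, r);                    // reduction #1
    for (int it = 0; it < 1000 && std::sqrt(rr) > 1e-10; ++it) {
        spmv(p, Ap);                          // halo exchange in a parallel run
        const double alpha = rr / dot(p, Ap); // reduction #2
        for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        const double rr_new = dot(r, r);      // reduction #1 of the next iteration
        const double beta = rr_new / rr;
        for (int i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    std::printf("final residual norm: %.3e\n", std::sqrt(rr));
    return 0;
}
```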

Performance models were developed for the computational kernels in the CG method to verify that the single-GPU performance of the solver agrees with theoretical expectations. The inter-GPU communication overhead, i.e., peer-to-peer data exchanges and allreduce operations, was quantified using standard micro-benchmarks and simple performance models. Time measurements of the new CG implementations were collected on several European supercomputers, showing better performance than state-of-the-art GPU implementations as well as scalable performance on more than 1000 GPUs.
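As an illustration of how such kernel performance models work, the sketch below estimates bandwidth-bound execution times for the main CG kernels from simple byte counts. The nonzeros-per-row figure, the byte-counting assumptions, and the assumed achievable memory bandwidth are placeholders chosen for illustration, not values taken from the study.

```cpp
// A minimal sketch of a bandwidth-bound performance model for the main CG
// kernels (CSR SpMV, dot product, AXPY).  All numbers below are illustrative
// assumptions; the study's actual models and hardware figures may differ.
#include <cstdio>

int main() {
    const double n   = 31.0e6;   // unknowns (size of the cardiac test case)
    const double nnz = 16.0 * n; // assumed average nonzeros per row
    const double bw  = 1.6e12;   // assumed achievable GPU memory bandwidth [B/s]

    // CSR SpMV: 8-byte value + 4-byte column index per nonzero, plus the
    // x-entry it touches (pessimistically assumed uncached), plus a row
    // pointer and a result entry per row.
    const double spmv_bytes = nnz * (8 + 4 + 8) + n * (4 + 8);
    // Dot product: read two vectors.  AXPY: read two vectors, write one.
    const double dot_bytes  = 2 * 8 * n;
    const double axpy_bytes = 3 * 8 * n;

    std::printf("predicted SpMV time: %.3f ms\n", 1e3 * spmv_bytes / bw);
    std::printf("predicted dot  time: %.3f ms\n", 1e3 * dot_bytes  / bw);
    std::printf("predicted AXPY time: %.3f ms\n", 1e3 * axpy_bytes / bw);
    return 0;
}
```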

To test the parallel scalability of the multi-GPU implementations in large-scale, real-world scenarios, a simulator of cardiac electrophysiology and a simulator of porous-media flow were used, with their original multi-CPU CG solvers replaced by the newly developed multi-GPU code. For example, on a sparse matrix arising from an unstructured tetrahedral mesh with 31 million unknowns in the cardiac simulator, the best multi-GPU performance on 64 Nvidia A100 GPUs was 13 times faster than on 16 Intel Xeon Platinum 8360Y (36-core Ice Lake) CPUs. Moreover, the new multi-GPU implementations achieved a speedup of 1.29x over their multi-GPU counterparts in the state-of-the-art linear-algebra library PETSc.

Three types of communication libraries were adopted, namely GPU-aware MPI, Nvidia's NCCL or AMD's RCCL, and OpenSHMEM-based one-sided communication libraries. These libraries were used to implement both the standard CG algorithm and a pipelined version of CG, with the goal of efficiently using large clusters of Nvidia or AMD GPUs. Care was also taken to remove unnecessary synchronization points and CPU-GPU data transfers, as sketched below.
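The sketch below, which is not taken from the released suite, illustrates the basic idea for the global reductions: with a GPU-aware MPI, device pointers are passed directly to MPI_Allreduce, while the NCCL path (here bootstrapped over MPI and guarded by a hypothetical USE_NCCL macro) performs the same reduction on a CUDA stream. In both cases the reduction operates on GPU-resident data, so no explicit CPU-GPU staging of the operands is required.

```cpp
// A minimal sketch (not the study's code) of a global dot-product reduction
// on GPU-resident data, via GPU-aware MPI or, if USE_NCCL is defined, NCCL.
#include <cstdio>
#include <cuda_runtime.h>
#include <mpi.h>
#ifdef USE_NCCL
#include <nccl.h>
#endif

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    cudaSetDevice(rank % 8);  // assumes at most 8 GPUs per node; a real code
                              // would map GPUs via the node-local rank

    // Each rank holds one partial dot product, already computed on its GPU.
    double partial = 1.0 + rank, *d_partial = nullptr, *d_sum = nullptr;
    cudaMalloc(&d_partial, sizeof(double));
    cudaMalloc(&d_sum, sizeof(double));
    cudaMemcpy(d_partial, &partial, sizeof(double), cudaMemcpyHostToDevice);

#ifdef USE_NCCL
    // NCCL path: bootstrap a NCCL communicator over MPI, reduce on a stream.
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);
    ncclAllReduce(d_partial, d_sum, 1, ncclDouble, ncclSum, comm, 0);
    cudaStreamSynchronize(0);
    ncclCommDestroy(comm);
#else
    // GPU-aware MPI path: device pointers go straight into MPI_Allreduce.
    MPI_Allreduce(d_partial, d_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
#endif

    double sum = 0.0;
    cudaMemcpy(&sum, d_sum, sizeof(double), cudaMemcpyDeviceToHost);
    if (rank == 0) std::printf("global dot product over %d ranks: %f\n", nranks, sum);

    cudaFree(d_partial);
    cudaFree(d_sum);
    MPI_Finalize();
    return 0;
}
```

A corresponding RCCL path on AMD GPUs, or an OpenSHMEM-based one-sided variant, would follow the same pattern of keeping the reduction operands on the device.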


Benefits

The suite of new CG implementations developed in this innovation study enables a systematic investigation of the three types of state-of-the-art communication libraries that target multi-GPU systems. Their respective advantages and disadvantages were studied quantitatively. In particular, one-sided communication as provided by OpenSHMEM-based libraries, and its application to implementing Krylov subspace methods (with CG as the best-known example), is a topic that has so far been only scarcely researched by the HPC community. The findings of this study can thus provide useful recommendations to the maintainers and developers of modern linear-algebra software libraries.

The parallel efficiency of the new CG implementations was studied on several cutting-edge European supercomputers. We showed that the new implementations are faster than the GPU implementations in a state-of-the-art software library. Moreover, the new CG implementations using one-sided communication achieved scalable performance on more than 1000 GPUs. These results may encourage more adoption of GPU computing in various branches of computational science.


Partners

Koç University (Turkey) provided HPC expertise. The PI was Prof. Didem Unat, together with Sinan Ekmekçibaşı, Doğan Sağbili, and Emre Düzakın as participants in the study.
Simula Research Laboratory (Norway) provided both HPC expertise and domain knowledge of computational physiology, as well as a CPU implementation of a real-world demonstrator that simulates the electrophysiology of the heart. The PI was Prof. Xing Cai, together with Dr. James Trotter and Dr. Johannes Langguth as participants in the study.

Contact

Name: Didem Unat, Xing Cai

Institution: Koç University, Simula Research Laboratory

Email Address: dunat@ku.edu.tr, xingca@simula.no