Highlights
- Increased parallelism in solver methods to ensure efficient GPU utilization
- Reduced latency and synchronization costs through GPU-initiated communication
Keywords: Aeronautics, Automotive

Challenge
The transition from traditional homogeneous CPU-based architectures to heterogeneous systems with accelerators such as GPUs presents new challenges in computational science. Many of today’s computational fluid dynamics (CFD) solvers were originally developed for serial or CPU-parallel environments and adapted to distributed memory systems using MPI. These adaptations often involved compromises in numerical accuracy or efficiency to better fit the parallelization paradigm. On heterogeneous architectures, leveraging peak performance and bandwidth requires substantial programming effort, and the adaptation of numerical methods becomes even more complex. Existing solvers struggle to maintain strong scaling performance at smaller problem sizes due to communication overhead and inefficient GPU usage. Novel formulations are needed to both expose more parallelism and offload communication responsibilities from the CPU, enabling scalable CFD simulations at exascale.
Research Topic
Computational Fluid Dynamics (CFD) plays a central role in the numerical simulation of fluid flows by solving the discretized forms of partial differential equations on high-performance computing systems. These simulations enable virtual experimentation across a broad spectrum of real-world applications, including industrial design, aerodynamics, and environmental modelling. As a cornerstone of modern engineering and scientific workflows, CFD relies on scalable algorithms and software frameworks capable of exploiting the massive computational power offered by exascale platforms.
In this project, all algorithmic developments were prototyped and implemented in Neko (https://neko.cfd), a modern, Fortran-based, high-order spectral element framework for simulating turbulent flows. Neko is designed for portability and performance, featuring an object-oriented architecture and support for diverse hardware backends, including CPUs, GPUs, and vector processors, with limited FPGA support. The framework has demonstrated strong scaling on some of the world's most advanced HPC systems. To achieve high-performance execution and communication, the project made use of CUDA/HIP for GPU programming, OpenMP for shared-memory parallelism, and NVSHMEM and NCCL for efficient GPU-initiated communication and interconnect utilization.
Solution
The STRAUSS innovation study developed a novel task-parallel multigrid solver for spectral element CFD applications. The new solver improves GPU utilization by increasing the available parallelism through a hybrid approach: OpenMP multithreading on the CPU is combined with concurrent GPU execution on multiple CUDA/HIP streams. This allows coarse-grid and multigrid-level operations to overlap, since the additive nature of the preconditioner makes them independent of one another.
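The overlap can be illustrated with a minimal CUDA sketch rather than Neko's actual solver code: two placeholder kernels (coarse_solve and level_smoother), the buffer layout, and the launch configuration below are hypothetical, and only the structure of the OpenMP-plus-streams pattern reflects the approach described above.

```cuda
// Minimal sketch: overlapping the independent parts of an additive multigrid
// preconditioner using OpenMP host threads and CUDA streams.
// coarse_solve and level_smoother are placeholder kernels, not Neko code.
#include <cuda_runtime.h>
#include <omp.h>

__global__ void coarse_solve(float *xc, const float *rc, int nc) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nc) xc[i] = rc[i];            /* stand-in for the coarse-grid solve */
}

__global__ void level_smoother(float *x, const float *r, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 0.5f * r[i];       /* stand-in for a fine-level smoother */
}

void additive_mg_cycle(float *x, const float *r, int n,
                       float *xc, const float *rc, int nc,
                       cudaStream_t s_fine, cudaStream_t s_coarse) {
    // Each OpenMP section drives its own CUDA stream, so the coarse-grid
    // solve and the fine-level smoothing execute concurrently on the GPU.
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            coarse_solve<<<(nc + 255) / 256, 256, 0, s_coarse>>>(xc, rc, nc);
            cudaStreamSynchronize(s_coarse);
        }
        #pragma omp section
        {
            level_smoother<<<(n + 255) / 256, 256, 0, s_fine>>>(x, r, n);
            cudaStreamSynchronize(s_fine);
        }
    }
    // Because the preconditioner is additive, the two contributions are simply
    // summed afterwards (omitted here); neither launch waits for the other.
}
```

Because the contributions are combined additively, neither launch has to wait for the other, which is what keeps the GPU occupied even when individual multigrid levels are too small to fill the device on their own.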
To further address communication latency, the solution integrates GPU-initiated communication using NVSHMEM and NCCL. These technologies decouple communication from the host CPU, reducing synchronization overhead and allowing fine-grained communication to be managed directly on the device. This leads to improved scalability and performance on modern HPC platforms, enabling efficient execution at realistic problem sizes.
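The device-initiated pattern can be sketched with a small NVSHMEM example; this is a simplified stand-in rather than Neko's gather-scatter kernels, and the exchange_halo kernel, ring-neighbour choice, and buffer sizes are illustrative assumptions.

```cuda
// Minimal sketch of GPU-initiated halo exchange with NVSHMEM.
// exchange_halo and the buffer layout are illustrative, not Neko's kernels.
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void exchange_halo(float *recv_buf, const float *send_buf,
                              int n, int peer) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // One-sided, device-initiated put: the value is written directly into
        // the symmetric recv_buf on the neighbouring PE, with no host involvement.
        nvshmem_float_p(&recv_buf[i], send_buf[i], peer);
    }
}

int main(void) {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int peer = (mype + 1) % npes;          /* simple ring of PEs for the sketch */
    const int n = 1024;

    /* Symmetric buffers allocated from the NVSHMEM heap (remotely accessible). */
    float *send_buf = (float *) nvshmem_malloc(n * sizeof(float));
    float *recv_buf = (float *) nvshmem_malloc(n * sizeof(float));
    cudaMemset(send_buf, 0, n * sizeof(float));

    exchange_halo<<<(n + 255) / 256, 256>>>(recv_buf, send_buf, n, peer);

    /* Complete all outstanding puts and synchronize the PEs on the stream. */
    nvshmemx_barrier_all_on_stream(0);
    cudaDeviceSynchronize();

    nvshmem_free(send_buf);
    nvshmem_free(recv_buf);
    nvshmem_finalize();
    return 0;
}
```

In this style, the puts are issued from device code, so the host no longer has to orchestrate each exchange, which is the mechanism by which synchronization overhead is reduced in the solution described above.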