Highlights
- Successful design, construction, and deployment of an opportunistic data operations platform (ODOP) that can solve the data analysis bottleneck of multiphysics fluid simulations at Exascale
- Successful design, construction, and deployment of a GPU-accelerated HPC pipeline, utilizing ODOP, to execute realistic workloads on LUMI-G, reducing time to completion by more than a factor of two while all computations and data analysis tasks run concurrently
- Successful benchmarking and parallel performance testing on the LUMI supercomputer, showing excellent performance up to full system scales
Keywords | Environment, Climate/Weather, Engineering |
Technologies used | CUDA/HIP, OpenMP, MPI, Python, FastAPI, Pydantic, TinyFlux, QoA4ML |


Challenge
The main challenge we set out to address was the data analysis bottleneck that arises in Exascale fluid simulations: efficient GPU libraries, such as Astaroth, make much larger grids feasible, and the resulting system states grow accordingly. In the era of Exascale computing, data analysis can no longer be done in post-processing; efficient on-the-fly data analysis and data movement methods that take advantage of idle CPU resources must be used instead. In NEOSC we investigated whether this challenge can be tackled with a framework that makes the best possible use of the otherwise unused CPU resources, given that only a fraction of the CPU cores are needed to orchestrate the GPU computations.
Research Topic
The research background to the NEOSC study is the following: We study magnetized and turbulent plasmas in astrophysical and engineering contexts. Research questions we aim to answer include: How does the Sun generate and maintain its magnetic fields? Why does the magnetic field cause eruptive events, and how can they be predicted?
Numerical simulations are a vital tool for answering these questions. As we need to include a vast range of temporal and spatial scales, these simulations need to reach high resolutions, exceeding what can be achieved with CPU resources. We have developed an efficient and portable GPU-based solver, but its usage in CPU-based HPC workflows leads to a data explosion, practically preventing production runs at Exascale. NEOSC helps to resolve this data analysis bottleneck.
Solution
Our solution to prevent the data analysis bottleneck from slowing down important real-world applications at Exascale is to perform all the data analysis tasks that were earlier done in post-processing on the fly, concurrently with the main production run on the GPUs. During this project we have shown that this is a feasible and profitable approach: our concept was to develop the Opportunistic Data Operations Platform (ODOP), which monitors unused CPU resources and schedules data analysis and data movement tasks on them. During the benchmarking we were able to show that the data analysis tasks can be completed without any disturbance or slow-down of the main GPU application on LUMI-G when using a large fraction of the GPU nodes (up to one third of the full machine).
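The monitor-and-schedule idea behind this approach can be illustrated with a minimal Python sketch. It is not ODOP's actual interface: the class names, method names, core counts, and task names below are all hypothetical, chosen only to show how pending analysis tasks might be started when enough CPU cores are idle and held back otherwise.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """A hypothetical analysis or data-movement task."""
    name: str
    cpus: int  # number of CPU cores the task needs


@dataclass
class OpportunisticScheduler:
    """Toy scheduler (illustrative only, not the ODOP implementation)."""
    total_cpus: int     # CPU cores available on the node
    reserved_cpus: int  # cores always kept free for driving the GPUs
    pending: List[Task] = field(default_factory=list)
    running: List[Task] = field(default_factory=list)

    def submit(self, task: Task) -> None:
        self.pending.append(task)

    def idle_cpus(self, busy_cpus: int) -> int:
        """Cores free for analysis, given the monitored usage of the main run."""
        main_app = max(busy_cpus, self.reserved_cpus)
        analysis = sum(t.cpus for t in self.running)
        return max(self.total_cpus - main_app - analysis, 0)

    def schedule(self, busy_cpus: int) -> List[Task]:
        """Start pending tasks, in submission order, that fit into idle cores."""
        free = self.idle_cpus(busy_cpus)
        started, still_pending = [], []
        for task in self.pending:
            if task.cpus <= free:
                free -= task.cpus
                started.append(task)
            else:
                still_pending.append(task)
        self.pending = still_pending
        self.running.extend(started)
        return started

    def finish(self, task: Task) -> None:
        """Release the cores held by a completed task."""
        self.running.remove(task)


# Illustrative numbers only: a node with 56 usable cores, 8 of them driving GPUs.
sched = OpportunisticScheduler(total_cpus=56, reserved_cpus=8)
sched.submit(Task("spectra", 16))
sched.submit(Task("slices", 40))
sched.submit(Task("checkpoint_copy", 8))

first = sched.schedule(busy_cpus=8)   # "slices" does not fit yet
sched.finish(first[0])                # "spectra" completes, freeing 16 cores
second = sched.schedule(busy_cpus=8)  # now "slices" can start

print([t.name for t in first], [t.name for t in second])
```

In this sketch the main application is never asked to give up cores: tasks that do not fit simply wait for a later scheduling pass, which mirrors the requirement that analysis must not disturb the GPU run. A real platform would additionally need live monitoring data and a way to launch the tasks, both of which are left out here.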
