Highlights
- Exascale-ready GW Module: Developed a standalone W-engine and integrated it into the open-source GreenX library, enabling any localized-basis electronic structure code (like CP2K and FHI-aims) to perform low-scaling GW calculations for large systems
- Accelerated Convergence: Implemented a novel Γ-point extrapolation scheme, cutting down the runtime for periodic interfaces by a factor of ~3 without loss of accuracy
- GPU-Accelerated Computing: Offloaded expensive GW steps (integral evaluations, matrix operations) to GPUs, achieving speed-ups of up to 8× over CPU-only execution on hybrid CPU/GPU supercomputers
- Scalability Achieved: Demonstrated near-linear strong scaling of the GW code from 8 to 1024 CPU cores, resulting in a 67× faster time-to-solution (23 minutes on 1024 cores vs 1514 minutes on 8 cores) for a 2D semiconductor
- Open Science and Dissemination: The project delivered its outcomes to the community through open-source releases (GreenX v2.0 with W-engine, CP2K and FHI-aims code contributions) and publications, including a Phys. Rev. B article validating low-scaling GW accuracy and a submitted J. Open Source Software paper on the analytic continuation module
| Keywords | Energy; Computational Materials Science; High-Performance Computing; Electronic Structure |
| Technologies used | MPI, GPU acceleration |

Challenge
Simulating the electronic structure of material interfaces with GW accuracy posed several intertwined challenges. The foremost was computational complexity: conventional GW algorithms scale roughly as O(N⁴) with system size, making them prohibitive for systems beyond a few hundred atoms. Although recent advances had reduced the scaling to O(N³) (and linear in the number of k-points) for localized-orbital codes, even this cubic scaling, with its large prefactor, remained daunting for the thousands of atoms needed to model realistic interfaces, as the rough comparison below illustrates.
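To make the cost argument concrete, the following back-of-the-envelope sketch compares how an O(N⁴) and an O(N³) method grow with atom count. The 100-atom baseline and the unit reference cost are arbitrary illustrative assumptions, not measured numbers from the project.

```python
# Back-of-the-envelope scaling comparison: relative cost of a conventional
# O(N^4) GW calculation versus a low-scaling O(N^3) variant as the system
# grows. The 100-atom baseline where both methods cost "1 unit" is an
# arbitrary assumption for illustration only.

baseline_atoms = 100  # hypothetical reference system

for n_atoms in (100, 500, 1000, 2000):
    ratio = n_atoms / baseline_atoms
    cost_n4 = ratio ** 4   # conventional GW
    cost_n3 = ratio ** 3   # low-scaling GW
    print(f"{n_atoms:5d} atoms: O(N^4) ~ {cost_n4:9.0f}x, O(N^3) ~ {cost_n3:7.0f}x")
```

Even at 2000 atoms the cubic method is four orders of magnitude cheaper than the quartic one, yet still thousands of times more expensive than the reference system, which is why both the algorithmic and the hardware side had to be attacked.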
On top of algorithmic complexity, there was a technological challenge: adapting the code to modern HPC architectures. Emerging exascale supercomputers feature heterogeneous nodes (multi-core CPUs with attached GPUs), but the legacy GW implementations were largely MPI-only and CPU-bound. Exploiting tens of thousands of GPU cores efficiently, minimizing communication, and load-balancing the work between CPUs and GPUs required a significant redesign of the code.
In summary, the project needed to drastically reduce the computational cost of GW (through algorithmic innovation) and to enable the code on GPU-accelerated HPC platforms. This dual challenge – mathematical and technical – had to be overcome to bring GW simulations of 2D materials down to turnaround times of hours or minutes.
Research Topic
Exa4GW aimed to push the many-body GW method to the exascale in order to simulate previously inaccessible material systems such as 2D materials. The GW approximation is considered the gold standard in electronic structure theory for accurate band gaps and band alignments at interfaces. However, such interfaces (e.g. heterojunctions in transistors or batteries) often require enormous atomic models of thousands of atoms, which have been computationally intractable with existing GW implementations. By reducing the algorithmic complexity and exploiting GPU acceleration, the project unlocks highly efficient GW simulations.
Solution
First, we developed a specialized “W-engine”: a modular library component responsible for computing the screened interaction W efficiently. The engine was built by extracting and unifying the relevant GW routines from the two codebases: the existing periodic GW implementation in CP2K and analogous pieces from FHI-aims were re-factored into GreenX.
An improved Γ-point treatment was also implemented: instead of brute-force dense k-point grids to handle the Coulomb singularity, the code uses a linear extrapolation scheme that predicts the Γ-point contribution from a coarser grid (see the first sketch below). This reduced the computation time by a factor of three.
Next, to leverage modern hardware, the team undertook a full GPU acceleration of the heavy operations. The evaluation of three-center Coulomb integrals – a major cost in localized-basis GW – was ported to GPUs using low-level CUDA/HIP kernels, maximizing parallel throughput; the offload pattern is illustrated in the second sketch below.
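The Γ-point treatment can be pictured as a simple linear fit: a k-point-dependent quantity is evaluated on a few affordable grids and extrapolated to the dense-grid limit. The sketch below is a generic illustration of such an extrapolation, not the actual GreenX/CP2K implementation; the sample values and the choice of 1/N_k as extrapolation variable are assumptions made for illustration.

```python
import numpy as np

# Generic sketch of a linear dense-grid extrapolation: evaluate a
# k-point-dependent quantity on a few coarse grids and fit it linearly
# in 1/N_k to estimate the converged limit. The numbers below are
# made-up placeholders, not project data.

n_k = np.array([4, 9, 16, 25])                       # k-points per coarse grid
screened_term = np.array([1.80, 1.52, 1.41, 1.35])   # placeholder values

x = 1.0 / n_k                                        # extrapolation variable
slope, intercept = np.polyfit(x, screened_term, deg=1)

# The intercept corresponds to x -> 0, i.e. the infinitely dense k-grid limit.
print(f"extrapolated dense-grid value: {intercept:.3f}")
```

The appeal of such a scheme is that only a handful of cheap coarse-grid evaluations are needed instead of a converged brute-force grid, which is consistent with the ~3× runtime reduction reported above.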
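The GPU offload itself is done with hand-written CUDA/HIP kernels inside the production codes; as a stand-in, the high-level Python/CuPy sketch below only illustrates the general offload pattern (copy to device, compute on the GPU, copy back) for a heavy matrix operation. The library choice, array sizes, and fallback logic are assumptions for illustration, not the project's implementation.

```python
import numpy as np

# High-level stand-in for the GPU offload pattern used for the heavy
# matrix operations in low-scaling GW. The real code uses low-level
# CUDA/HIP kernels; this sketch only shows the host->device->host flow.
# Matrix sizes are arbitrary.

try:
    import cupy as xp          # GPU path if CuPy and a GPU are available
    on_gpu = True
except ImportError:
    xp = np                    # transparent CPU fallback
    on_gpu = False

rng = np.random.default_rng(0)
a_host = rng.standard_normal((2048, 2048))
b_host = rng.standard_normal((2048, 2048))

a_dev = xp.asarray(a_host)     # copy to device memory (no-op on the CPU path)
b_dev = xp.asarray(b_host)
c_dev = a_dev @ b_dev          # matrix product executed on the accelerator

c_host = xp.asnumpy(c_dev) if on_gpu else c_dev   # copy the result back
print("offloaded to GPU:", on_gpu, "| result shape:", c_host.shape)
```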