Highlights

  • New computation-friendly compression scheme for binary (Boolean) matrices
  • New matrix multiplication kernel for sparse-dense matrix operations
  • Format and kernel implemented in C++ with OpenMP, relying on Intel MKL and NVIDIA cuSPARSE, and supporting CPU (sequential, multi-threaded, SIMD) and GPU architectures
  • Experimental results show that CBM can reduce the memory footprint of real-world graphs by up to 11×, and that parallel matrix multiplication using CBM is more than 5× faster than state-of-the-art sparse-dense multiplication kernels
  • Fully compatible with PyTorch for GNN training and inference tasks


Challenge

Graph Neural Networks (GNNs) are a key technology for learning from graph-structured data and are expected to drive future AI applications on large-scale high-performance computing (HPC) systems. These models require long sequences of sparse-dense matrix multiplications, particularly involving binary adjacency matrices and dense node feature embeddings. Existing sparse matrix formats such as COO (coordinate) and CSR (compressed sparse row) are designed to store only the non-zero entries, but they do not exploit similarities between matrix rows or columns. The challenge addressed in this study was to develop a novel binary matrix format that not only achieves better compression than classical sparse formats, but also enables a matrix multiplication kernel that is competitive with state-of-the-art implementations in terms of speed and scalability.
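
To make this limitation concrete, the following minimal Python example (using NumPy and SciPy, which are independent of the CBM4Scale implementation) shows how CSR stores every non-zero column index of two nearly identical rows, whereas a delta encoding in the spirit of CBM would only record how the second row differs from the first.

    import numpy as np
    from scipy.sparse import csr_matrix

    # Two nearly identical rows of a binary adjacency matrix: CSR must
    # store every non-zero column index of both rows, even though row 1
    # differs from row 0 in a single position.
    A = np.array([[1, 1, 1, 1, 0],
                  [1, 1, 1, 1, 1]], dtype=np.int8)
    S = csr_matrix(A)

    print(S.indptr)   # [0 4 9] -> 9 column indices stored in total
    print(S.indices)  # [0 1 2 3 0 1 2 3 4]

    # A delta encoding in the spirit of CBM would keep row 0 as-is and
    # represent row 1 as "row 0 plus column 4": 5 indices instead of 9.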


Research Topic

The CBM4Scale study set out to design and implement a fundamentally new algorithm for representing and computing with sparse binary matrices, addressing a key performance bottleneck in Graph Neural Network (GNN) workloads. The result is the Compressed Binary Matrix (CBM) format and its associated matrix multiplication kernel, CBMM. Together, they offer an efficient alternative to traditional sparse formats by reducing memory usage and computational overhead, particularly in sparse-dense matrix operations that are central to GNN training and inference. By integrating CBM and CBMM into the PyTorch ecosystem, the project aims to enable significant performance gains in large-scale AI workloads running on current and future European HPC systems.

The implementation is written in C++ with OpenMP and leverages Intel MKL and NVIDIA cuSPARSE to support both CPU and GPU platforms. It is designed for shared-memory architectures and supports multi-threading and SIMD operations. CBMM is fully compatible with PyTorch and its Autograd engine and can be used transparently with models based on Graph Convolutional Networks (GCN), GraphSAGE, and Graph Isomorphism Networks (GIN) during both training and inference. The full implementation is publicly available via the project’s GitHub repository at https://github.com/cbm4scale. Additional technical details are provided in a preprint at https://arxiv.org/abs/2409.02208, and results will be presented at IPDPS 2025.
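
As a rough illustration of how this integration could look from a user's perspective, consider the hypothetical Python sketch below; the module, class, and method names (cbm, CBM, from_edge_index, matmul) are illustrative assumptions rather than the project's actual API, which should be taken from the cbm4scale GitHub repository.

    import torch
    # Hypothetical import -- the actual module and class names may differ.
    from cbm import CBM

    # COO edge list of a small undirected graph and dense node features.
    edge_index = torch.tensor([[0, 1, 1, 2],
                               [1, 0, 2, 1]])
    features = torch.randn(3, 16, requires_grad=True)

    # Compress the binary adjacency matrix into the CBM format, then
    # run the CBMM sparse-dense product (assumed method names).
    adj = CBM.from_edge_index(edge_index, num_nodes=3)
    out = adj.matmul(features)

    # Autograd compatibility, as stated above, would allow gradients to
    # flow through the CBMM operation during GNN training.
    out.sum().backward()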


Solution

The CBM format addresses the challenge of efficiently representing binary matrices by encoding each row in terms of its differences (deltas) from another, similar row of the same matrix. The rows that serve as references are chosen by a minimal row-dependency tree, also known as an arborescence. The CBMM kernel performs sparse-dense matrix multiplication by traversing this tree, reusing intermediate results wherever possible: the output row for a given matrix row is obtained by patching the already-computed output row of its parent with only the delta entries. When rows are similar, this allows the product to be computed using fewer scalar operations than the number of non-zero entries in the original matrix. The structure of the dependency tree also exposes parallelism, which is exploited using OpenMP and CUDA streams for efficient execution on CPUs and GPUs.
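
The following self-contained Python sketch illustrates this principle on a toy scale, assuming a simple parent/add/remove delta encoding; the actual CBM data structures and the optimized C++/OpenMP and CUDA kernels differ in detail.

    import numpy as np

    # Row i of the binary matrix equals row parent[i] with the columns in
    # add[i] set to 1 and the columns in sub[i] set to 0. A parent of -1
    # denotes a virtual all-zero root row.
    parent = [-1, 0, 0, 1]
    add    = [[0, 1, 2, 3], [4], [], [5]]
    sub    = [[],           [],  [3], [0]]

    X = np.random.rand(6, 8)   # dense feature matrix
    Y = np.zeros((len(parent), X.shape[1]))

    # Traverse rows so that parents precede children (the identity order
    # is already topological here) and patch the parent's output row.
    for i in range(len(parent)):
        base = Y[parent[i]] if parent[i] >= 0 else 0.0
        Y[i] = base + X[add[i]].sum(axis=0) - X[sub[i]].sum(axis=0)

    # Check against an explicit reconstruction of the binary matrix.
    A = np.zeros((len(parent), X.shape[0]))
    for i in range(len(parent)):
        if parent[i] >= 0:
            A[i] = A[parent[i]]
        A[i, add[i]] = 1
        A[i, sub[i]] = 0
    assert np.allclose(Y, A @ X)

In this toy example the matrix holds 17 non-zero entries, yet the traversal reads only 8 rows of the dense matrix (one per delta), which is exactly the kind of saving in scalar operations that CBMM exploits at scale.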