Highlights

  • Proposed a two-stage communication framework, and an enhanced version that adds vertex-cover-based partial reduce, which significantly improves communication efficiency on distributed-memory systems by addressing communication volume imbalance in sparse collectives for irregular, bandwidth-bound sparse applications such as SpMM and GNN training
  • Delivered up to 93% (on average 72%) decrease in peak communication volume and up to 67% (on average 41%) runtime improvement for SpMM on 4096 cores; accelerated GNN training by up to 24% per epoch on 2048 cores, outperforming traditional one-stage communication
  • Demonstrated excellent scalability and portability on pre-exascale systems (LUMI, MareNostrum 5) with fully in-house C/C++/MPI/OpenMP implementations and optimized pre-processing using the Hopcroft-Karp algorithm

Figure: Volume load distribution incurred by (a) the baseline, (b) two-stage task-sharing, and (c) two-stage task-sharing with partial reduce, for pattern1 on 112 processes.

Challenge

Task decomposition in scale-free applications may produce task-interaction graphs with large variation in the degrees of input vertices, including a few input vertices with very high degrees. Here, the degree of an input vertex is the number of tasks that need that vertex. In SpMM, this corresponds to a sparse matrix with a relatively small number of almost-dense rows/columns; in a GNN, it corresponds to a high-degree vertex of the input graph. When a high-degree input is assigned to a process, its send volume can be decreased through expand-task sharing, but the process still receives an amount of data proportional to the degree of that input, which often becomes a communication bottleneck.
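To make the notion of input-vertex degree concrete, it can be read directly off the sparse structure: in SpMM, the degree of input column j is simply the number of rows (tasks) whose nonzeros reference it. A minimal C++ sketch with hypothetical data (not taken from the actual implementation):

```cpp
#include <vector>

// Degree of each input vertex = number of tasks that need it.
// For SpMM this is the column degree of the sparse matrix: the process that
// owns input column j must gather data proportional to deg(j).
std::vector<int> input_degrees(const std::vector<int>& col_idx, int n_cols) {
    std::vector<int> deg(n_cols, 0);
    for (int j : col_idx) ++deg[j];   // one increment per nonzero referencing j
    return deg;
}
```

For a hypothetical scale-free pattern whose column-index array is `{0,1, 0,2, 0,3, 0,1, 0,2, 0,3}` over 4 columns, column 0 is a hub with degree 6 while the others have degree 2, so the owner of column 0 faces a receive volume three times that of its peers.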


Research Topic

This research addressed the development of a two-stage communication optimization framework to improve the scalability of irregular, bandwidth-bound sparse applications, such as Sparse Matrix Dense Matrix Multiplication (SpMM) and Graph Neural Network (GNN) training, on distributed-memory systems. The approach balances processes’ communication volume loads by exploiting the structure of sparse collective patterns and applying a novel two-stage communication scheme with expand-task sharing and partial reduce techniques.
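The benefit of partial reduce can be illustrated with a small, hypothetical sketch (not the paper's code): if the contributions to one high-degree input are grouped so that each group sums locally before sending, the owner's receive volume drops from one message per contributor to one message per group, while the reduced value is unchanged.

```cpp
#include <vector>

// Hypothetical illustration of partial reduce for one high-degree input:
// each group of neighbouring contributors combines its values locally and
// sends a single message, so the owner receives n_groups messages instead
// of contribs.size() messages.
std::vector<double> partial_reduce(const std::vector<double>& contribs,
                                   int n_groups) {
    std::vector<double> combined(n_groups, 0.0);
    for (std::size_t i = 0; i < contribs.size(); ++i)
        combined[i % n_groups] += contribs[i];  // round-robin grouping (illustrative)
    return combined;
}
```

With 8 unit contributions and 2 groups, the receive volume falls from 8 messages to 2, and the total (the fully reduced value) is preserved; the actual framework chooses the grouping from the communication pattern rather than round-robin.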


Solution

This study addresses the communication bottlenecks caused by high-degree vertices in task-interaction graphs, common in scale-free applications like SpMM and GNNs, by introducing a vertex-cover-based partial reduce method. When such high-degree vertices are assigned to a single process, they lead to excessive incoming data, overwhelming the receiver despite the decrease in send volume. To alleviate this, the developed solution enables neighbouring processes to perform a partial local reduce before sending their data. This is guided by a minimum vertex cover model applied to bipartite graphs representing communication between process pairs. By selecting a vertex cover that maximizes reduction work on the sending side, the approach minimizes the receive-side volume load without significantly disrupting the computational load balance set during the initial task/data partitioning.

Although the vertex cover model does not explicitly maintain communication volume balance, it significantly reduces message sizes and receive volumes, and it serves as a pre-processing step for the broader task-sharing framework, ultimately improving scalability and efficiency in distributed sparse applications.
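A minimum vertex cover of a bipartite graph can be extracted from a maximum matching via König's theorem. The sketch below is a hypothetical interface, not the paper's implementation, and it uses the simpler Kuhn augmenting-path algorithm in place of the Hopcroft-Karp algorithm mentioned above; both compute a maximum matching, Hopcroft-Karp just does so faster:

```cpp
#include <utility>
#include <vector>

// Minimum vertex cover of a bipartite graph via maximum matching + König's
// theorem. Left/right vertices stand for the two processes of a communicating
// pair; edges stand for the data items exchanged between them (illustrative).
struct BipartiteCover {
    int nL, nR;
    std::vector<std::vector<int>> adj;   // adj[u] = right neighbours of left u
    std::vector<int> matchL, matchR;     // matching partners, -1 if unmatched
    std::vector<char> visL, visR;

    BipartiteCover(int nL, int nR)
        : nL(nL), nR(nR), adj(nL), matchL(nL, -1), matchR(nR, -1) {}

    void add_edge(int u, int v) { adj[u].push_back(v); }

    bool augment(int u) {                // Kuhn: augmenting path from left u
        for (int v : adj[u]) {
            if (visR[v]) continue;
            visR[v] = 1;
            if (matchR[v] == -1 || augment(matchR[v])) {
                matchL[u] = v;
                matchR[v] = u;
                return true;
            }
        }
        return false;
    }

    // Returns {cover_left, cover_right}; their union is a minimum vertex cover.
    std::pair<std::vector<int>, std::vector<int>> min_vertex_cover() {
        for (int u = 0; u < nL; ++u) {   // build a maximum matching
            visR.assign(nR, 0);
            augment(u);
        }
        // König: alternating search from unmatched left vertices.
        visL.assign(nL, 0);
        visR.assign(nR, 0);
        std::vector<int> stack;
        for (int u = 0; u < nL; ++u)
            if (matchL[u] == -1) { visL[u] = 1; stack.push_back(u); }
        while (!stack.empty()) {
            int u = stack.back(); stack.pop_back();
            for (int v : adj[u]) {       // cross a non-matching edge to v,
                if (visR[v]) continue;   // then a matching edge back left
                visR[v] = 1;
                int w = matchR[v];
                if (w != -1 && !visL[w]) { visL[w] = 1; stack.push_back(w); }
            }
        }
        std::vector<int> coverL, coverR;
        for (int u = 0; u < nL; ++u) if (!visL[u]) coverL.push_back(u);
        for (int v = 0; v < nR; ++v) if (visR[v]) coverR.push_back(v);
        return {coverL, coverR};
    }
};
```

In this setting, a cover vertex picked on the sending side marks data items whose reduction work can be pushed to the sender, so among the minimum covers one would prefer those placing more vertices on the sending side; how that tie-breaking is done is the framework's design choice and is not shown here.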