Highlights

  • Addressing the large-scale computational needs of LLM training through a faster and more efficient sparse training step included in the VENOM-TRAINING software tool
  • Creation of sparse numerical kernels and supporting technologies – a semi-structured pruning technique – as key new components of the VENOM-TRAINING software
  • The development of a gradual magnitude pruning technique to preserve the accuracy of the trained model
  • All developments realized for execution on GPUs, which are the key computing elements on all extreme-scale computing resources

[1] VENOM format overview shown as a combination of block-wise (1), vector-wise (2) and N:M (3) pruning
[2] Representation of the auxiliary data-structures required by the compressed storage format for VENOM

Challenge

Training modern LLMs demands substantial investment, costing millions of dollars and consuming vast amounts of energy, with significant carbon emissions as a result. For example, GPT-3’s training cost is estimated at around $12 million. As LLMs continue to evolve, successor models such as GPT-4 are widely believed to be substantially larger than GPT-3 and its 175 billion parameters. This dramatic growth in model size is making the training process increasingly inaccessible and cost-prohibitive for many.

Sparse kernels have long been effective for extreme levels of sparsity in scientific computing. However, their adoption in machine learning presents different challenges. One major difficulty is the need to generate semi-structured sparsity patterns that maintain model accuracy while still enabling performance gains—even at lower sparsity levels. Additionally, these new kernels must be integrated into complex frameworks such as PyTorch, where the integration process itself poses a significant challenge. Achieving true end-to-end speedup requires overcoming numerous technical hurdles and addressing many corner cases throughout the stack.


Research Topic

The ESPLAG study enabled the training of Machine Learning models on NVIDIA GPUs using sparse versions of the numerical kernels that take advantage of specific hardware components (Sparse Tensor Cores). The study built upon previous achievements in end-to-end inference tasks for LLMs. Implementing a sparse version of the complete training step required new, highly performant sparse kernels to complement our existing Sparse Matrix-Matrix Multiplication (SpMM) kernel. This development was co-designed with a semi-structured Gradual Magnitude Pruning (GMP) technique that induces structure in the sparsity pattern without losing much accuracy. These developments open a path to saving time, energy, and memory, and to training better models with fewer computational resources.
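To illustrate the pruning side of this co-design, the sketch below combines N:M magnitude pruning (keep the n largest-magnitude weights in every group of m) with a cubic gradual-pruning schedule of the kind commonly used for GMP. Function names and the exact schedule are illustrative assumptions, not the project's implementation.

```python
import numpy as np

def gmp_sparsity(step, total_steps, final_sparsity, start_frac=0.1):
    """Cubic gradual-magnitude-pruning schedule (illustrative):
    sparsity ramps from 0 to final_sparsity over the pruning window."""
    start = int(start_frac * total_steps)
    if step < start:
        return 0.0
    t = min(1.0, (step - start) / (total_steps - start))
    return final_sparsity * (1.0 - (1.0 - t) ** 3)

def prune_n_m(weights, n=2, m=4):
    """Zero all but the n largest-magnitude entries in each group of m
    consecutive weights along the last axis (e.g. 2:4 sparsity)."""
    w = weights.reshape(-1, m)
    # indices of the (m - n) smallest-magnitude entries per group
    drop = np.argsort(np.abs(w), axis=1)[:, : m - n]
    mask = np.ones_like(w, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (w * mask).reshape(weights.shape)
```

In a GMP setup, the schedule's target sparsity would drive how aggressively (and at which training steps) the N:M mask is recomputed from the current weight magnitudes.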


Solution

Sparse computation significantly improves training efficiency and scalability at exascale by reducing data traffic and communication overhead in model-parallel scenarios. Innovations like sophisticated pruning algorithms, specialized compressed storage formats, and co-designed, highly optimized libraries of GPU kernels enable sparse methods to surpass dense computations in speed with minimal accuracy loss. Our solution combines a semi-structured pruning technique (N:M) (Figure [1]), a tailored compressed storage format (VENOM) (Figure [2]), and a co-designed kernel library (Spatha) (Figure [3]) to explore the limits of sparse training. The resulting prototype showcases the potential of this solution and traces a clear path toward taking sparse training to the production level.
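To make the semi-structured pattern concrete, the following is a simplified sketch of the vector-wise column-selection step behind a V:N:M-style pattern: within each block of v rows and m columns, the columns with the largest L2 norm are kept, and the selection is shared by all v rows of the block. This is only one of the stages shown in Figure [1] (the full VENOM format additionally applies N:M pruning inside the selected columns), and all names here are illustrative.

```python
import numpy as np

def vector_wise_mask(weights, v=4, n=2, m=8):
    """Illustrative column selection for a V:N:M-style pattern: for each
    block of v rows and m columns, keep the n columns with the largest
    L2 norm (shared across the v rows) and mask out the rest."""
    rows, cols = weights.shape
    assert rows % v == 0 and cols % m == 0
    mask = np.zeros_like(weights, dtype=bool)
    for r in range(0, rows, v):
        for c in range(0, cols, m):
            block = weights[r:r + v, c:c + m]
            norms = np.linalg.norm(block, axis=0)  # per-column L2 norm
            keep = np.argsort(norms)[-n:]          # n strongest columns
            mask[r:r + v, c + keep] = True
    return mask
```

Sharing the column selection across a vector of rows is what makes the pattern amenable to a compact compressed storage format with small per-block index arrays.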


[3] High-level design of the SDDMM kernel tailored for the VENOM format
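The SDDMM kernel referenced in Figure [3] computes a dense-dense matrix product sampled only at the positions of a sparse pattern, which is what sparse training needs to produce gradients that stay in the sparse format. A naive NumPy reference (purely illustrative; the Spatha kernel operates on the compressed VENOM layout with Sparse Tensor Cores):

```python
import numpy as np

def sddmm(a, b, mask):
    """Sampled dense-dense matrix multiplication: compute (a @ b) only
    where mask is nonzero, leaving all other entries at zero."""
    out = np.zeros(mask.shape)
    rows, cols = np.nonzero(mask)
    for i, j in zip(rows, cols):
        out[i, j] = a[i, :] @ b[:, j]  # one dot product per nonzero
    return out
```

Because only the sampled dot products are computed, the work scales with the number of nonzeros in the pattern rather than with the full output matrix.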