arXiv

HASTE: Hardware-Aware Dynamic Sparse Training for Large Output Spaces

June 2, 2026 · Nasib Ullah, Jinbin Zhang, Jean Lucien Randrianantenaina, Erik Schultheis, Rohit Babbar

Abstract:

Extreme multi-label classification (XMC) requires training models on output spaces containing millions of labels, which makes the output layer a significant memory and computational bottleneck. Although sparsity techniques can lower arithmetic complexity, they often fail to deliver corresponding wall-clock speedups. The gap stems from irregular memory access patterns, poor hardware utilization, or the need for auxiliary architectural components, problems that are most pronounced when the label distribution is long-tailed.

To address these challenges, we propose group-shared fixed fan-in sparsity, a semi-structured design for the output layer. In this design, semantically similar labels share a common sparse input pattern while each label keeps its own independent weights. The grouping introduces an inductive bias aligned with the task by encouraging related labels to use the same feature subset. It also reduces index memory overhead, increases feature reuse across labels, and enables efficient GPU execution through custom CUDA kernels that use modern accelerator primitives.
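The group-shared fixed fan-in idea described above can be illustrated with a minimal pure-Python sketch. It is not the authors' CUDA implementation: the function names (`make_group_shared_layer`, `forward`, `index_entries`), the random choice of input indices, and the assumption that every group holds the same number of consecutive labels are all hypothetical choices made for illustration. The sketch shows that one index list per group is gathered once and then reused by every label in that group through a small dense product.

```python
import random


def make_group_shared_layer(num_groups, labels_per_group, in_features, fan_in, seed=0):
    """Build a toy group-shared fixed fan-in layer (illustrative assumption).

    Each group stores ONE sorted index list of length `fan_in`, shared by all
    of its labels, plus a (labels_per_group x fan_in) block of independent weights.
    """
    rng = random.Random(seed)
    indices = [sorted(rng.sample(range(in_features), fan_in)) for _ in range(num_groups)]
    weights = [
        [[rng.gauss(0.0, 0.1) for _ in range(fan_in)] for _ in range(labels_per_group)]
        for _ in range(num_groups)
    ]
    return indices, weights


def forward(x, indices, weights):
    """Compute one logit per label for a single input vector x."""
    logits = []
    for idx, block in zip(indices, weights):
        # One gather per group; the gathered features are reused by every label.
        gathered = [x[j] for j in idx]
        for w in block:
            # Small dense dot product on the shared feature subset.
            logits.append(sum(wi * gi for wi, gi in zip(w, gathered)))
    return logits


def index_entries(num_groups, labels_per_group, fan_in):
    """Stored column indices: per-label patterns vs. group-shared patterns."""
    per_label = num_groups * labels_per_group * fan_in
    group_shared = num_groups * fan_in
    return per_label, group_shared
```

In this toy setting the index storage shrinks by a factor equal to the group size, and each group's computation becomes a gather followed by a dense block product, which is the kind of regular access pattern that maps well onto GPU hardware.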

Rather than relying on auxiliary objectives, our method leverages the inherent long-tailed nature of XMC. We decompose the output layer into a compact dense head for frequent labels and a group-shared sparse tail for the remaining labels. This structure ensures a strong gradient signal while preserving the memory advantages of sparse representations. Kernel-level microbenchmarks demonstrate that group-shared fixed fan-in effectively converts arithmetic reductions into tangible wall-clock time savings. We observe speedups of up to $4.4\times$ in the forward pass and up to $25\times$ in the backward pass compared to standard fixed fan-in sparsity, all while performing within a few percentage points of a FLOPs-matched dense baseline. On large-scale XMC benchmarks, our approach achieves precision@k scores that match or surpass previous sparse baselines, effectively closing the performance gap with dense models.
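The head/tail decomposition can be sketched as a simple split on label frequency. The snippet below is a hedged illustration only: `split_head_tail`, `param_count`, the fixed `head_size` cutoff, and the example sizes are assumptions, since the abstract does not say how the head is chosen or how large it is.

```python
def split_head_tail(label_counts, head_size):
    """Return (head_ids, tail_ids).

    The `head_size` most frequent labels form the dense head; all remaining
    labels go to the sparse tail. Ties are broken by label id (an assumption).
    """
    order = sorted(range(len(label_counts)), key=lambda l: (-label_counts[l], l))
    head = sorted(order[:head_size])
    tail = sorted(order[head_size:])
    return head, tail


def param_count(num_head, num_tail, in_features, fan_in):
    """Weights stored: full dense rows for the head, fan_in weights per tail label."""
    return num_head * in_features + num_tail * fan_in


# Toy example: six labels with long-tailed training frequencies.
counts = [50, 3, 120, 1, 7, 2]
head_ids, tail_ids = split_head_tail(counts, head_size=2)
```

Here the two most frequent labels receive dense rows, while the other four keep only `fan_in` weights each, so most of the weight budget goes to the labels that see the most training signal.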

Source: arXiv Generated at: 2026-06-02 00:00:00 UTC