RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models
Title: RT-Lynx: Optimizing GEMM Sparsity for Diffusion Models
Abstract:
Although Diffusion Transformers (DiTs) deliver exceptional results in image generation, they incur high inference costs. Previous efforts have lowered these costs through techniques such as distillation and quantization, yet semi-structured sparsity, which can cut FLOPs by nearly half, has received little attention. This gap exists because most current strategies concentrate on weight sparsification; however, removing half of the weights often strips away essential model capabilities and degrades generation quality.
In contrast, our research reveals that activations within DiTs are inherently sparse and are more resilient to N:M semi-structured sparsification than weights. Based on this finding, we propose shifting the focus from sparsifying weights to sparsifying activations. We introduce RT-Lynx, a method that applies N:M sparsification to activations and uses error-compensation mechanisms to minimize accuracy drops. Additionally, we have developed highly optimized CUDA kernels tailored to this approach, yielding an average speedup of 1.55x in linear layers. Extensive experiments on various diffusion models confirm that our approach maintains the original generation fidelity while significantly boosting inference speed.
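To make the N:M pattern concrete, the following is a minimal sketch of 2:4 sparsification applied to a vector of activations. It assumes a magnitude-based selection rule (keep the N largest-magnitude entries in each group of M), which is a common convention but is not confirmed by the abstract as the rule RT-Lynx uses. The function name `nm_sparsify` and the sample values are hypothetical, and the error-compensation step and the CUDA kernels are not shown.

```python
from typing import List


def nm_sparsify(x: List[float], n: int = 2, m: int = 4) -> List[float]:
    """Keep the n largest-magnitude entries in every group of m; zero the rest.

    Assumption: magnitude-based selection. The abstract does not state
    which criterion the authors actually use.
    """
    if len(x) % m != 0:
        raise ValueError("length of x must be a multiple of m")
    out: List[float] = []
    for start in range(0, len(x), m):
        group = x[start:start + m]
        # Rank positions by descending magnitude; ties go to the lower index.
        ranked = sorted(range(m), key=lambda i: (-abs(group[i]), i))
        keep = set(ranked[:n])
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


# Hypothetical activation values: two groups of four.
activations = [0.9, -0.05, 0.02, -1.3, 0.0, 0.4, -0.01, 0.03]
sparse = nm_sparsify(activations)
print(sparse)
```

With 2:4 sparsity, every group of four activations holds at most two nonzeros, so a sparse-aware GEMM kernel can skip roughly half of the multiply-accumulate operations in the corresponding linear layer.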
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





