Where to Refine, When to Stop: Rethinking Redundancy via Latent Discrepancy for Efficient Visual Autoregressive Generation
Title: Optimizing Redundancy in Visual Autoregressive Models: A Latent Discrepancy Approach for Efficient Generation
Visual Autoregressive (VAR) models are renowned for producing high-quality images, yet they often face considerable inference latency, particularly when generating at high resolutions. While recent acceleration strategies have attempted to address this by using heuristic measures based on layer features to prune tokens, these methods frequently struggle with complex contextual semantics. Consequently, they often fail to accurately identify redundant computations and lack adaptability across different prompts.
To address these limitations, this study reevaluates the concept of redundancy in VAR models by examining its direct impact on pixel-space generation. We introduce "Latent Discrepancy," a unified metric that quantifies a token’s contribution by measuring how model states change over the course of generation. Our analysis indicates that redundancy can be identified more precisely when guided by signals from image latents or pixel space. Furthermore, we observe that under classifier-free guidance (CFG), the convergence pattern of the discrepancy between the conditional and unconditional branches varies significantly from prompt to prompt.
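The abstract does not give the formula for Latent Discrepancy, so the following is only a minimal sketch of one plausible reading: score each token by the relative L2 change of its latent vector between two consecutive generation steps. The function names, the choice of a relative L2 norm, and the assumption that both steps have already been brought to the same resolution are illustrative and not taken from the paper.

```python
import math

def relative_discrepancy(current, previous, eps=1e-12):
    """Relative L2 distance between two equal-length vectors (assumed metric)."""
    if len(current) != len(previous):
        raise ValueError("vectors must have the same length")
    diff = math.sqrt(sum((c - p) ** 2 for c, p in zip(current, previous)))
    ref = math.sqrt(sum(p * p for p in previous))
    # eps keeps the ratio finite when the reference vector is all zeros
    return diff / (ref + eps)

def per_token_discrepancy(prev_latents, curr_latents):
    """One score per token: how much its latent moved since the previous step."""
    if len(prev_latents) != len(curr_latents):
        raise ValueError("both steps must contain the same number of tokens")
    return [relative_discrepancy(c, p) for p, c in zip(prev_latents, curr_latents)]

# Hypothetical toy data: two tokens, two latent channels each.
prev_step = [[1.0, 0.0], [0.0, 2.0]]
curr_step = [[1.0, 0.0], [0.0, 4.0]]
scores = per_token_discrepancy(prev_step, curr_step)
```

In this toy input the first token does not move and receives a score of zero, while the second token doubles in magnitude and receives a score close to one. Under this reading, low-scoring tokens are the candidates for being treated as redundant.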
Leveraging these insights, we propose LD-Pruning (Latent Discrepancy Pruning), a training-free framework that uses latent discrepancy to remove redundant computation. It combines decoding-free region selection with adaptive skipping of the unconditional branch. Extensive experiments show that LD-Pruning significantly lowers inference latency without compromising generation quality, achieving a speedup of up to 2.35x on the Infinity-8B model.
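The two components named here can be illustrated with a small, hedged sketch. It assumes per-token discrepancy scores are already available, keeps a fixed fraction of the highest-scoring tokens for refinement, and decides to skip the unconditional CFG branch once the conditional/unconditional gap has stayed below a threshold for a few consecutive steps. The helper names, `keep_ratio`, `threshold`, and `patience` are hypothetical; the abstract does not specify the selection rule or the stopping criterion, and what replaces the unconditional output after skipping is left open here.

```python
import math

def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def select_active_tokens(scores, keep_ratio):
    """Indices of the highest-discrepancy tokens (assumed top-k rule)."""
    k = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def should_skip_uncond(gap_history, threshold, patience=2):
    """True once the last `patience` cond/uncond gaps are all below threshold."""
    if len(gap_history) < patience:
        return False
    return all(g < threshold for g in gap_history[-patience:])

# Hypothetical values for illustration only.
active = select_active_tokens([0.1, 0.9, 0.5, 0.0], keep_ratio=0.5)
guided = cfg_combine(cond=[2.0], uncond=[1.0], scale=3.0)
skip_now = should_skip_uncond([0.5, 0.05, 0.04], threshold=0.1)
```

Because the gap history is tracked per prompt, the step at which skipping begins can differ from one prompt to another, which is the kind of prompt-dependent behavior the abstract describes.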
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





