arXiv

BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference

June 2, 2026 · Xiaoyou Wu, Cheng-Jhih Shih, Binfei Ji, Yong Liu, Yingyan Celine Lin

Abstract:

Diffusion language models (dLLMs) offer a compelling alternative to traditional autoregressive decoding by iteratively denoising multiple token positions in parallel. In practice, however, block-wise dLLM inference faces a granularity trade-off. Smaller blocks maintain strong local conditioning but require more denoising steps. Larger blocks enable more parallelism but risk premature commitments and the accumulation of cache errors. Current acceleration techniques generally select a single block size for each request and therefore fail to leverage the complementary benefits of different block sizes.

This study demonstrates that block size serves as a valuable branching dimension. Utilizing different block sizes generates related yet distinct KV-cache trajectories. These branches typically share an initial prefix, diverge at semantically critical positions, and subsequently reconverge on syntactically simpler tokens. Drawing inspiration from this structural insight, we introduce BlockBatch, a training-free online inference framework that processes multiple block-size branches for a single request within a batched forward pass.

BlockBatch manages these concurrent branches via leader-based synchronization, confidence-gated token merging, and periodic full-sequence refreshes, which realign local block updates with a globally consistent KV state. Evaluations across four datasets and three representative dLLMs indicate that BlockBatch reduces the average number of denoising function evaluations (NFEs) by 26.6% and delivers a 1.33$\times$ average end-to-end speedup over Fast-dLLM without compromising accuracy. These findings highlight block-size diversity as a practical and previously underutilized axis for branch-parallel dLLM inference.
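
The abstract names confidence-gated token merging but does not specify the rule. As a rough illustration only, the sketch below assumes that each block-size branch proposes a token and a confidence score per position, and that a position is committed when the best proposal across branches clears a threshold. The function name `merge_branches`, the `MASK` marker, the `Proposal` layout, and the threshold value are all hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical marker for a position that is still masked (not yet decoded).
MASK = None

# Hypothetical per-position proposal: (token id or MASK, confidence in [0, 1]).
Proposal = Tuple[Optional[int], float]


def merge_branches(
    branches: Sequence[Sequence[Proposal]],
    threshold: float = 0.9,  # assumed gate value, not from the paper
) -> List[Optional[int]]:
    """Merge per-position proposals from several block-size branches.

    For each position, take the highest-confidence unmasked proposal across
    branches and commit it only if its confidence is at least `threshold`;
    otherwise the position stays masked for a later denoising step.
    """
    if not branches:
        return []
    length = len(branches[0])
    if any(len(branch) != length for branch in branches):
        raise ValueError("all branches must cover the same positions")

    merged: List[Optional[int]] = []
    for pos in range(length):
        best_token, best_conf = MASK, 0.0
        for branch in branches:
            token, conf = branch[pos]
            if token is not MASK and conf > best_conf:
                best_token, best_conf = token, conf
        # Gate: commit only confident proposals.
        merged.append(best_token if best_conf >= threshold else MASK)
    return merged


# Two hypothetical branches (e.g. a small and a large block size).
small_block = [(5, 0.95), (7, 0.60), (MASK, 0.0)]
large_block = [(5, 0.97), (8, 0.92), (9, 0.50)]
print(merge_branches([small_block, large_block]))  # [5, 8, None]
```

In this toy run the branches agree on the first position, the larger block supplies a confident token for the second, and the third stays masked because no proposal clears the gate. The actual BlockBatch procedure also involves leader-based synchronization and periodic full-sequence refreshes, which this sketch does not model.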
