Test-Time Compute Scaling for ASR with Depth-Conditioned Looped Transformers
Title: Scaling Test-Time Compute in ASR via Depth-Conditioned Looped Transformers
Abstract: Standard end-to-end Automatic Speech Recognition (ASR) architectures use acoustic encoders whose depth is fixed at inference time. This makes it difficult to improve recognition accuracy by spending more test-time compute, since doing so has traditionally required training a larger model. Recurrently reusing a shared Transformer block is a natural alternative, but our analysis shows that naive looping fails to make full use of the extra compute. To address this, we propose LARM, a depth-conditioned looped Transformer that turns recurrent encoder depth into a tunable axis for test-time computation.
LARM combines four mechanisms: sparse CTC checkpoints, supervision-clock embeddings, FiLM-based depth conditioning, and delayed soft-posterior feedback. Together, they organize the recurrent process into distinct recognition checkpoints interspersed with latent refinement stages, allowing the shared weights to specialize across recurrent steps. Experiments on the LibriSpeech dataset show that LARM’s Word Error Rate (WER) decreases as the number of inference loops grows, reaching performance that rivals deeper models with unshared parameters. These findings indicate that test-time compute scaling can be extended from autoregressive language model reasoning to continuous, non-autoregressive speech recognition.
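The idea of reusing one block with depth conditioning can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no architectural details, so a toy residual layer stands in for the Transformer block, FiLM is applied in its generic form (a per-feature scale and shift chosen by loop index), and the names `film`, `shared_block`, `looped_encoder`, `film_table`, and `checkpoint_every` are all hypothetical. Supervision-clock embeddings and soft-posterior feedback are omitted because the abstract does not specify them.

```python
import math


def film(h, gamma, beta):
    """Feature-wise linear modulation (FiLM): scale and shift each feature."""
    return [g * x + b for x, g, b in zip(h, gamma, beta)]


def shared_block(h, weights, gamma, beta):
    """One pass through the shared layer: h + tanh(W @ FiLM(h)).

    A toy residual layer standing in for a Transformer block (assumption).
    """
    m = film(h, gamma, beta)
    return [x + math.tanh(sum(w * v for w, v in zip(row, m)))
            for x, row in zip(h, weights)]


def looped_encoder(frames, weights, film_table, n_loops, checkpoint_every):
    """Reuse the same weights n_loops times, conditioned on the loop index.

    film_table[i] holds the (gamma, beta) pair for loop i (hypothetical layout).
    Returns the final frame states and the loop counts at which a CTC head
    would be evaluated in this sketch.
    """
    if n_loops > len(film_table):
        raise ValueError("no FiLM parameters for the requested depth")
    states = [list(f) for f in frames]
    checkpoints = []
    for step in range(n_loops):
        gamma, beta = film_table[step]
        states = [shared_block(h, weights, gamma, beta) for h in states]
        if (step + 1) % checkpoint_every == 0:
            checkpoints.append(step + 1)  # sparse checkpoint, not every loop
    return states, checkpoints


# Example: 2 frames of 3 features, 6 loops, a checkpoint every 3 loops.
frames = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
weights = [[0.2, 0.0, -0.1], [0.0, 0.3, 0.1], [-0.2, 0.1, 0.0]]
film_table = [([1.0 + 0.1 * i] * 3, [0.01 * i] * 3) for i in range(6)]
states, checkpoints = looped_encoder(frames, weights, film_table, 6, 3)
```

In this sketch, `n_loops` is the only quantity that changes when more inference compute is requested; the parameter count stays fixed because `weights` is shared across every loop.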
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




