MURMUR: An Efficient Inference System for Long-Form ASR
Title: Murmur: A High-Performance Inference Framework for Long-Form Speech Recognition
Abstract:
Achieving both high accuracy and low latency in long-form automatic speech recognition (ASR) remains a significant challenge, as current solutions typically force a compromise between the two. Traditional chunk-based architectures process audio in parallel segments to keep latency low; however, this approach often sacrifices cross-chunk context and relies on fragile heuristics to reconcile speaker identities and timestamps at segment boundaries. Conversely, long-context ASR models process the entire recording in a single pass to improve accuracy, but they run roughly ten times slower than chunk-based methods.
To resolve this dilemma, we introduce Murmur, an inference system that sidesteps the trade-off by operating at two levels. At the inter-chunk level, we revisit chunk-based pipelines for modern long-context ASR models, treating chunk size as a tunable hyperparameter. Our findings indicate that intermediate chunk sizes offer the best balance between accuracy and latency. At the intra-chunk level, we exploit attention sparsity by implementing a sliding-window key-value (KV) cache eviction policy that applies to both output and speech tokens.
Experimental results on the AMI-IHM dataset demonstrate that Murmur achieves accuracy comparable to single-pass models while reducing latency by a factor of 4.2. Furthermore, incorporating token eviction yields additional efficiency improvements with negligible impact on accuracy, resulting in less than a 1% relative degradation in time-constrained minimum-permutation word error rate (tcpWER). The source code for Murmur is publicly accessible at https://github.com/uw-syfi/Murmur.
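The two levels described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the Murmur code base: the names `split_into_chunks` and `SlidingWindowKVCache`, the sample-based chunk boundaries, and the uniform treatment of speech and output tokens are all assumptions made for illustration. The first function splits a recording into chunks whose size is a tunable parameter; the class keeps only the most recent KV entries and evicts older ones.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


def split_into_chunks(num_samples: int, chunk_samples: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) sample ranges that cover the recording.

    chunk_samples is the tunable chunk size (hypothetical parameter name).
    """
    if chunk_samples <= 0:
        raise ValueError("chunk_samples must be positive")
    return [
        (start, min(start + chunk_samples, num_samples))
        for start in range(0, num_samples, chunk_samples)
    ]


class SlidingWindowKVCache:
    """Illustrative KV cache that retains only the last `window` tokens.

    Assumption: speech and output tokens share one window; the real
    system may handle them differently.
    """

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        # A deque with maxlen drops the oldest entry automatically on append.
        self._entries: Deque[Tuple[int, Any, Any]] = deque(maxlen=window)

    def append(self, position: int, key: Any, value: Any) -> None:
        """Store one token's key/value pair, evicting the oldest if full."""
        self._entries.append((position, key, value))

    def positions(self) -> List[int]:
        """Token positions still held in the cache, oldest first."""
        return [position for position, _, _ in self._entries]

    def __len__(self) -> int:
        return len(self._entries)
```

With a window of 3, appending tokens at positions 0 through 4 leaves only positions 2, 3, and 4 in the cache, so memory and attention cost per decoding step stay bounded by the window rather than growing with the chunk length.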
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




