arXiv

FastSLM: Hierarchical Temporal Abstraction for Efficient Long-Form Speech Adaptation

June 2, 2026 · Junseok Lee, Sangyong Lee, Chang-Jae Chun

Original: arXiv:2601.06199v3. Abstract: Scaling Multimodal Large Language Models (MLLMs) to long-form speech is bottlenecked by the explosive growth of input tokens. Unlike images or videos, audio lacks overlapping information, making extreme 1-token compression highly susceptible to the loss of fine-grained acoustic cues. To overcome this, we propose FastSLM, a token-efficient architecture featuring the Hierarchical Temporal Abstractor (HTA). HTA progressively distills non-overlapping acoustic features across multiple temporal scales, achieving an extreme compression rate of 1.67 tokens per second (a 97% reduction) without losing critical context. Experimental results show that FastSLM achieves competitive performance with state-of-the-art models on long-form benchmarks despite operating with significantly fewer FLOPs and parameters. The source code and model checkpoints are available at https://anonymous.4open.science/r/FastSLM-8BD3.

Rewritten: Extending Multimodal Large Language Models (MLLMs) to long-form speech is limited by how quickly the number of input tokens grows with audio length. Unlike images or video, audio lacks overlapping, redundant information, so extreme compression down to a single token tends to discard fine-grained acoustic cues. To address this, we introduce FastSLM, a token-efficient architecture built around the Hierarchical Temporal Abstractor (HTA). HTA progressively distills non-overlapping acoustic features across multiple temporal scales, reaching a compression rate of 1.67 tokens per second (a 97% reduction) while preserving critical context. Our experiments show that FastSLM performs competitively with state-of-the-art models on long-form benchmarks while using significantly fewer FLOPs and parameters. The source code and model checkpoints are available at https://anonymous.4open.science/r/FastSLM-8BD3.
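
To put the stated rate in perspective, the short sketch below works out how many speech tokens an hour of audio would produce at 1.67 tokens per second. The abstract does not say what baseline the 97% reduction is measured against; the 50 tokens-per-second baseline used here is an assumption chosen only because it is consistent with the two stated figures, and the function names are illustrative.

```python
def tokens_for_duration(seconds: float, tokens_per_second: float) -> int:
    """Approximate number of speech tokens produced for a clip of this length."""
    return round(seconds * tokens_per_second)


def reduction(baseline_rate: float, compressed_rate: float) -> float:
    """Fractional reduction in token count relative to the baseline rate."""
    return 1.0 - compressed_rate / baseline_rate


# Rate stated in the abstract.
COMPRESSED_RATE = 1.67
# ASSUMPTION: baseline of 50 tokens per second; not given in the abstract.
BASELINE_RATE = 50.0

if __name__ == "__main__":
    one_hour = 3600.0
    print("baseline tokens/hour:  ", tokens_for_duration(one_hour, BASELINE_RATE))
    print("compressed tokens/hour:", tokens_for_duration(one_hour, COMPRESSED_RATE))
    print("reduction: {:.0%}".format(reduction(BASELINE_RATE, COMPRESSED_RATE)))
```

Under that assumed baseline, an hour of speech drops from 180,000 tokens to about 6,000, which matches the roughly 97% reduction quoted above.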

Source: arXiv · Generated at: 2026-06-02 00:00:00 UTC