Global News Digest

arXiv

FastSLM: Hierarchical Temporal Abstraction for Efficient Long-Form Speech Adaptation

Original: arXiv:2601.06199v3 Announce Type: replace-cross Abstract: Scaling Multimodal Large Language Models (MLLMs) to long-form speech is bottlenecked by the explosive growth of input tokens. Unlike images or videos, audio lacks overlapping information, making extreme 1-token compression highly susceptible to the loss of fine-grained acoustic cues. To overcome this, we propose FastSLM, a token-efficient architecture featuring the Hierarchical Temporal Abstractor (HTA). HTA progressively distills non-overlapping acoustic features across multiple temporal scales, achieving an extreme compression rate of 1.67 tokens per second, a 97% reduction, without losing critical context. Experimental results show that FastSLM achieves competitive performance with state-of-the-art models on long-form benchmarks despite operating with significantly fewer FLOPs and parameters. The source code and model checkpoints are available at https://anonymous.4open.science/r/FastSLM-8BD3.

Rewritten: Extending Multimodal Large Language Models (MLLMs) to long-form audio is hindered by the rapid growth in input tokens. Because audio lacks the redundant, overlapping information found in images and video, extreme single-token compression tends to lose fine-grained acoustic detail. To address this, we introduce FastSLM, a token-efficient architecture built around the Hierarchical Temporal Abstractor (HTA). HTA progressively distills non-overlapping acoustic features across multiple temporal scales, reaching a compression rate of 1.67 tokens per second (a 97% reduction) while preserving essential context. Our evaluations show that FastSLM performs comparably to state-of-the-art models on long-form benchmarks while using substantially fewer FLOPs and parameters. The source code and model checkpoints are available at https://anonymous.4open.science/r/FastSLM-8BD3.
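
To put the quoted rate in perspective, the short sketch below works out how many tokens an audio clip would occupy at 1.67 tokens per second. The abstract does not say what baseline the 97% figure is measured against; the 50 tokens per second used here is an assumption (a frame rate typical of common speech encoders), and the function names are illustrative, not taken from the FastSLM code.

```python
# Figure quoted in the abstract.
COMPRESSED_TOKENS_PER_SEC = 1.67

# ASSUMPTION: uncompressed baseline of 50 encoder frames per second.
# The abstract does not state the baseline; this value is illustrative.
BASELINE_TOKENS_PER_SEC = 50.0


def token_count(duration_sec: float, tokens_per_sec: float) -> int:
    """Approximate number of tokens for a clip of the given length."""
    return round(duration_sec * tokens_per_sec)


def reduction(baseline: float, compressed: float) -> float:
    """Fraction of tokens removed relative to the baseline rate."""
    return 1.0 - compressed / baseline


if __name__ == "__main__":
    one_hour = 3600.0
    print("baseline tokens:  ", token_count(one_hour, BASELINE_TOKENS_PER_SEC))
    print("compressed tokens:", token_count(one_hour, COMPRESSED_TOKENS_PER_SEC))
    pct = reduction(BASELINE_TOKENS_PER_SEC, COMPRESSED_TOKENS_PER_SEC) * 100
    print(f"reduction: {pct:.1f}%")
```

Under that assumed baseline, one hour of audio shrinks from 180,000 tokens to about 6,012, which is consistent with the roughly 97% reduction the abstract reports.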


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
