arXiv

Audio Interaction Model

Abstract:

Although audio is fundamentally an interactive medium, current Large Audio Language Models (LALMs) operate offline. Existing streaming audio systems, meanwhile, are typically limited to a single function, such as streaming automatic speech recognition (ASR) or voice-based conversation. We propose unifying these capabilities in a single online LALM that runs an always-on "perceive-decide-respond" cycle: the model listens to real-time audio, environmental cues, and user instructions, and reacts as events occur.

We define this operational regime as the Audio Interaction Model and implement it through Audio-Interaction, a unified streaming architecture. This system maintains the performance of offline task execution while introducing online general audio instruction following. It supports a range of interactions, from dialogue to full voice chatting, determining the optimal moment to respond by analyzing the semantic content of the audio stream.
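As an illustration only, the perceive-decide-respond cycle described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the Audio-Interaction implementation: the names `StreamState`, `perceive`, `decide`, `respond`, and `run_loop` are hypothetical, audio chunks are stood in for by text strings, and the "respond when a question ends" rule is a placeholder for whatever semantic decision the model actually learns.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class StreamState:
    """Rolling context accumulated from the stream (hypothetical structure)."""
    chunks: List[str] = field(default_factory=list)


def perceive(state: StreamState, chunk: str) -> None:
    # Append the newest chunk; a real system would encode audio features here.
    state.chunks.append(chunk)


def decide(state: StreamState) -> bool:
    # Toy policy (an assumption): respond only when the latest chunk ends a question.
    return bool(state.chunks) and state.chunks[-1].rstrip().endswith("?")


def respond(state: StreamState) -> str:
    # Placeholder reply; a real model would generate text or speech.
    reply = f"[reply to: {' '.join(state.chunks)}]"
    state.chunks.clear()  # start a fresh context after responding
    return reply


def run_loop(stream: Iterable[str]) -> List[Optional[str]]:
    """Run one perceive-decide-respond step per incoming chunk."""
    state = StreamState()
    outputs: List[Optional[str]] = []
    for chunk in stream:
        perceive(state, chunk)
        # None means the system chose to stay silent for this chunk.
        outputs.append(respond(state) if decide(state) else None)
    return outputs
```

Calling `run_loop(["what time", "is it?", "thanks"])` stays silent on the first and last chunks and replies only after the second, which is the behaviour the abstract attributes to choosing the moment to respond from the content of the stream.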

To support this framework, we introduce SoundFlow, an end-to-end solution for the perceive-decide-respond loop. SoundFlow facilitates data preparation, training, and deployment through three key components: streaming-native data construction, comprehension-aware training methodologies, and asynchronous low-latency inference to ensure stable, real-time interaction.
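The abstract does not say how the asynchronous inference is built. One common way to keep listening from blocking on response generation is a producer-consumer queue, sketched here with Python's standard `asyncio`. Everything in it is an assumption made for illustration: the `listener` and `responder` coroutines, the text stand-ins for audio chunks, and the question-mark trigger are hypothetical and are not taken from SoundFlow.

```python
import asyncio
from typing import List


async def listener(chunks: List[str], queue: asyncio.Queue) -> None:
    # Perception side: push chunks as they "arrive", never waiting on replies.
    for chunk in chunks:
        await queue.put(chunk)
        await asyncio.sleep(0)  # yield control, simulating real-time arrival
    await queue.put(None)  # sentinel: the stream has ended


async def responder(queue: asyncio.Queue, replies: List[str]) -> None:
    # Response side: consume chunks independently of the listener.
    while True:
        chunk = await queue.get()
        if chunk is None:
            break
        if chunk.endswith("?"):  # toy trigger (assumption), not the paper's policy
            replies.append(f"answer to {chunk!r}")


async def main(chunks: List[str]) -> List[str]:
    queue: asyncio.Queue = asyncio.Queue()
    replies: List[str] = []
    # Run both sides concurrently so input handling and output overlap.
    await asyncio.gather(listener(chunks, queue), responder(queue, replies))
    return replies


result = asyncio.run(main(["hello", "is it raining?", "ok"]))
```

The queue decouples the two sides: the listener keeps accepting input while the responder works through earlier chunks, which is the property an always-on system needs regardless of how the actual model is served.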

Additionally, we have created StreamAudio-2M, a streaming dataset comprising 2.6 million items that cover seven core abilities and 28 sub-tasks. We also developed Proactive-Sound-Bench to assess proactive audio intervention. Our evaluations across eight benchmarks demonstrate that Audio-Interaction delivers competitive results on standard audio tasks. Crucially, it unlocks functionalities unavailable to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive assistance.


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
