Title: Audio Interaction Model
Abstract:
Although audio is fundamentally an interactive medium, current Large Audio Language Models (LALMs) operate offline. Furthermore, existing streaming audio systems are typically limited to a single function, such as streaming automatic speech recognition (ASR) or voice-based conversation. We propose unifying these capabilities into a single online LALM that runs an always-on "perceive-decide-respond" cycle. The model listens continuously to real-time audio, environmental cues, and user instructions, allowing it to react instantly.
We define this operational regime as the Audio Interaction Model and implement it through Audio-Interaction, a unified streaming architecture. This system maintains the performance of offline task execution while introducing online general audio instruction following. It supports interactions ranging from dialogue to full voice chatting, and it decides when to respond by analyzing the semantic content of the audio stream.
To support this framework, we introduce SoundFlow, an end-to-end solution for the perceive-decide-respond loop. SoundFlow facilitates data preparation, training, and deployment through three key components: streaming-native data construction, comprehension-aware training methodologies, and asynchronous low-latency inference to ensure stable, real-time interaction.
Additionally, we have created StreamAudio-2M, a streaming dataset comprising 2.6 million items that cover seven core abilities and 28 sub-tasks. We also developed Proactive-Sound-Bench to assess proactive audio intervention. Our evaluations across eight benchmarks demonstrate that Audio-Interaction delivers competitive results on standard audio tasks. Crucially, it unlocks functionalities unavailable to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive assistance.
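The perceive-decide-respond cycle described above can be illustrated with a minimal sketch. This is not the Audio-Interaction or SoundFlow implementation: the names `StreamState`, `interaction_loop`, `ends_with_question`, and `echo` are hypothetical, and plain strings stand in for audio chunks so that the control flow is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class StreamState:
    """Rolling context accumulated from the stream (hypothetical structure)."""
    chunks: List[str] = field(default_factory=list)


def interaction_loop(
    stream: Iterable[str],
    should_respond: Callable[[StreamState], bool],
    respond: Callable[[StreamState], str],
) -> List[Optional[str]]:
    """Run one perceive-decide-respond step per incoming chunk.

    Returns one entry per chunk: a response string, or None when the
    loop decided to keep listening.
    """
    state = StreamState()
    outputs: List[Optional[str]] = []
    for chunk in stream:
        state.chunks.append(chunk)          # perceive: extend the context
        if should_respond(state):           # decide: is now the moment to answer?
            outputs.append(respond(state))  # respond: emit an answer
            state.chunks.clear()            # start a fresh context afterwards
        else:
            outputs.append(None)            # stay silent and keep listening
    return outputs


# Toy stand-ins (assumptions): a real system would make this decision from
# the semantic content of the audio rather than from punctuation.
def ends_with_question(state: StreamState) -> bool:
    return state.chunks[-1].endswith("?")


def echo(state: StreamState) -> str:
    return "heard: " + " ".join(state.chunks)


result = interaction_loop(
    ["what time", "is it?", "door slams"], ends_with_question, echo
)
print(result)
```

In this toy run the loop stays silent on the first and third chunks and answers only after the second, which mirrors the idea of choosing when to respond rather than replying to every input.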
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC






