Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models
Title: Repairable Arbitration Reversals in Audio-Language Models: Moving Beyond Text Following
Abstract:
Audio-language models (ALMs) frequently prioritize textual input over audio cues, even when the auditory evidence is unambiguous. This phenomenon prompts a fundamental inquiry: is the audio-supported response genuinely absent from the model’s internal representation, or does it exist but get suppressed by conflicting text? To investigate this, we employ a same-audio counterfactual approach, wherein the audio remains constant while the conflicting text is removed, allowing us to measure shifts in model preference.
Across five ALMs and four conflict-based tasks, 64.1% of conflicting samples exhibit a sign flip: the same-audio branch favors the audio-supported answer, while the joint branch (with both modalities present) favors the text-supported answer. This pattern indicates that the audio evidence is encoded but is outvoted during arbitration. Activation patching localizes the reversal to the computation of answer-position scores, and patching effects correlate strongly (Spearman's rho = 0.93) with differences in output candidate scores.
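To make the sign-flip criterion concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes each sample is summarized by a preference margin per branch, defined here as the score of the audio-supported answer minus the score of the text-supported answer. The function names and the toy margins are hypothetical.

```python
from typing import Sequence


def is_sign_flip(same_audio_margin: float, joint_margin: float) -> bool:
    """Return True when the two branches disagree in the direction described above.

    Assumption: margin = score(audio-supported answer) - score(text-supported answer).
    A flip means the same-audio branch prefers the audio-supported answer
    (margin > 0) while the joint branch prefers the text-supported answer
    (margin < 0).
    """
    return same_audio_margin > 0 and joint_margin < 0


def sign_flip_rate(
    same_audio_margins: Sequence[float], joint_margins: Sequence[float]
) -> float:
    """Fraction of samples whose margin changes sign from positive to negative."""
    if len(same_audio_margins) != len(joint_margins):
        raise ValueError("both branches must cover the same samples")
    if not same_audio_margins:
        raise ValueError("at least one sample is required")
    flips = sum(
        is_sign_flip(s, j) for s, j in zip(same_audio_margins, joint_margins)
    )
    return flips / len(same_audio_margins)


# Toy margins for four conflicting samples (illustrative values only).
same_audio = [1.2, 0.4, -0.3, 2.0]
joint = [-0.5, 0.1, -0.9, -1.1]
rate = sign_flip_rate(same_audio, joint)  # samples 1 and 4 flip
```

In this toy data, two of the four samples flip, so the rate is 0.5. The reported 64.1% would be the analogous fraction computed over the paper's conflicting samples.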
Leveraging these insights, we introduce Gated Audio Counterfactual Logit Correction (GACL), a decoding mechanism that requires no additional training. GACL functions by interpolating between joint and same-audio scores. Evaluated under a strict budget allowing for only a 5 percentage-point drop in faithfulness, GACL enhances nAUC by 17.8 points compared to the top contrastive baseline. Furthermore, the method demonstrates strong transferability to vision-text arbitration tasks without retuning, achieving improvements of up to +40.5 percentage points.
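The abstract states only that GACL interpolates between joint and same-audio scores. The sketch below illustrates one way such a gated interpolation could look. The gating rule (open the gate only when the two branches disagree on the top candidate) and the mixing weight `alpha` are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest-scoring candidate (first one on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def gated_interpolation(
    joint_scores: Sequence[float],
    same_audio_scores: Sequence[float],
    alpha: float = 0.6,
) -> List[float]:
    """Blend joint and same-audio candidate scores behind a simple gate.

    Hypothetical gate: the correction is applied only when the two branches
    disagree on the top candidate; otherwise the joint scores are returned
    unchanged. `alpha` in [0, 1] is an assumed mixing weight.
    """
    if len(joint_scores) != len(same_audio_scores):
        raise ValueError("both branches must score the same candidates")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    gate_open = argmax(joint_scores) != argmax(same_audio_scores)
    weight = alpha if gate_open else 0.0
    return [
        (1.0 - weight) * j + weight * a
        for j, a in zip(joint_scores, same_audio_scores)
    ]


# Candidate 0 is audio-supported, candidate 1 is text-supported (toy values).
joint = [0.2, 1.0]       # joint branch prefers the text-supported answer
same_audio = [1.5, 0.1]  # same-audio branch prefers the audio-supported answer
corrected = gated_interpolation(joint, same_audio, alpha=0.6)
```

With these toy scores the gate opens and the corrected scores favor candidate 0, the audio-supported answer. When both branches already agree, the joint scores pass through untouched, which is one plausible way to limit the faithfulness cost on non-conflicting inputs.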
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




