AnyAudio-Judge: A Dynamic Rubric-Based Benchmark and Evaluator for Audio Instruction Following
Abstract:
Rapid progress in instruction-guided audio generation has created a pressing need for robust alignment evaluation. Existing automated assessment methods depend predominantly on holistic scores produced by general-purpose large language models. These approaches often fail to disentangle complex instructions, offer little interpretability, and cannot detect fine-grained attribute discrepancies. To address these limitations, we present a novel evaluation paradigm grounded in dynamic rubrics. This method adaptively decomposes an intricate audio caption into a variable number of independent, verifiable binary rubric items.
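As a rough illustration of the rubric idea described above, the following sketch represents a caption as a list of binary items and aggregates them into a single score. The class name, the helper functions, the example items, and the mean-pass-rate aggregation are all assumptions made for illustration; the abstract does not specify how rubric items are represented or combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One independently checkable claim derived from an audio caption."""
    description: str
    passed: bool  # binary verdict for the audio clip under evaluation


def rubric_score(items):
    """Fraction of rubric items satisfied (assumed aggregation rule)."""
    if not items:
        raise ValueError("a rubric needs at least one item")
    return sum(item.passed for item in items) / len(items)


def failed_items(items):
    """Descriptions of the items the audio did not satisfy."""
    return [item.description for item in items if not item.passed]


# Hypothetical rubric for a caption such as
# "a calm female voice speaking Mandarin with rain in the background".
example = [
    RubricItem("a female voice is speaking", True),
    RubricItem("the speech is in Mandarin", True),
    RubricItem("rain is audible in the background", False),
    RubricItem("the speaker sounds calm", True),
]

print(rubric_score(example))
print(failed_items(example))
```

Because each item carries its own verdict, the list of failed items shows which attribute was missed, which a single holistic score cannot do.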
To rigorously evaluate this capability, we introduce AnyAudio-Judge Bench, a comprehensive bilingual benchmark consisting of 7,920 carefully curated samples. These samples span four distinct audio domains—speech, sound, music, and mixed content—and include deliberately constructed hard negatives. Additionally, we have assembled a large-scale dataset of 105,000 samples, each accompanied by explicit Chain-of-Thought (CoT) rationales, to train our specialized evaluator, the AnyAudio-Judge model.
By utilizing a training framework that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), our model effectively aligns its reasoning processes with the rubric-based scoring system. Extensive experimental results indicate that AnyAudio-Judge not only markedly improves zero-shot alignment detection over current state-of-the-art baselines but also delivers precise and interpretable reward signals. These signals significantly enhance instruction alignment in downstream reinforcement learning tasks for audio generation.
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



