VGGSounder: Audio-Visual Evaluations for Foundation Models
Original: arXiv:2508.08237v4
Announce Type: replace-cross
Abstract: As audio-visual foundation models continue to emerge, the need for robust assessment of their multi-modal comprehension has become increasingly critical. While the VGGSound dataset serves as a standard benchmark for audio-visual classification, our investigation highlights significant shortcomings within it, such as incomplete labeling, partially overlapping classes, and misaligned modalities. These issues result in skewed evaluations of both auditory and visual competencies. To overcome these challenges, we present VGGSounder, a newly re-annotated, multi-label test set that builds upon VGGSound and is tailored specifically for assessing audio-visual foundation models. VGGSounder incorporates detailed modality annotations, allowing for more accurate analysis of performance across individual modalities. Additionally, we uncover model limitations by examining performance declines when an additional input modality is introduced, utilizing our novel modality confusion metric.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC



