VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio
Abstract:
General-purpose audio representations should map acoustically variable instances of the same event to nearby points, so that content identity can be resolved zero-shot. In contrast to supervised classification benchmarks, which assess adaptability through parameter updates, we present VocSim, a training-free benchmark that examines the intrinsic geometric alignment of frozen embeddings. The benchmark requires no parameter updates and no labeled data; its only fitted component is a label-free PCA whitening step, fitted per subset, that corrects for anisotropy.
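To make the whitening step concrete, the following is a minimal sketch of label-free PCA whitening with NumPy. It is an illustration under stated assumptions, not the benchmark's released code: the function name `pca_whiten`, the `eps` regularizer, the choice to keep all components, and the toy data are all hypothetical.

```python
import numpy as np

def pca_whiten(X, eps=1e-8):
    """Center X, rotate it onto its principal axes, and rescale each axis to unit variance.

    No labels are used: the transform is fitted from the embeddings alone.
    `eps` is an assumed regularizer that avoids division by zero.
    """
    X = np.asarray(X, dtype=np.float64)
    Xc = X - X.mean(axis=0, keepdims=True)
    # Sample covariance of the centered features, shape (d, d).
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    # eigh is appropriate because the covariance matrix is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Order components by decreasing variance.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale each principal direction by 1 / sqrt(variance).
    W = eigvecs / np.sqrt(eigvals + eps)
    return Xc @ W

# Toy anisotropic "embeddings": a few axes dominate the variance.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) * np.array([10.0, 3.0, 1.0, 0.3])
Z = pca_whiten(X)
```

After whitening, the covariance of `Z` is approximately the identity, so no single high-variance direction dominates cosine or Euclidean distances. In a per-subset setting, the function would be fitted and applied separately to each subset's embeddings.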
VocSim consolidates 125,000 single-source audio clips drawn from 19 distinct corpora, covering human speech, animal vocalizations, and environmental sounds. The benchmark explicitly isolates content representation from source separation tasks, excluding polyphonic mixtures from its scope. We assess embedding quality using Precision@k to measure local purity and the Global Separation Rate (GSR) to evaluate point-wise class separation, with GSR values calibrated against an empirical permutation baseline to determine lift.
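As an illustration of the local-purity metric and of calibration against a permutation baseline, the sketch below computes Precision@k and its lift over shuffled labels. Several choices here are assumptions rather than details from the abstract: cosine similarity as the neighbor metric, lift defined as a difference, the number of permutations, and all function names. The abstract does not define GSR, so the permutation idea is applied to Precision@k purely as a stand-in.

```python
import numpy as np

def precision_at_k(emb, labels, k=5):
    """Mean fraction of each item's k nearest neighbors that share its label."""
    emb = np.asarray(emb, dtype=np.float64)
    labels = np.asarray(labels)
    # Cosine similarity (an assumed choice of metric).
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # an item is never its own neighbor
    neighbors = np.argsort(-sim, axis=1)[:, :k]
    hits = labels[neighbors] == labels[:, None]
    return float(hits.mean())

def permutation_baseline(emb, labels, k=5, n_perm=50, seed=0):
    """Average Precision@k after randomly shuffling the labels."""
    rng = np.random.default_rng(seed)
    scores = [precision_at_k(emb, rng.permutation(labels), k) for _ in range(n_perm)]
    return float(np.mean(scores))

# Toy data: three well-separated classes with 20 items each.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 20)
emb = (np.eye(3) * 10.0)[labels] + rng.normal(scale=0.5, size=(60, 3))

p_at_k = precision_at_k(emb, labels, k=5)
baseline = permutation_baseline(emb, labels, k=5)
lift = p_at_k - baseline  # assumed definition: observed score minus chance level
```

Shuffling labels keeps the class sizes and the embedding geometry fixed while breaking any link between them, so the baseline reflects what chance alone would yield on that particular subset.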
A straightforward pipeline comprising frozen Whisper features, time-frequency pooling, and label-free PCA demonstrates robust zero-shot performance, maintaining stable GSR rankings across various domains (Kendall's tau = 0.60). However, performance on blind, low-resource speech datasets (specifically Shipibo-Conibo and Chintang) reveals a collapse in local retrieval capabilities, though results remain above chance levels, highlighting a cross-lingual speech generalization gap. As external validation, our top-performing embeddings accurately predict avian perceptual similarity, enhance bioacoustic classification, and achieve state-of-the-art results on the HEAR benchmark. We publicly release the associated data, code, and leaderboard.
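To help interpret the rank-stability figure, here is a small sketch of Kendall's tau between two rankings. The per-model scores are hypothetical toy values, chosen so that the result is 0.6, and the function implements the tie-free tau-a variant, which may differ from the variant used in the paper.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs. Assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1  # both lists order this pair the same way
        elif sign < 0:
            discordant += 1  # the lists disagree on this pair
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores for five models on two domains.
scores_domain_a = [0.9, 0.7, 0.5, 0.3, 0.1]
scores_domain_b = [0.8, 0.4, 0.6, 0.1, 0.2]
tau = kendall_tau(scores_domain_a, scores_domain_b)
```

In this toy case 8 of the 10 model pairs are ordered the same way in both domains and 2 are swapped, giving tau = (8 - 2) / 10 = 0.6. More generally, without ties, a tau of 0.60 corresponds to 80% of pairs being ranked consistently.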
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




