Representation Matters in Randomized Smoothing for Audio Classification
Title: The Critical Role of Representation in Randomized Smoothing for Audio Tasks
Abstract: Randomized smoothing (RS) certifies robustness in the vector space where Gaussian noise is added. In audio classification, however, that space is often ill-defined, because standard pipelines normalize waveforms, control their range, and convert them into spectral features such as log-mel spectrograms. We show that applying RS directly is ambiguous unless both the object being certified and the preprocessing protocol are stated explicitly. In experiments on two audio benchmarks (keyword spotting and environmental-sound classification), we analyze smoothing applied to waveforms, to feature spaces, and to post-processed data. The results show that certified radii must be reported together with the representation in which they hold. For instance, at the same smoothing level $\sigma=0.0025$, the two datasets have the same median raw radius of $0.007996$, yet their differing waveform energies yield distinct SNR-equivalent scales of $83.98$ dB and $90.97$ dB, respectively. Smoothing in the log-mel domain yields higher certified accuracy at positive radii on environmental sounds ($68.42\%$ compared to $65.53\%$), so more examples are certified with nonzero radii, but the certificate then holds over features rather than raw waveforms. In addition, operations such as clipping or peak normalization change the effective perturbation norm by a factor of approximately $230$ to $351$. We therefore recommend that studies applying RS to audio classification clearly define and report the task-specific certified object and perturbation model: the perturbation location, the gain policy, the raw radius, and any changes to the geometry after noise is added.
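To make the raw-radius versus SNR-equivalent distinction concrete, the following is a minimal Python sketch under stated assumptions. The function `certified_radius` uses the standard RS $\ell_2$ certificate $R = \sigma\,\Phi^{-1}(\underline{p_A})$. The function `snr_equivalent_db` assumes the scale is defined as $20\log_{10}(\lVert x\rVert_2 / R)$; the abstract does not give its exact definition, so this is an illustrative guess. The function names and the two unit-RMS clip lengths are hypothetical and are not taken from the paper.

```python
import math
from statistics import NormalDist


def certified_radius(sigma: float, p_lower: float) -> float:
    """Standard RS L2 radius: sigma * Phi^{-1}(p_lower).

    p_lower is a lower confidence bound on the top-class probability under
    Gaussian noise. A value of 0.5 or below means abstain (radius 0.0).
    """
    if not 0.0 <= p_lower < 1.0:
        raise ValueError("p_lower must lie in [0, 1)")
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)


def snr_equivalent_db(waveform_l2_norm: float, radius: float) -> float:
    """ASSUMED definition (not given in the abstract): 20 * log10(||x||_2 / R)."""
    if waveform_l2_norm <= 0.0 or radius <= 0.0:
        raise ValueError("norm and radius must be positive")
    return 20.0 * math.log10(waveform_l2_norm / radius)


# Hypothetical clips with unit RMS, so ||x||_2 = sqrt(number of samples).
raw_radius = 0.007996  # median raw radius quoted in the abstract
short_clip_db = snr_equivalent_db(math.sqrt(16_000), raw_radius)
long_clip_db = snr_equivalent_db(math.sqrt(80_000), raw_radius)
print(f"{short_clip_db:.2f} dB vs {long_clip_db:.2f} dB")
```

Under these assumptions the same raw radius maps to scales roughly 7 dB apart, because the longer clip carries five times the energy. This illustrates why a raw radius alone does not determine how large a certified perturbation is relative to the signal.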
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC






