Beyond Compression: Quantifying Spectral Accessibility in Vision Representations
Abstract
Vision-language models use learned projection layers to map visual features into a shared embedding space, but how these transformations restructure visual information remains an open question. We investigate this by measuring spatial-frequency accessibility, operationalized as the linear recoverability of band-limited Fourier energy from model representations. To separate structural changes from plain dimensionality reduction, we propose Residual Spectral Loss (RSL), a metric that compares each transformation against a dimension-matched random projection baseline. To avoid confounds introduced by optimization, all analyses use pretrained models with frozen parameters.
Our experiments reveal consistent, frequency-dependent shifts in accessibility for both CLIP and DINOv2 on ImageNet and MS-COCO. Spectral accessibility is non-monotonic in network depth: it peaks at intermediate layers and declines toward the final output. The final transformation differs by architecture. CLIP’s learned projection is spectrally neutral, with observed changes attributable primarily to compression, whereas DINOv2’s [CLS] pooling produces a structured loss across the entire spectrum. These results identify intermediate layers and specific pooling mechanisms as the key determinants of spectral transformation in contemporary vision encoders.
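The abstract names the ingredients of RSL (band-limited Fourier energy, a linear probe, and a dimension-matched random projection baseline) but does not give a formula. The sketch below is one plausible reading, not the authors' implementation. The function names, the ridge probe, the log-energy targets, the equal-width radial bands, and the sign convention (baseline R² minus transformed R², so positive values mean more loss than compression alone would explain) are all assumptions made for illustration.

```python
import numpy as np

def band_energy(images, n_bands=4):
    """Log Fourier energy per radial frequency band. images: (N, H, W).
    Assumption: equal-width radial bands; the paper may bin differently."""
    n, h, w = images.shape
    power = np.abs(np.fft.fftshift(np.fft.fft2(images), axes=(-2, -1))) ** 2
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    edges = np.linspace(0.0, 0.5, n_bands + 1)
    edges[-1] = radius.max() + 1e-9  # fold the corner frequencies into the last band
    out = np.empty((n, n_bands))
    for b in range(n_bands):
        mask = (radius >= edges[b]) & (radius < edges[b + 1])
        out[:, b] = np.log1p(power[:, mask].sum(axis=1))
    return out

def probe_r2(x_train, y_train, x_test, y_test, ridge=1e-3):
    """Held-out R^2 of a ridge-regularized linear probe, one value per target."""
    mu, sd = x_train.mean(0), x_train.std(0) + 1e-8
    xtr, xte = (x_train - mu) / sd, (x_test - mu) / sd
    y_mu = y_train.mean(0)
    gram = xtr.T @ xtr + ridge * len(xtr) * np.eye(xtr.shape[1])
    weights = np.linalg.solve(gram, xtr.T @ (y_train - y_mu))
    pred = xte @ weights + y_mu
    ss_res = ((y_test - pred) ** 2).sum(0)
    ss_tot = ((y_test - y_test.mean(0)) ** 2).sum(0)
    return 1.0 - ss_res / ss_tot

def residual_spectral_loss(pre, post, targets, n_train, seed=0):
    """Assumed form of RSL: R^2 after a random projection of `pre` to the
    dimension of `post`, minus R^2 of `post` itself, per frequency band."""
    rng = np.random.default_rng(seed)
    d_in, d_out = pre.shape[1], post.shape[1]
    # Dimension-matched Gaussian random projection baseline.
    baseline = pre @ (rng.standard_normal((d_in, d_out)) / np.sqrt(d_out))
    y_tr, y_te = targets[:n_train], targets[n_train:]
    r2_base = probe_r2(baseline[:n_train], y_tr, baseline[n_train:], y_te)
    r2_post = probe_r2(post[:n_train], y_tr, post[n_train:], y_te)
    return r2_base - r2_post
```

In practice `pre` and `post` would be frozen-encoder features immediately before and after the transformation under study (for example CLIP's projection layer), and `targets` would come from `band_energy` applied to the corresponding grayscale images. Under this sign convention, values near zero would correspond to what the abstract calls a spectrally neutral transformation.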
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC






