arXiv

$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

June 3, 2026 · Sreehari Rammohan, Huy Ha, Carl Vondrick

Title: $A^2$: Compact Self-Supervised Vision Transformers Exhibit Superior Localization Capabilities Compared to Larger Models

Abstract:

Effective visual classification typically hinges on the ability to pinpoint primary foreground elements within an image while disregarding misleading contextual noise. Counterintuitively, our investigation reveals that self-supervised Vision Transformers (ViTs) with fewer parameters generate attention maps that localize foreground objects more accurately than their larger counterparts. Nevertheless, large ViTs remain indispensable due to their capacity to derive more nuanced representations from individual image patches.

To combine the advantages of precise localization and robust feature extraction, we introduce $A^2$, a straightforward approach that capitalizes on this inverse scaling phenomenon. The method separates the tasks of "where to look" and "what to extract" by employing a small attention model to identify regions of interest and a larger embedding model to analyze them. Specifically, $A^2$ crops areas surrounding the attention peaks detected by the smaller model and processes these crops using the larger model.
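The cropping step can be illustrated with a minimal sketch. It assumes the small model's attention has already been reduced to a 2D patch-level map and handles only a single peak; the function name `crop_around_attention_peak`, the `crop_size` parameter, and the clamping behavior at image borders are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def crop_around_attention_peak(image, attn, crop_size):
    """Crop a square window centred on the strongest attention location.

    image:     (H, W, C) array of pixels.
    attn:      (h, w) patch-level attention map from the small model (assumed input).
    crop_size: side length of the crop in pixels (hypothetical parameter).
    """
    H, W = image.shape[:2]
    h, w = attn.shape
    # Row and column of the attention peak on the patch grid.
    r, c = np.unravel_index(np.argmax(attn), attn.shape)
    # Map the patch index to the pixel at the centre of that patch.
    cy = int((r + 0.5) * H / h)
    cx = int((c + 0.5) * W / w)
    # Shrink the window if needed and clamp it so it stays inside the image.
    size = min(crop_size, H, W)
    top = min(max(cy - size // 2, 0), H - size)
    left = min(max(cx - size // 2, 0), W - size)
    return image[top:top + size, left:left + size], (top, left)
```

In a full pipeline, the returned crop would then be resized and passed to the larger embedding model; that step is omitted here because it depends on the specific pretrained backbone.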

This technique uses only pretrained features, requires no group labels, and needs no per-dataset training of either the attention model or the backbone. Across five benchmarks, $A^2$ is competitive with methods such as DFR, which match backbones at the loss level, and outperforms end-to-end attention training approaches, particularly under more challenging distribution shifts.
