arXiv

$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

Title: $A^2$: Compact Self-Supervised Vision Transformers Exhibit Superior Localization Capabilities Compared to Larger Models

Abstract:

Effective visual classification typically hinges on the ability to pinpoint primary foreground elements within an image while disregarding misleading contextual noise. Counterintuitively, our investigation reveals that self-supervised Vision Transformers (ViTs) with fewer parameters generate attention maps that localize foreground objects more accurately than their larger counterparts. Nevertheless, large ViTs remain indispensable due to their capacity to derive more nuanced representations from individual image patches.

To combine the advantages of precise localization and robust feature extraction, we introduce $A^2$, a straightforward approach that capitalizes on this inverse scaling phenomenon. The method separates the tasks of "where to look" and "what to extract" by employing a small attention model to identify regions of interest and a larger embedding model to analyze them. Specifically, $A^2$ crops areas surrounding the attention peaks detected by the smaller model and processes these crops using the larger model.
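The "where to look" / "what to extract" split described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `crop_around_attention_peak` and `a2_style_embed`, the single-peak simplification, the fixed square `crop_size`, and the two callables `small_attention_fn` and `large_embed_fn` are all assumptions introduced here. The abstract does not specify how peaks are selected, how large the crops are, or how multiple crops are combined.

```python
import numpy as np


def crop_around_attention_peak(image, attn, crop_size):
    """Crop a square window centered on the strongest attention patch.

    image: (H, W, C) array.
    attn: (h, w) patch-level attention map (hypothetically from a small ViT).
    crop_size: side length in pixels; assumed to be <= min(H, W).
    """
    H, W = image.shape[:2]
    h, w = attn.shape
    # Row/column of the strongest patch in the attention grid.
    py, px = np.unravel_index(np.argmax(attn), attn.shape)
    # Map the patch index to the pixel at the center of that patch.
    cy = int((py + 0.5) * H / h)
    cx = int((px + 0.5) * W / w)
    # Clamp the window so it stays fully inside the image.
    top = min(max(cy - crop_size // 2, 0), H - crop_size)
    left = min(max(cx - crop_size // 2, 0), W - crop_size)
    return image[top:top + crop_size, left:left + crop_size]


def a2_style_embed(image, small_attention_fn, large_embed_fn, crop_size):
    """Locate with one model, embed with another (both are placeholders)."""
    attn = small_attention_fn(image)   # "where to look"
    crop = crop_around_attention_peak(image, attn, crop_size)
    return large_embed_fn(crop)        # "what to extract"
```

In practice `small_attention_fn` would wrap a small pretrained ViT and `large_embed_fn` a larger one, with the crop resized to the larger model's input resolution. Those details are left out because the abstract does not state them.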

This technique relies exclusively on pretrained features, requires no group labels, and needs no per-dataset training of either the attention mechanism or the backbone architecture. Evaluated across five distinct benchmarks, $A^2$ performs competitively against methods like DFR, which match backbones at the loss level, and surpasses end-to-end attention training approaches, particularly under more challenging distribution shifts.


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
