Spatially Grounded Concept Bottleneck Models via Part-Factorized Attention
Abstract
Concept bottleneck models (CBMs) make decisions easier to audit by predicting a layer of human-interpretable attributes before the final class prediction. In fine-grained recognition, however, standard concept heads are unconstrained in where they attend, so a head designated for one body region can rely on evidence from a different region. To address this, we introduce a part-factorized CBM that enforces spatial grounding by design.
Our approach builds on a frozen DINOv3 vision transformer and comprises three components. First, a learned foreground gate, trained on DINOv3 patch features, suppresses background patches in the part attention. Second, a set of part queries cross-attends to the patch features, and each of the 312 CUB attributes is routed through a fixed concept-to-part mapping so that it reads only the part token implied by its name. Third, to break the permutation symmetry among part queries, we add a learnable two-dimensional Gaussian prior to the attention logits in log space. The prior means are initialized from the dataset-average keypoint location of each part, so no per-image keypoint supervision is needed at training or test time.
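The second and third components can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the isotropic Gaussian over normalized patch-centre coordinates, the use of the foreground gate as an additive log term, the linear concept readout, and the names `gaussian_log_prior`, `part_attention`, and `concept_logits` are all assumptions made for illustration.

```python
import math

def gaussian_log_prior(grid_h, grid_w, mean, sigma):
    """Log of an unnormalized isotropic 2D Gaussian at each patch centre.

    Coordinates are normalized to [0, 1]; `mean` is (x, y).
    Assumption: isotropic with a single sigma.
    """
    mx, my = mean
    values = []
    for r in range(grid_h):
        for c in range(grid_w):
            x = (c + 0.5) / grid_w
            y = (r + 0.5) / grid_h
            values.append(-((x - mx) ** 2 + (y - my) ** 2) / (2.0 * sigma ** 2))
    return values

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def part_attention(query, keys, log_prior, fg_gate, eps=1e-6):
    """Attention of one part query over all patches.

    The Gaussian prior is added to the logits (i.e. in log space).
    Assumption: the foreground gate in [0, 1] also enters as log(gate + eps).
    """
    scale = math.sqrt(len(query))
    logits = []
    for key, lp, gate in zip(keys, log_prior, fg_gate):
        score = sum(q * k for q, k in zip(query, key)) / scale
        logits.append(score + lp + math.log(gate + eps))
    return softmax(logits)

def concept_logits(part_tokens, concept_to_part, readout):
    """Each concept reads only the part token assigned by the fixed mapping.

    Assumption: a linear readout per concept.
    """
    return {
        concept: sum(w * t for w, t in zip(readout[concept], part_tokens[part]))
        for concept, part in concept_to_part.items()
    }

# Toy example on a 4x4 patch grid with uninformative (all-zero) patch features,
# so the attention pattern is determined by the prior and the gate alone.
keys = [[0.0, 0.0] for _ in range(16)]
log_prior = gaussian_log_prior(4, 4, mean=(0.125, 0.125), sigma=0.2)
weights = part_attention([1.0, 0.0], keys, log_prior, [1.0] * 16)
peak = max(range(16), key=weights.__getitem__)

# Gating out the top-left patch moves the attention to a neighbouring patch.
gated = part_attention([1.0, 0.0], keys, log_prior, [0.0] + [1.0] * 15)
gated_peak = max(range(16), key=gated.__getitem__)

# Illustrative routing: the part names and two-dimensional tokens are made up.
tokens = {"beak": [1.0, 2.0], "tail": [3.0, 4.0]}
mapping = {"has_bill_shape::hooked": "beak", "has_tail_shape::forked_tail": "tail"}
readout = {"has_bill_shape::hooked": [1.0, 0.0], "has_tail_shape::forked_tail": [0.0, 1.0]}
routed = concept_logits(tokens, mapping, readout)
```

Because the patch features in the toy example carry no information, the attention peak follows the prior mean, which is how such a prior could break the symmetry between otherwise interchangeable part queries.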
Experiments on the CUB-200-2011 dataset demonstrate that our spatial-prior model achieves performance comparable to a fully supervised baseline, reaching 88.85% top-1 accuracy against the baseline’s 88.95%. Notably, this comes with a 16-point increase in pointing accuracy (52.6% versus 36.4%). By substituting bounding-box supervision with a PCA-based foreground target and integrating the Gaussian prior, we eliminate the need for per-image supervision entirely, achieving 88.6% top-1 accuracy with approximately 70% pointing accuracy. Furthermore, a sweep of keypoint fractions reveals that initializing the prior with just 0.5% of the training set (roughly 27 images) incurs no measurable performance loss. In contrast, removing part identity without any spatial prior causes pointing accuracy to plummet to just 2.9%, highlighting the necessity of spatial constraints.
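The keypoint-fraction sweep relies on estimating the prior means from a small random subset of annotated images. A minimal sketch of that averaging step follows; the function name `init_prior_means`, the dictionary-based annotation format, and the uniform random sampling are assumptions for illustration and are not taken from the paper.

```python
import random

def init_prior_means(keypoints, fraction, seed=0):
    """Average normalized keypoint locations per part over a random image subset.

    keypoints: one dict per image mapping part name -> (x, y) in [0, 1];
               parts that are not visible are simply omitted.
    fraction:  share of images used to estimate the means.
    """
    rng = random.Random(seed)
    n = max(1, round(fraction * len(keypoints)))
    subset = rng.sample(keypoints, n)

    sums, counts = {}, {}
    for image in subset:
        for part, (x, y) in image.items():
            sx, sy = sums.get(part, (0.0, 0.0))
            sums[part] = (sx + x, sy + y)
            counts[part] = counts.get(part, 0) + 1
    return {part: (sx / counts[part], sy / counts[part])
            for part, (sx, sy) in sums.items()}

# Toy annotations for two images (values are made up).
toy = [
    {"beak": (0.2, 0.3)},
    {"beak": (0.4, 0.5), "tail": (0.8, 0.6)},
]
means = init_prior_means(toy, fraction=1.0)
```

The resulting per-part means would only seed the Gaussian prior; since the prior is learnable, training can adjust them afterwards.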
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




