Concept-wise Attention for Fine-grained Concept Bottleneck Models
Abstract:
Recent advances in Concept Bottleneck Models (CBMs) have leveraged the image-text alignment capabilities of large pre-trained vision-language models, such as CLIP, to achieve strong performance. Despite these gains, concept modeling faces two primary challenges. First, current approaches are frequently hindered by pre-training biases, which result in granularity misalignment or an over-reliance on structural priors. Second, conventional fine-tuning with a Binary Cross-Entropy (BCE) loss treats each concept in isolation. This independent treatment overlooks the mutual exclusivity inherent among concepts, resulting in suboptimal alignment.
To overcome these obstacles, we introduce CoAt-CBM (Concept-wise Attention for Fine-grained Concept Bottleneck Models), a novel framework designed to deliver both adaptive fine-grained image-concept alignment and enhanced interpretability. CoAt-CBM uses learnable concept-wise visual queries to dynamically generate fine-grained, concept-specific visual embeddings. These embeddings are then used to construct a concept score vector. Furthermore, we propose a novel concept contrastive optimization strategy that directs the model to manage the relative significance of these concept scores. This approach ensures that concept predictions accurately mirror the image content, thereby improving alignment. Comprehensive experiments confirm that CoAt-CBM consistently surpasses state-of-the-art methods. The source code will be released upon acceptance of this work.
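The abstract does not give the architecture or the loss in detail, so the following is only a minimal sketch of the general idea under stated assumptions, not the CoAt-CBM implementation. It assumes that each concept has one query vector that attends over image patch features (single-head dot-product attention), that the concept score is the cosine similarity between the pooled visual embedding and a concept text embedding, and that the "contrastive" objective is a softmax over concepts. The names `concept_scores`, `bce_loss`, `contrastive_loss`, and `temperature` are hypothetical and chosen for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-8)


def concept_scores(patches, queries, concept_text):
    """Build a concept score vector (illustrative assumption, not the paper's code).

    patches:      N patch features, each of dimension d
    queries:      K concept-wise queries (learnable in a real model), dimension d
    concept_text: K concept text embeddings, dimension d
    """
    d = len(patches[0])
    scores = []
    for q, t in zip(queries, concept_text):
        # Attention of this concept's query over all image patches.
        attn = softmax([dot(q, p) / math.sqrt(d) for p in patches])
        # Concept-specific visual embedding: attention-weighted sum of patches.
        emb = [sum(w * p[j] for w, p in zip(attn, patches)) for j in range(d)]
        scores.append(cosine(emb, t))
    return scores


def bce_loss(scores, labels):
    """Per-concept BCE: each term depends on one score only."""
    total = 0.0
    for s, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-s))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(scores)


def contrastive_loss(scores, labels, temperature=0.1):
    """Softmax over concepts: scores compete, so only relative values matter.

    Assumes at least one positive label. This is one possible contrastive
    form, chosen for illustration.
    """
    probs = softmax([s / temperature for s in scores])
    positive_mass = sum(p for p, y in zip(probs, labels) if y == 1)
    return -math.log(positive_mass)


# Toy usage: two patches, two concepts, concept 0 present in the image.
patches = [[1.0, 0.0], [0.0, 1.0]]
queries = [[5.0, 0.0], [0.0, 5.0]]
concept_text = [[1.0, 0.0], [0.0, 1.0]]
scores = concept_scores(patches, queries, concept_text)
print(scores, bce_loss(scores, [1, 0]), contrastive_loss(scores, [1, 0]))
```

In this sketch the BCE term for one concept is unaffected by the scores of the others, whereas the softmax-based loss changes whenever any competing concept score changes, which is one way to read the abstract's point about the relative significance of concept scores.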
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





