arXiv

SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models

June 2, 2026 · Olaf Dünkel, Basavaraj Sunagad, Haoran Wang, David T. Hoffmann, Christian Theobalt, Adam Kortylewski

Title: SOCO: Evaluating Semantic Object Correspondence in Vision Foundation Models

Abstract:

Assessing the structured object comprehension of vision foundation models is currently hindered by a lack of standardized evaluation protocols and insufficient part-level supervision. Semantic correspondence (SC) serves as a metric for this capability, testing a model’s ability to align specific object parts across different instances and categories, even amidst significant changes in appearance, geometry, and viewpoint. To facilitate a rigorous and systematic SC evaluation, we present SOCO, a novel benchmark designed for Semantic Object Correspondence. SOCO establishes a taxonomy of correspondence types and delivers consistent, functionally relevant keypoint annotations spanning 100 categories and more than 1 million correspondence pairs. Furthermore, the dataset incorporates language descriptions for keypoints, allowing for the assessment of large vision-language models (LVLMs) and their ability to understand fine-grained object parts.

Our comprehensive experiments yield three primary insights: (i) while vision foundation backbones capture robust semantic structures, they struggle to transfer correspondence knowledge across related categories and only partially grasp the positional relationships of object parts; (ii) LVLMs demonstrate superior proficiency in localizing parts via text prompts compared to matching visual references across images, highlighting a disconnect between language-grounded localization and detailed visual correspondence; and (iii) performance in correspondence tasks is a stronger predictor of success in dense downstream applications—such as segmentation, tracking, 3D pose estimation, and 3D detection—than traditional ImageNet classification metrics. Collectively, these results establish SOCO as a vital benchmark for evaluating the quality of structured, part-level representations within both vision and multimodal foundation models.
