Formalizing the Binding Problem
Abstract
World representations arguably encode not only feature data, such as color or shape, but also "binding information," which specifies how these features group into distinct objects (e.g., that a specific blue object is circular). Any system capable of interpreting multi-object scenes must resolve this binding problem. Yet it remains unclear whether contemporary deep learning models actually acquire binding information for features, despite evidence that Vision Transformers (ViTs) can identify which image patches belong together. Because ViT-based architectures frequently misattribute features to the wrong objects, particularly when objects share similar characteristics, it is tempting to assume that binding information is scarce in these models. To test this assumption, we formalize the binding problem in an information-theoretic framework and propose a probing methodology that quantifies binding information within model representations. Our experiments measure binding information in several components of ViTs, including the image summary token ([CLS]) and the spatial tokens. Using datasets that pose distinct binding challenges, such as feature sharing, occlusion, and natural feature variation, we benchmark several pre-trained ViTs. Our findings underscore binding as a critical component of robust visual recognition and reasoning.
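As a toy illustration of what "binding information" could mean in information-theoretic terms, the sketch below compares two hand-built scene representations on a two-object world. It is not the paper's probing method: the scene generator, the two representations (`bag_of_features`, `bound_objects`), and the binary binding variable are all hypothetical names and assumptions introduced here for illustration only.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical toy world: every object has one color and one shape.
COLORS = ("red", "blue")
SHAPES = ("circle", "square")
OBJECTS = list(product(COLORS, SHAPES))

# Scenes: ordered pairs of objects that differ in both color and shape,
# so every scene contains the same features and only the binding varies.
SCENES = [
    (a, b)
    for a, b in product(OBJECTS, OBJECTS)
    if a[0] != b[0] and a[1] != b[1]
]


def binding_label(scene):
    """Binding variable (assumed): 1 if the blue object is the circle, else 0."""
    return int(("blue", "circle") in scene)


def bag_of_features(scene):
    """Representation that keeps which features occur, but not which go together."""
    colors = tuple(sorted(obj[0] for obj in scene))
    shapes = tuple(sorted(obj[1] for obj in scene))
    return colors + shapes


def bound_objects(scene):
    """Representation that keeps each (color, shape) pair intact."""
    return frozenset(scene)


def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


# Treat the scenes as a uniform distribution and measure how much each
# representation tells us about the binding variable.
mi_bag = mutual_information([(bag_of_features(s), binding_label(s)) for s in SCENES])
mi_bound = mutual_information([(bound_objects(s), binding_label(s)) for s in SCENES])

print(f"bag of features: {mi_bag:.1f} bits of binding information")
print(f"bound objects:   {mi_bound:.1f} bits of binding information")
```

In this toy world both representations carry identical feature information, yet only the one that keeps color and shape paired per object carries information about which features belong together. A probe applied to learned model representations would be asking the analogous question.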
Source: arXiv



