arXiv

Positional Encodings Anchor Spatial Structure in Vision Transformers: A Geometric Perspective on Robustness

June 2, 2026 · Mahmoud Mannes

Abstract: Positional embeddings (PEs) are widely recognized to influence the performance and robustness of Vision Transformers (ViTs), but their specific role in shaping internal spatial representations remains poorly understood. This study investigates how different PE types affect the representational geometry of ViTs and links these geometric changes to model robustness against distribution shifts that disrupt visual content. To measure spatial structure within token representations, we propose a new metric, Spatial Similarity Distance Correlation (SSDC). Our analysis shows that ViTs trained without PEs do develop non-trivial spatial structure; however, this structure is content-dependent and disintegrates when tokens are permuted. In contrast, all examined PE types (learned absolute, sinusoidal, and rotary encodings) drive a consistent shift toward an index-based spatial organization. As a result, the representations in these models remain stable under content-disrupting perturbations, and the models are significantly more robust to such distribution shifts. Furthermore, although different PEs produce distinct depth-wise trajectories of spatial structure, their robustness is largely comparable, with only minor variations across encoding schemes. This suggests that robustness depends more on the existence of a stable positional reference frame than on the particular encoding mechanism. These findings provide a geometric explanation of how positional encodings shape internal representations and offer guidance for the principled design of future encoding strategies.
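
The abstract names SSDC but does not define it, so the following is only a toy sketch of the general idea and not the paper's metric. It assumes a score of this kind can be approximated by the Pearson correlation between pairwise cosine similarity of tokens and their closeness on the patch grid. The 4x4 grid, the sinusoidal stand-in features, and the helper names (`feature`, `spatial_score`, `build`) are all hypothetical. The sketch contrasts tokens that carry only content, which moves when patches are shuffled, with tokens that also carry an index-based component, which stays attached to the grid position.

```python
import math
import random

GRID = 4  # hypothetical 4x4 patch grid, tokens stored in row-major order


def feature(row, col, k=0.5):
    # Smooth 4-d toy feature: nearby grid cells get similar vectors.
    return [math.sin(k * row), math.cos(k * row),
            math.sin(k * col), math.cos(k * col)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0


def spatial_score(tokens, grid=GRID):
    # Assumed stand-in for an SSDC-like score: correlate pairwise token
    # similarity with negative grid distance, so a high value means that
    # tokens close together on the grid are also similar.
    sims, closeness = [], []
    for i in range(len(tokens)):
        ri, ci = divmod(i, grid)
        for j in range(i + 1, len(tokens)):
            rj, cj = divmod(j, grid)
            sims.append(cosine(tokens[i], tokens[j]))
            closeness.append(-math.hypot(ri - rj, ci - cj))
    return pearson(sims, closeness)


cells = [divmod(i, GRID) for i in range(GRID * GRID)]
content = [feature(r, c) for r, c in cells]   # stand-in for smooth image content
position = [feature(r, c) for r, c in cells]  # stand-in for an index-based signal

identity = list(range(GRID * GRID))
perm = identity[:]
random.Random(0).shuffle(perm)  # content-disrupting patch shuffle


def build(order, with_position):
    # Content follows the shuffled patch; the positional part stays with
    # the grid index it was assigned to.
    return [content[src] + (position[i] if with_position else [])
            for i, src in enumerate(order)]


scores = {
    "content_only_intact": spatial_score(build(identity, False)),
    "content_only_shuffled": spatial_score(build(perm, False)),
    "with_position_intact": spatial_score(build(identity, True)),
    "with_position_shuffled": spatial_score(build(perm, True)),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In this toy setup the content-only score is high on the intact grid and drops once patches are shuffled, while the tokens with an index-based component keep a clearly positive score. That mirrors the qualitative contrast described in the abstract, but it says nothing about the magnitudes or depth-wise trends reported in the paper.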
