Title: Attend to Anything: A Foundation Model for Unified Human Attention Modeling
Abstract
Current approaches to human attention (or saliency) modeling remain heavily fragmented, varying significantly across different modalities, environments, and task definitions. As a result, despite advancements in model capacity and the volume of training data, existing solutions are largely confined to specific scenes and tasks, lacking the practical generalization capabilities required for real-world deployment. To overcome these fundamental constraints, we introduce the Attend to Anything Model (AAM), a multi-modal foundation model designed to consolidate attention modeling across a wide spectrum of image, video, and audio-visual tasks and settings. AAM redefines attention as a cognitive entailment relationship structured within a general-to-specific hierarchy, a framework realized through language prompts and hierarchical embeddings situated in hyperbolic space. Additionally, to bridge the gap between static image attention and dynamic video attention, we employ a fluid-dynamics approach, modeling video-frame attention as a diffusive temporal evolution driven by the Fokker--Planck equation. Our extensive evaluation across 16 benchmarks reveals that AAM consistently surpasses state-of-the-art methods, delivering an average performance improvement of 6% across diverse scenarios and achieving roughly a 4$\times$ acceleration in video inference. These findings establish AAM as a robust and principled foundation for future investigations into attention and saliency-related tasks. The associated dataset and code are accessible at https://github.com/wz-zhao/Attend-to-Anything.
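To make the Fokker--Planck view of video attention concrete, the following is a minimal sketch, not the AAM implementation. It treats a per-frame attention map as a probability density p and advances it by one explicit finite-volume step of the generic equation dp/dt = -div(v p) + D lap(p). The abstract does not say how AAM parameterizes or solves this equation, so the function name `fokker_planck_step`, the constant drift `(drift_x, drift_y)`, the scalar `diffusion` coefficient, the unit grid spacing, and the zero-flux image borders are all assumptions made for illustration.

```python
import numpy as np


def fokker_planck_step(p, drift_x=0.0, drift_y=0.0, diffusion=0.2, dt=1.0):
    """Advance a 2D density by one explicit Fokker-Planck step (illustrative only).

    Flux form: dp/dt = -div(J), with J = v * p - D * grad(p).
    Upwind differencing is used for the drift term, and the flux through the
    image border is set to zero, so total mass is conserved.
    Non-negativity holds when dt * (4 * D + |vx| + |vy|) <= 1.
    """
    # Fluxes across interior cell interfaces along x (axis 1): shape (H, W-1).
    upwind_x = p[:, :-1] if drift_x >= 0 else p[:, 1:]
    jx = drift_x * upwind_x - diffusion * (p[:, 1:] - p[:, :-1])

    # Fluxes across interior cell interfaces along y (axis 0): shape (H-1, W).
    upwind_y = p[:-1, :] if drift_y >= 0 else p[1:, :]
    jy = drift_y * upwind_y - diffusion * (p[1:, :] - p[:-1, :])

    # Zero-flux boundaries: pad one zero interface on each outer edge.
    jx = np.pad(jx, ((0, 0), (1, 1)))
    jy = np.pad(jy, ((1, 1), (0, 0)))

    # Discrete divergence of the flux, one value per cell.
    div = (jx[:, 1:] - jx[:, :-1]) + (jy[1:, :] - jy[:-1, :])
    return p - dt * div


# Hypothetical usage: a single fixation at the center of a 9x9 map,
# evolved over 10 frames with a small rightward drift.
frame0 = np.zeros((9, 9))
frame0[4, 4] = 1.0

frame_t = frame0.copy()
for _ in range(10):
    frame_t = fokker_planck_step(frame_t, drift_x=0.1)

# Horizontal center of mass of the evolved map (column index units).
com_x = float((frame_t.sum(axis=0) * np.arange(9)).sum())
```

In this toy setting the map spreads out while its total mass stays at 1, and the center of mass moves in the drift direction. A learned model would presumably predict the drift and diffusion terms from video content rather than fix them as constants, but that is a guess about the design and is not stated in the abstract.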
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC





