Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues
Title: Vision-Language Models Confuse Head Position for Gaze: Insights from Nonverbal Conversation Cues
Abstract: Gaze direction is a fundamental nonverbal signal that both children and adults use effortlessly. This study evaluates whether Vision-Language Models (VLMs) can infer where a person is looking. To build a robust set of evaluation stimuli, we collected 1,360 real-world photographs of people looking at specific objects placed on a table. Crucially, we manipulated the subjects’ head orientation: in some images the head pointed toward the gaze target, in others it faced a distractor object, and in the remainder head position was left uncontrolled.
Our analysis revealed a significant performance gap between VLMs and human observers. After ruling out potential confounds such as image resolution and object recognition ability, we found that the gap is driven primarily by the models’ reliance on head orientation, rather than eye appearance, to estimate gaze direction. Evidence suggests that this bias stems from training data rather than model architecture, a conclusion supported by a proof-of-concept experiment in which a transformer-based vision model was fine-tuned. Future research should examine whether these results hold across deep learning approaches trained on current datasets, and whether better data quality can resolve the issue across architectures. Understanding the root cause of this error is essential for building technologies that interpret gaze accurately and thereby support more effective human-machine interaction.
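As a rough illustration of how a head-orientation bias could be quantified under the three conditions described above, the following Python sketch tabulates per-condition accuracy and the rate at which answers follow the head rather than the eyes. The condition labels (`congruent`, `incongruent`, `uncontrolled`), the record layout, and the toy trials are assumptions made for this example. They are not the study's data, naming, or analysis code.

```python
from collections import defaultdict

# Hypothetical trial records: (condition, gaze_target, head_target, model_answer).
# Toy values for illustration only; these are NOT results from the study.
TRIALS = [
    ("congruent", "cup", "cup", "cup"),
    ("congruent", "book", "book", "book"),
    ("incongruent", "cup", "book", "book"),
    ("incongruent", "book", "cup", "book"),
    ("uncontrolled", "cup", None, "cup"),
    ("uncontrolled", "book", None, "cup"),
]


def summarize(trials):
    """Return per-condition accuracy and head-following rate."""
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "head": 0})
    for condition, gaze, head, answer in trials:
        c = counts[condition]
        c["n"] += 1
        # An answer is correct when it names the object the eyes are directed at.
        c["correct"] += answer == gaze
        # Count head-following only when the head points at a distractor.
        c["head"] += head is not None and head != gaze and answer == head
    return {
        cond: {
            "accuracy": c["correct"] / c["n"],
            "head_following": c["head"] / c["n"],
        }
        for cond, c in counts.items()
    }


SUMMARY = summarize(TRIALS)
for cond, stats in SUMMARY.items():
    print(cond, stats)
```

In a sketch like this, a model that relies on head orientation would show high accuracy when head and gaze agree and an elevated head-following rate when the head faces a distractor.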
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





