arXiv

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

June 2, 2026 · Liyang Li, Muzhi Zhu, Zhiyue Zhao, Hengyu Zhao, Ke Liu, Linhao Zhong, Hao Chen, Chunhua Shen

Abstract

Humans can align their perspective with a target image by actively moving their head and body, yet work on spatial intelligence in foundation models has focused predominantly on the passive interpretation of static, pre-recorded data. To bridge this gap, we present Target Viewpoint Reproduction (TVR), an active task in which an agent moves through a 3D environment until its visual observation matches a specified target image. Accompanying the task is TVRBench, a benchmark built on indoor simulation that evaluates agents across scene scales and across the visual complexity of target viewpoints.
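The closed-loop structure of a TVR episode can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from TVRBench: the planar pose, the action names, the step sizes, the success tolerances, and the `oracle_policy`, which reads the target pose directly. In the actual task the agent sees only images, which is exactly what makes it hard.

```python
import math
from dataclasses import dataclass

@dataclass
class Pose:
    x: float
    y: float
    yaw_deg: float  # heading in degrees, counterclockwise from the +x axis

STEP_M = 0.25    # hypothetical forward step size
TURN_DEG = 15.0  # hypothetical turn increment

def step(pose, action):
    """Apply one discrete action (hypothetical action set) to a pose."""
    if action == "turn_left":
        return Pose(pose.x, pose.y, (pose.yaw_deg + TURN_DEG) % 360.0)
    if action == "turn_right":
        return Pose(pose.x, pose.y, (pose.yaw_deg - TURN_DEG) % 360.0)
    if action == "move_forward":
        rad = math.radians(pose.yaw_deg)
        return Pose(pose.x + STEP_M * math.cos(rad),
                    pose.y + STEP_M * math.sin(rad),
                    pose.yaw_deg)
    raise ValueError(f"unknown action: {action}")

def signed_yaw_diff(desired, current):
    """Signed smallest rotation from current to desired, in [-180, 180)."""
    return (desired - current + 180.0) % 360.0 - 180.0

def yaw_error(a, b):
    """Smallest absolute angle between two headings, in degrees."""
    return abs(signed_yaw_diff(a, b))

def is_success(pose, target, pos_tol=0.5, yaw_tol=15.0):
    """Assumed success test: close enough in both position and heading."""
    dist = math.hypot(pose.x - target.x, pose.y - target.y)
    return dist <= pos_tol and yaw_error(pose.yaw_deg, target.yaw_deg) <= yaw_tol

def oracle_policy(pose, target):
    """Stand-in policy with privileged pose access (not available in TVR)."""
    dist = math.hypot(target.x - pose.x, target.y - pose.y)
    if dist > STEP_M:
        # Translation phase: face the target position first.
        desired = math.degrees(math.atan2(target.y - pose.y, target.x - pose.x))
    else:
        # Rotation phase: match the target heading in place.
        desired = target.yaw_deg
    diff = signed_yaw_diff(desired, pose.yaw_deg)
    if abs(diff) > TURN_DEG / 2:
        return "turn_left" if diff > 0 else "turn_right"
    return "move_forward" if dist > STEP_M else "stop"

def run_episode(policy, start, target, max_steps=50):
    """Observe, act, repeat until the policy stops or the budget runs out."""
    pose = start
    for t in range(max_steps):
        action = policy(pose, target)
        if action == "stop":
            return is_success(pose, target), t
        pose = step(pose, action)
    return is_success(pose, target), max_steps
```

Calling `run_episode(oracle_policy, Pose(0.0, 0.0, 0.0), Pose(2.0, 1.0, 90.0))` exercises both translation and in-place rotation. A foundation-model agent would replace `oracle_policy` with a function that maps the current and target images, plus any visual history, to an action.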

Current capabilities remain limited. On the evaluation split, the best open-source and closed-source models achieve success rates of only 7.8% and 12.0%, respectively. Detailed analysis reveals two primary constraints: off-the-shelf models struggle to use visual history across multiple turns, and their performance degrades sharply when the task demands body translation rather than in-place rotation alone. Together, these findings point to a deficiency in translating perceived spatial error into appropriate embodied actions.

To address these shortcomings, we develop a comprehensive post-training framework for TVR. The framework integrates expert-trajectory visual-action Supervised Fine-Tuning (SFT), rationale-supervised Chain-of-Thought (CoT) SFT, offline single-turn Group Relative Policy Optimization (GRPO), and on-policy multi-turn GRPO with live simulator rollouts. Visual-action SFT yields the largest improvement, boosting a 9B open-source model’s success rate to 50.8%. Multi-turn GRPO then provides targeted refinement in multi-room scenes, raising overall performance to 51.4%. In contrast, CoT supervision and single-turn GRPO hinder closed-loop performance. These results position TVRBench as a testbed for assessing and developing foundation models capable of active perception and action in 3D spaces. Code, data, and models are available at https://github.com/aim-uofa/TVRBench.
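As commonly formulated, GRPO (Group Relative Policy Optimization) scores each rollout against the other rollouts sampled for the same prompt, rather than against a learned value function. The sketch below shows only that group-relative normalization step. The reward values, the `eps` constant, and the choice of population standard deviation are illustrative assumptions; the abstract does not specify how TVR rewards are defined.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize a group's rewards to zero mean and (roughly) unit scale.

    `rewards` holds one scalar per rollout sampled for the same task instance.
    `eps` avoids division by zero when every rollout earns the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical example: four rollouts for one target view, with reward 1.0
# when the episode succeeds and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

With these binary rewards, successful rollouts receive positive advantages and failed ones negative advantages of equal magnitude. When every rollout in a group gets the same reward, all advantages are zero and the group contributes no learning signal.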
