arXiv

RescueBench: Can Embodied Agents Save Lives in the Wild?

June 2, 2026 · Kui Wu, Beiyu Guo, Hao Chen, ShuHang Xu, Yuling Li, Yongdan Zeng, Zhoujun Li, Yizhou Wang, Fangwei Zhong

Title: RescueBench: Can Embodied Agents Save Lives in the Wild?

Abstract:

Search-and-rescue (SAR) operations demand that embodied agents navigate unknown terrain under multimodal uncertainty, execute complex multi-stage interactions, and maintain spatial memory over extended time horizons. Current benchmarks assess these capabilities in isolation, so they do not show how failures in one capability accumulate once the capabilities are combined into a realistic, composite workflow. To address this gap, we present RescueBench, a photo-realistic diagnostic benchmark that models SAR as a four-stage process: multimodal exploration, target rescue, memory-guided return, and final handoff.

By combining sequential task composition with granular, stage-level evaluation, RescueBench makes it possible to analyze how errors in exploration and memory propagate through embodied rescue scenarios. The benchmark features five progressive difficulty tiers that vary environmental complexity, clue ambiguity, and spatial hierarchy. It also includes an automated pipeline for episode generation and annotation, which supports scalable evaluation and training.
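As an illustration of what stage-level evaluation over a sequential task can look like, the following is a minimal sketch and not the benchmark's actual scoring code. The stage names follow the four-stage process described above; the episode format (one boolean outcome per stage) and the function name `stage_report` are assumptions made for this example. For each stage it reports the success rate among episodes that reached the stage and the cumulative success rate over all episodes, which together show where a composite task loses episodes.

```python
# Stage names taken from the four-stage process described in the abstract.
STAGES = ("exploration", "rescue", "return", "handoff")


def stage_report(episodes):
    """Summarize per-stage outcomes for sequentially composed episodes.

    `episodes` is a list of tuples of booleans, one entry per stage in
    STAGES order (a hypothetical format, not RescueBench's own).
    A stage counts as attempted only if every earlier stage succeeded.
    """
    total = len(episodes)
    report = {}
    alive = list(episodes)  # episodes that passed all earlier stages
    for i, name in enumerate(STAGES):
        attempted = len(alive)
        alive = [ep for ep in alive if ep[i]]
        passed = len(alive)
        report[name] = {
            "attempted": attempted,
            # Success rate among episodes that reached this stage;
            # None when no episode reached it.
            "conditional_success": passed / attempted if attempted else None,
            # Fraction of all episodes that succeeded up to and including this stage.
            "cumulative_success": passed / total if total else None,
        }
    return report
```

A large gap between a stage's conditional and cumulative success indicates that most losses happened upstream, which is the kind of error propagation the stage-level diagnostics are meant to expose.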

Our evaluation of seven baseline models, an oracle reference, and human participants shows that no baseline model completes the full task at the highest difficulty level. Stage-specific diagnostics indicate that autonomous exploration is the primary failure mode, while spatial memory is a secondary, independent bottleneck. These findings suggest that existing topological vision-language navigation and map-based methods do not adequately address either limitation. The code for RescueBench is publicly available at https://github.com/wukui-muc/RescueBench.
