Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games
Title: Assessing Interactive Reasoning in Large Language Models: A Hierarchical Benchmark Using Executable Games
Abstract: This paper presents a novel multi-turn interactive framework designed to evaluate reasoning capabilities by treating the process as active evidence gathering and subsequent belief updating. In this setup, large language models (LLMs) are provided solely with task instructions and are required to formulate specific queries to access a concealed environment. They must then synthesize these partial observations across multiple turns to determine the optimal moment for submitting a final response.
Beyond traditional metrics such as success rate and interaction efficiency, our approach assesses contextual robustness by introducing controlled perturbations, and it measures metacognitive adaptation via counterfactual revision and necessity judgment. To operationalize this framework, we developed a benchmark comprising 474 executable games. The games are run across five distinct configuration search spaces, each corresponding to a specific difficulty level, and are used to evaluate a wide range of state-of-the-art LLMs.
Our findings indicate that the benchmark is highly effective at distinguishing model performance, revealing significant disparities not only in overall success but also in how efficiently models interact. Empirical analysis demonstrates that while contextual perturbations result in moderate yet consistent performance declines, tasks involving counterfactual revision and necessity judgment trigger substantially larger drops in accuracy.
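To make the interaction protocol concrete, the following is a minimal sketch of one query-observe-submit episode together with the two basic metrics named above (success rate and interaction efficiency). It is an illustration under stated assumptions, not the benchmark's implementation: the toy game `HiddenNumberGame`, the `bisecting_agent` policy that stands in for an LLM, the `evaluate` helper, and the turn budget are all hypothetical names and choices introduced here.

```python
from dataclasses import dataclass


@dataclass
class HiddenNumberGame:
    """Hypothetical toy game: the secret is hidden and queries give partial feedback."""
    secret: int
    low: int = 1
    high: int = 100
    turns: int = 0  # number of queries issued so far

    def query(self, guess: int) -> str:
        """Return a partial observation about the hidden state."""
        self.turns += 1
        if guess < self.secret:
            return "higher"
        if guess > self.secret:
            return "lower"
        return "equal"

    def submit(self, answer: int) -> bool:
        """Final answer; the episode ends here."""
        return answer == self.secret


def bisecting_agent(game: HiddenNumberGame, max_turns: int = 10):
    """Stand-in for an LLM: gather evidence, update the belief, decide when to submit."""
    low, high = game.low, game.high  # belief: the secret lies in [low, high]
    while low < high and game.turns < max_turns:
        mid = (low + high) // 2
        feedback = game.query(mid)
        if feedback == "equal":
            low = high = mid
        elif feedback == "higher":
            low = mid + 1
        else:
            high = mid - 1
    # Submit once the belief has collapsed to one candidate or the budget is spent.
    return game.submit(low), game.turns


def evaluate(secrets, max_turns: int = 10):
    """Aggregate success rate and mean number of queries over a set of episodes."""
    results = [bisecting_agent(HiddenNumberGame(secret=s), max_turns) for s in secrets]
    success_rate = sum(ok for ok, _ in results) / len(results)
    mean_turns = sum(t for _, t in results) / len(results)
    return success_rate, mean_turns


if __name__ == "__main__":
    rate, turns = evaluate(range(1, 101))
    print(f"success rate: {rate:.2f}, mean queries per episode: {turns:.2f}")
```

Shrinking `max_turns` or widening the `low`/`high` range in this sketch plays a role loosely analogous to moving between difficulty levels: the same policy has to reach a decision with less evidence or over a larger search space.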
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC