arXiv

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

June 2, 2026 · Mingyuan Fan, Weiguang Han, Daixin Wang, Cen Chen, Zhiqiang Zhang, Jun Zhou

Abstract: This paper presents a multi-turn interactive framework that evaluates reasoning as active evidence gathering followed by belief updating. Large language models (LLMs) receive only the task instructions. They must formulate queries to probe a hidden environment, integrate the partial observations they receive across turns, and decide when to submit a final answer.
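The query, observe, update, and submit cycle described above can be illustrated with a toy example. The sketch below is not taken from the benchmark: `HiddenNumberGame`, `BisectionAgent`, and `play` are hypothetical names, the game is a simple hidden-number puzzle, and a bisection rule stands in for the LLM. It only shows the shape of the interaction, in which the agent never sees the secret and must decide for itself when its belief is narrow enough to answer.

```python
from dataclasses import dataclass


@dataclass
class HiddenNumberGame:
    """Toy stand-in for one executable game (hypothetical, not from the paper)."""
    secret: int
    low: int = 1
    high: int = 16
    turns: int = 0

    def query(self, guess: int) -> str:
        """Return a partial observation about the hidden state."""
        self.turns += 1
        if guess < self.secret:
            return "higher"
        if guess > self.secret:
            return "lower"
        return "equal"

    def submit(self, answer: int) -> bool:
        """Score the final answer; the episode ends here."""
        return answer == self.secret


class BisectionAgent:
    """Placeholder for an LLM: keeps a belief interval and narrows it each turn."""

    def __init__(self, low: int, high: int):
        self.low, self.high = low, high

    def ready(self) -> bool:
        # The agent chooses to stop once only one candidate remains.
        return self.low == self.high

    def next_query(self) -> int:
        return (self.low + self.high) // 2

    def update(self, guess: int, observation: str) -> None:
        # Belief update: discard every candidate the observation rules out.
        if observation == "higher":
            self.low = guess + 1
        elif observation == "lower":
            self.high = guess - 1
        else:
            self.low = self.high = guess


def play(game: HiddenNumberGame, agent: BisectionAgent, max_turns: int = 10):
    """Run one episode and return (success, number_of_queries)."""
    while not agent.ready() and game.turns < max_turns:
        guess = agent.next_query()
        agent.update(guess, game.query(guess))
    return game.submit(agent.low), game.turns


game = HiddenNumberGame(secret=11)
agent = BisectionAgent(game.low, game.high)
print(play(game, agent))  # (True, 3)
```

In this toy setting, the pair returned by `play` corresponds loosely to the two basic quantities the abstract mentions: whether the final answer was correct and how many interactions it took.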

Beyond standard metrics such as success rate and interaction efficiency, our approach assesses contextual robustness under controlled perturbations and metacognitive adaptation through counterfactual revision and necessity judgment. To instantiate the framework, we built a benchmark of 474 executable games spanning five configuration search spaces, each corresponding to a difficulty level, and used it to evaluate a wide range of state-of-the-art LLMs.
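The abstract does not give formulas for these metrics, so the following is only one plausible way to summarise them. It assumes each episode is logged as a record with a `success` flag and a `turns` count. The function names and the sample records are made up for illustration and are not results from the paper.

```python
from statistics import mean


def success_rate(episodes):
    """Fraction of episodes whose final answer was correct."""
    return sum(e["success"] for e in episodes) / len(episodes)


def mean_turns(episodes):
    """Average number of queries per episode, a simple efficiency proxy."""
    return mean(e["turns"] for e in episodes)


def robustness_drop(clean, perturbed):
    """Loss in success rate when the same games are run with perturbed context."""
    return success_rate(clean) - success_rate(perturbed)


# Made-up episode logs, for illustration only.
clean = [
    {"success": True, "turns": 3},
    {"success": True, "turns": 4},
    {"success": False, "turns": 5},
    {"success": True, "turns": 4},
]
perturbed = [
    {"success": True, "turns": 4},
    {"success": False, "turns": 6},
    {"success": False, "turns": 5},
    {"success": True, "turns": 5},
]

print(success_rate(clean))                # 0.75
print(mean_turns(clean))                  # 4
print(robustness_drop(clean, perturbed))  # 0.25
```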

Our results show that the benchmark separates models clearly, with significant gaps in both overall success and interaction efficiency. Contextual perturbations cause moderate but consistent performance declines, whereas counterfactual revision and necessity judgment cause substantially larger drops in accuracy.

Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
