Global News Digest

arXiv

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

Abstract: This paper presents a novel multi-turn interactive framework that evaluates reasoning as active evidence gathering followed by belief updating. Large language models (LLMs) receive only the task instructions and must formulate queries to probe a hidden environment, integrate the resulting partial observations across turns, and decide when to submit a final answer.
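
The interaction loop described in the abstract can be illustrated with a toy example. The sketch below is not taken from the paper: `HiddenNumberGame`, `run_episode`, and the binary-search agent are hypothetical stand-ins, assuming an environment that answers queries with partial observations and scores a single final submission.

```python
from dataclasses import dataclass


@dataclass
class HiddenNumberGame:
    """Hypothetical hidden environment; the secret is never shown to the agent."""
    secret: int
    low: int = 1
    high: int = 100
    turns: int = 0

    def query(self, guess: int) -> str:
        """Return a partial observation about the secret and count the turn."""
        self.turns += 1
        if guess < self.secret:
            return "higher"
        if guess > self.secret:
            return "lower"
        return "equal"

    def submit(self, answer: int) -> bool:
        """Score the final answer."""
        return answer == self.secret


def run_episode(game: HiddenNumberGame, max_turns: int = 10) -> tuple[bool, int]:
    """Stand-in agent: its belief is an interval that shrinks after each query."""
    low, high = game.low, game.high
    # Keep querying while the belief is still uncertain and turns remain.
    while low < high and game.turns < max_turns:
        guess = (low + high) // 2
        observation = game.query(guess)
        if observation == "higher":
            low = guess + 1
        elif observation == "lower":
            high = guess - 1
        else:
            low = high = guess
    # Submit once the interval has collapsed or the turn budget is spent.
    return game.submit(low), game.turns


success, turns_used = run_episode(HiddenNumberGame(secret=37))
print(success, turns_used)
```

In this toy setting the decision of when to submit reduces to checking whether the belief interval has collapsed; an LLM in the benchmark would have to make that judgment from its own reading of the observations.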

Beyond traditional metrics such as success rate and interaction efficiency, our approach assesses contextual robustness through controlled perturbations and measures metacognitive adaptation through counterfactual revision and necessity judgment. To operationalize this framework, we developed a benchmark of 474 executable games spanning five configuration search spaces, each corresponding to a difficulty level, and used it to evaluate a wide range of state-of-the-art LLMs.

Our findings indicate that the benchmark separates models clearly, revealing large gaps in both overall success and interaction efficiency. Contextual perturbations cause moderate but consistent performance declines, whereas tasks involving counterfactual revision and necessity judgment cause substantially larger drops in accuracy.
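
The abstract names success rate, interaction efficiency, and performance drops under perturbation but does not define them. The sketch below shows one plausible way to compute such quantities from per-episode results; the function names, the choice of mean turns on successful episodes as an efficiency proxy, and the relative-drop formula are all assumptions, not the paper's definitions.

```python
from statistics import mean


def summarize(episodes: list[tuple[bool, int]]) -> dict[str, float]:
    """Summarize (success, turns_used) pairs; metric definitions are assumed."""
    if not episodes:
        raise ValueError("need at least one episode")
    turns_on_success = [turns for ok, turns in episodes if ok]
    return {
        "success_rate": len(turns_on_success) / len(episodes),
        # Efficiency proxy: fewer turns on solved games means more efficient.
        "mean_turns_on_success": (
            mean(turns_on_success) if turns_on_success else float("nan")
        ),
    }


def relative_drop(baseline: float, perturbed: float) -> float:
    """Fraction of baseline performance lost under a perturbation (assumed form)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - perturbed) / baseline


# Made-up episode results, purely to show the call pattern.
stats = summarize([(True, 3), (False, 10), (True, 5)])
print(stats["success_rate"], stats["mean_turns_on_success"])
```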


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
