arXiv

MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

June 2, 2026 · Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov

Abstract:

Training language model agents for multi-agent strategic interaction is hard because the value of an action often depends on subsequent events that may never occur, on rule-breaking moves, or on choices made by competing players. Conventional reinforcement learning assumes that a reward can be assigned at every step; this assumption breaks down in environments where outcomes are entangled across time and across agents. To address this, we present a novel framework for delayed per-step reward attribution with eligibility gating. An episode lifecycle and postprocessing pipeline computes rewards only after an episode has ended, traces them back to the steps that produced them according to task-specific semantics, and filters out any step that lacks the dependent information needed for valid training.
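The postprocessing idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Step` record, its `depends_on` key, and the `resolved_events` mapping are hypothetical names introduced here, and the sketch assumes that each step's value is determined by one later event that may or may not have occurred by the end of the episode.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Step:
    """One agent action recorded during an episode (hypothetical structure)."""
    step_id: int
    action: str
    depends_on: str                 # assumed key of the later event that decides this step's value
    reward: Optional[float] = None  # left unset while the episode is running
    eligible: bool = False          # set during postprocessing only


def postprocess_episode(steps: List[Step], resolved_events: Dict[str, float]) -> List[Step]:
    """Assign rewards after the episode ends and gate out unresolved steps.

    resolved_events maps an event key to the reward it produced. A step whose
    event never occurred keeps reward=None and is excluded from training.
    """
    for step in steps:
        if step.depends_on in resolved_events:
            step.reward = resolved_events[step.depends_on]
            step.eligible = True
    return [s for s in steps if s.eligible]


# Example episode: the event for the third step never happened.
episode = [
    Step(0, "cooperate", depends_on="round_1"),
    Step(1, "defect", depends_on="round_2"),
    Step(2, "cooperate", depends_on="round_3"),
]
outcomes = {"round_1": 1.0, "round_2": -0.5}
trainable = postprocess_episode(episode, outcomes)
```

In this sketch only the first two steps reach the trainer. The third step is dropped rather than given a zero reward, which is one plausible reading of eligibility gating.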

Combined with asynchronous rollout generation using vLLM's continuous batching, curriculum-based opponent sampling, and multi-level stratified batch construction, this method enables stable and sample-efficient reinforcement learning in multi-agent settings. We evaluated the approach on the MindGames Arena benchmark at NeurIPS 2025. A single 8-billion-parameter open-source model trained with our method matched or exceeded significantly larger proprietary systems, such as GPT-5, in direct head-to-head competition, and it took first place in both the Open (unrestricted) and Efficient (≤8B parameters) tracks.
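The abstract does not say which levels the stratification uses, so the following is only a sketch of the general technique. It assumes two hypothetical levels, `game` and `role`, and fills a batch by drawing from each stratum in round-robin order so that no single stratum dominates.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def build_stratified_batch(
    samples: List[Dict],
    batch_size: int,
    keys: Sequence[str] = ("game", "role"),  # assumed stratification levels
) -> List[Dict]:
    """Fill a batch by cycling over strata defined by the given keys."""
    strata = defaultdict(list)
    for sample in samples:
        strata[tuple(sample[k] for k in keys)].append(sample)

    # Sort strata by key so the draw order is deterministic.
    queues = [list(group) for _, group in sorted(strata.items())]

    batch: List[Dict] = []
    while len(batch) < batch_size and any(queues):
        for queue in queues:
            if queue and len(batch) < batch_size:
                batch.append(queue.pop(0))
    return batch


# Example: game "A" is over-represented in the raw samples.
pool = [
    {"game": "A", "role": 0, "id": 0},
    {"game": "A", "role": 0, "id": 1},
    {"game": "A", "role": 0, "id": 2},
    {"game": "B", "role": 0, "id": 3},
    {"game": "B", "role": 1, "id": 4},
]
batch = build_stratified_batch(pool, batch_size=4)
```

With this pool, every stratum contributes one sample before any stratum contributes a second. A production pipeline would likely sample randomly within each stratum rather than take items in arrival order.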
