MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution
Abstract:
Training language model agents for multi-agent strategic interaction is difficult because the value of an action often depends on subsequent events that never occur, on rule-breaking moves, or on the choices of competing players. Conventional reinforcement learning assumes that a reward can be assigned at every step, but this assumption breaks down in environments where outcomes are entangled across time and across agents. To address this, we present a framework that combines delayed per-step reward attribution with eligibility gating. An episode lifecycle and postprocessing pipeline computes rewards only after an episode has ended, traces them back to the steps that produced them according to task-specific semantics, and filters out any step that lacks the dependent information required for valid training.
When combined with asynchronous rollout generation utilizing vLLM’s continuous batching, curriculum-based opponent sampling, and multi-level stratified batch construction, this methodology facilitates stable and sample-efficient reinforcement learning within multi-agent settings. We tested this approach on the MindGames Arena benchmark at NeurIPS 2025. Our results demonstrate that a single 8-billion-parameter open-source model, trained using our method, either matched or exceeded the performance of significantly larger proprietary systems, such as GPT-5, in direct head-to-head competitions. Furthermore, this model secured first place in both the Open (unrestricted) and Efficient (limited to ≤8B parameters) tracks of the competition.
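The delayed attribution and eligibility gating described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Step` fields, the `attribute_rewards` function, the `invalid_penalty` value, and the toy rule that a valid move is scorable only if another player acted after it are all hypothetical choices made for illustration. The abstract states only that rewards are computed after the episode ends, traced back to originating steps, and that steps lacking dependent information are filtered out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Step:
    player: int
    action: str
    valid: bool = True               # hypothetical flag set by the environment
    reward: Optional[float] = None   # stays None until the episode has ended
    eligible: bool = False           # gate: may this step be used for training?


def attribute_rewards(steps: List[Step], final_scores: Dict[int, float],
                      invalid_penalty: float = -0.5) -> List[Step]:
    """Assign rewards after the episode ends and return only eligible steps."""
    for i, step in enumerate(steps):
        if not step.valid:
            # Assumption: a rule violation can be scored without later context.
            step.reward, step.eligible = invalid_penalty, True
            continue
        # Toy dependency: a valid move is scorable only if some other
        # player acted after it, so its consequences were observed.
        answered = any(later.player != step.player for later in steps[i + 1:])
        if answered and step.player in final_scores:
            step.reward, step.eligible = final_scores[step.player], True
        else:
            # Dependent information never arrived: exclude from training.
            step.reward, step.eligible = None, False
    return [s for s in steps if s.eligible]


episode = [
    Step(0, "offer"),
    Step(1, "accept"),
    Step(0, "bad move", valid=False),
    Step(1, "offer"),                # never answered, so it is gated out
]
batch = attribute_rewards(episode, final_scores={0: 1.0, 1: -1.0})
print([(s.player, s.reward) for s in batch])
# prints [(0, 1.0), (1, -1.0), (0, -0.5)]
```

In this sketch the last step receives no reward and is dropped from the training batch, since no later move by another player exists to give it meaning. A real task would replace the toy rule with its own task-specific semantics.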
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC