arXiv

A Unified Framework for Locality in Scalable MARL

Abstract:

Networked multi-agent reinforcement learning (MARL) enables scalable planning by restricting each agent to consider only a local neighborhood within the agent graph. This approach is valid under the assumption of value locality, which posits that perturbations at one agent have a diminishing impact on the long-term value of distant agents. In average-reward settings, establishing locality typically relies on the Dobrushin row-sum bound applied to a specific matrix, $C^\pi$, which characterizes the dependence of each agent's next state on the current states of others. To facilitate computation, previous studies have bounded this matrix using the supremum over all joint actions. While this policy-independent bound is straightforward, it is often overly conservative, particularly when the policy in use does not select worst-case actions.

In this work, we decompose $C^\pi$ into distinct components that isolate environmental sensitivity from policy sensitivity, expressed as $C^\pi \preceq E^{\mathrm s}+E^{\mathrm a}\Pi(\pi)$. Here, $E^{\mathrm s}$ quantifies the variation of the next state relative to the current state, $E^{\mathrm a}$ captures the sensitivity to current actions, and $\Pi(\pi)$ reflects the policy’s responsiveness to state changes. The spectral radius of the matrix $H^\pi := E^{\mathrm s}+E^{\mathrm a}\Pi(\pi)$ governs the decay rate of the average-reward Poisson solution. Our spectral certificate condition, $\rho(H^\pi)<1$, is strictly less restrictive than the standard row-sum condition $\|H^\pi\|_\infty<1$ applied to the same matrix. Consequently, our framework remains effective in scenarios where prior Dobrushin-style methods, which rely on policy-independent action-supremum bounds, fail.
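
The gap between the spectral condition and the row-sum condition can be seen on a toy instance. The sketch below is illustrative only: the two-agent matrices `E_s`, `E_a`, and `Pi` are hypothetical values chosen for this example, not quantities from the paper. It builds $H^\pi = E^{\mathrm s}+E^{\mathrm a}\Pi(\pi)$, compares its row-sum norm with its spectral radius, and prints the largest entry of a few matrix powers.

```python
import numpy as np

# Hypothetical two-agent sensitivity matrices (illustrative values, not from the paper).
E_s = np.array([[0.0, 0.8],
                [0.1, 0.0]])   # sensitivity of next state to current states
E_a = np.array([[0.5, 0.0],
                [0.0, 0.5]])   # sensitivity of next state to current actions
Pi = np.array([[0.0, 0.8],
               [0.2, 0.0]])    # sensitivity of the policy to state changes

# H = E_s + E_a @ Pi, following the decomposition in the abstract.
H = E_s + E_a @ Pi

# Row-sum (infinity) norm: largest absolute row sum.
row_sum_norm = np.abs(H).sum(axis=1).max()

# Spectral radius: largest eigenvalue magnitude.
spectral_radius = np.abs(np.linalg.eigvals(H)).max()

print(f"row-sum norm    = {row_sum_norm:.3f}")
print(f"spectral radius = {spectral_radius:.3f}")

# When the spectral radius is below 1, powers of H shrink geometrically.
for k in (2, 10, 20):
    print(k, np.linalg.matrix_power(H, k).max())
```

For these values $H^\pi$ has rows $(0, 1.2)$ and $(0.2, 0)$, so the row-sum norm is $1.2$ and the row-sum test fails, while the spectral radius is $\sqrt{0.24}\approx 0.49$ and the spectral test passes.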

Furthermore, for softmax policies with temperature $\tau$, we demonstrate that $\Pi(\pi)\le L/(2\tau)$, indicating that the softmax temperature directly modulates locality. Leveraging this decay property, we derive a deterministic oracle guarantee for a block-coordinate KL-proximal policy-improvement template, showing that the truncation bias decreases exponentially with respect to the message-passing radius $\kappa$.
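
The role of the temperature can be probed numerically. The abstract does not define $L$; the sketch below assumes, purely for illustration, that $L$ bounds the sup-norm change in the policy logits between two states, and it measures policy sensitivity as the total-variation distance between the two softmax distributions. The helper names (`softmax`, `total_variation`, `worst_tv`) and the sampling ranges are hypothetical choices made for this example.

```python
import math
import random

def softmax(logits, tau):
    """Softmax with temperature tau, shifted by the max logit for stability."""
    m = max(logits)
    weights = [math.exp((z - m) / tau) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def total_variation(p, q):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def worst_tv(L, tau, n_actions=5, trials=2000, seed=0):
    """Largest observed TV distance when logits move by at most L in sup norm.

    Assumption (not from the source): L bounds the sup-norm logit change.
    """
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        z = [rng.uniform(-3.0, 3.0) for _ in range(n_actions)]
        z_shift = [zi + rng.uniform(-L, L) for zi in z]
        tv = total_variation(softmax(z, tau), softmax(z_shift, tau))
        worst = max(worst, tv)
    return worst

L = 1.0
for tau in (0.5, 1.0, 2.0, 4.0):
    # Columns: temperature, worst sampled TV distance, the bound L / (2 * tau).
    print(tau, worst_tv(L, tau), L / (2 * tau))
```

Under this reading of $L$, the sampled distances stay below $L/(2\tau)$ at every temperature, and raising $\tau$ tightens the bound, which is consistent with the claim that temperature modulates locality.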


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
