arXiv

COP-Q: Safety-First Reinforcement Learning for Robot Control via Cholesky-Ordered Projection

June 4, 2026 · Guopeng Li, Moritz A. Zanger, Matthijs T. J. Spaan, Julian F. P. Kooij

Title: COP-Q: Prioritizing Safety in Reinforcement Learning for Robot Control through Cholesky-Ordered Projection

Abstract: Safe robot control requires maximizing cumulative return while strictly satisfying safety constraints. Off-policy safe reinforcement learning methods typically train separate critic ensembles for the reward and safety Q-values and handle the uncertainty of each objective in isolation. This objective-wise treatment ignores the correlation between objectives and often yields overly conservative value estimates that hurt sample efficiency. We introduce Cholesky-Ordered Projection Q-learning (COP-Q), a safety-first framework that incorporates inter-objective covariance into vector-valued Q-value estimation. COP-Q constructs a generalized confidence bound in the joint Q-value space and uses Cholesky factorization to encode objective priorities sequentially. This preserves the conservatism needed for safety while dynamically reducing unnecessary conservatism in the reward objective. The resulting estimates are used in both temporal-difference target computation and actor optimization. COP-Q adds negligible computational cost and integrates with most modern deep Q-learning architectures. Experiments on robot locomotion tasks in Brax and safe navigation tasks in Safety-Gymnasium, under both hard and soft safety regimes, show that COP-Q achieves robust safety with sample efficiency that is competitive with or superior to established baselines.
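The abstract does not give the exact form of the confidence bound, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes a small ensemble of paired (cost Q, reward Q) estimates, lists safety first in the priority order, and forms the bound as `mean + L z`, where `L` is the Cholesky factor of the ensemble covariance. The function names, the toy ensemble values, the shift vector `z`, and the convention that a higher cost Q is worse are all assumptions made for illustration.

```python
import math

def mean_and_cov(samples):
    """Sample mean vector and unbiased covariance matrix of equal-length vectors."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(d)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov

def cholesky(a, jitter=1e-8):
    """Lower-triangular L with L L^T = a + jitter * I."""
    d = len(a)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] + jitter - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def ordered_bound(samples, z):
    """Joint bound mean + L z; objectives must be listed in priority order.

    Because L is lower triangular, the first objective is shifted by its full
    standard deviation, and each later objective is shifted by the part
    explained by earlier objectives plus its residual deviation.
    """
    mean, cov = mean_and_cov(samples)
    L = cholesky(cov)
    return [mean[i] + sum(L[i][k] * z[k] for k in range(i + 1))
            for i in range(len(mean))]

def independent_bound(samples, z):
    """Objective-wise baseline: each objective shifted by its own std only."""
    mean, cov = mean_and_cov(samples)
    return [mean[i] + z[i] * math.sqrt(cov[i][i]) for i in range(len(mean))]

# Toy ensemble of (cost Q, reward Q) pairs with positive correlation (assumed values).
ensemble = [(1.0, 10.0), (2.0, 12.0), (3.0, 13.0), (4.0, 15.0), (5.0, 15.0)]
# Pessimistic shifts: raise the cost estimate, lower the reward estimate.
z = [1.0, -1.0]

ordered = ordered_bound(ensemble, z)
independent = independent_bound(ensemble, z)
print("ordered     (cost, reward):", ordered)
print("independent (cost, reward):", independent)
```

In this toy example both bounds are equally pessimistic about the cost, since the first row of `L` holds only the cost standard deviation. The ordered reward estimate is less pessimistic than the objective-wise one because only the reward variance not explained by the cost is penalized. This need not hold for every correlation structure or choice of `z`.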
