arXiv

Libra: Efficient Resource Management for Agentic RL Post-Training

June 3, 2026 · Kaiwen Chen, Xin Tan, Jingzong Li, Hong Xu

Title: Libra: Optimizing Resource Allocation for Post-Training Agentic Reinforcement Learning

Abstract:

Reinforcement learning (RL) has emerged as a standard post-training framework for large language models (LLMs), expanding their capabilities beyond simple preference alignment to encompass complex reasoning and multi-turn agentic interactions. However, the rollout phase in agentic RL introduces significant resource management hurdles. By invoking tools to generate trajectories, this stage creates long-tailed and non-stationary workloads that defy traditional resource-management assumptions.

Three primary challenges define this landscape. First, trajectory lengths are long-tailed: a small number of trajectories account for most of the rollout makespan. Second, the rollout and training phases are markedly asymmetric in their sensitivity to sequence length, their memory requirements, and their compute patterns. Third, as the RL policy evolves, the distribution of trajectory lengths shifts over time, so any fixed resource split becomes increasingly inefficient.
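The first challenge can be illustrated with a toy calculation. The sketch below is not taken from the paper. It assumes a synchronous rollout step with one worker per trajectory, and the durations are made-up numbers chosen only to show the effect of a single long trajectory.

```python
# Hypothetical per-trajectory rollout times in seconds (made-up numbers).
DURATIONS = [10, 12, 11, 9, 13, 10, 12, 95]


def rollout_makespan(durations):
    """Assuming one worker per trajectory, the step ends when the slowest finishes."""
    return max(durations)


def idle_fraction(durations):
    """Fraction of total worker-time spent waiting for the slowest trajectory."""
    makespan = rollout_makespan(durations)
    busy = sum(durations)
    return 1 - busy / (makespan * len(durations))


if __name__ == "__main__":
    print("makespan with tail:", rollout_makespan(DURATIONS))
    print("idle fraction with tail:", round(idle_fraction(DURATIONS), 3))
    # Dropping the single long trajectory shows how much it dominates.
    print("makespan without tail:", rollout_makespan(DURATIONS[:-1]))
    print("idle fraction without tail:", round(idle_fraction(DURATIONS[:-1]), 3))
```

In this toy setting one trajectory out of eight sets the makespan, and most worker-time is spent waiting for it.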

To address these issues, we introduce Libra, a system built on two core mechanisms. The first is a periodic global resource planner that simultaneously optimizes GPU allocation across both rollout and training clusters. This planner utilizes an elastic hybrid pool to facilitate rapid, non-blocking reallocation of workers between stages. The second mechanism is a causality-driven multi-level feedback queue (C-MLFQ) scheduler. Instead of relying on unreliable length predictions, this scheduler directs requests to heterogeneous rollout buckets based on causal signals extracted from tool-return outcomes.
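The abstract does not specify how C-MLFQ works internally, so the following is only a minimal sketch of a generic multi-level feedback queue driven by a tool-return signal. The names `Request`, `ToyMLFQ`, and the boolean `needs_more_turns` are hypothetical, and the rule "move one bucket down after each tool return that asks for more turns" is an assumption made for illustration, not Libra's actual policy.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    level: int = 0  # current bucket index; 0 is the bucket for short trajectories


class ToyMLFQ:
    """Toy multi-level feedback queue; bucket semantics are assumed, not Libra's."""

    def __init__(self, num_levels=3):
        self.queues = [deque() for _ in range(num_levels)]

    def submit(self, req):
        # Enqueue the request in the bucket matching its current level.
        self.queues[req.level].append(req)

    def next_request(self):
        # Serve the lowest-numbered non-empty bucket first, FIFO within a bucket.
        for queue in self.queues:
            if queue:
                return queue.popleft()
        return None

    def on_tool_return(self, req, needs_more_turns):
        # Hypothetical signal: the tool result implies the trajectory continues.
        if not needs_more_turns:
            return  # trajectory finished; nothing to re-enqueue
        # Demote by one bucket, saturating at the last (longest) bucket.
        req.level = min(req.level + 1, len(self.queues) - 1)
        self.submit(req)
```

The design point this sketch tries to capture is that a request's bucket is updated from an observed outcome after each tool call, rather than from a length prediction made up front.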

Evaluations conducted on 48 A800 GPUs demonstrate that Libra outperforms baseline methods, achieving up to a 3.0$\times$ increase in throughput and converging up to 2.5$\times$ faster in terms of reward.
