Reusing Trajectories in Policy Gradients Enables Fast Convergence
Title: Accelerating Policy Gradient Convergence via Trajectory Reuse
Abstract:
Policy gradient (PG) methods are a powerful class of reinforcement learning algorithms, especially for continuous control tasks. However, their reliance on fresh on-policy data makes them sample-inefficient: they require $O(\epsilon^{-2})$ trajectories to reach an $\epsilon$-approximate stationary point. To mitigate this inefficiency, practitioners often reuse information from earlier iterations, such as past gradients or trajectories, which moves the method toward off-policy PG. Gradient reuse has been studied extensively, yielding improved rates of up to $O(\epsilon^{-3/2})$, but the theoretical consequences of reusing past trajectories remain largely unexplored despite the approach's intuitive appeal.
This work provides the first rigorous theoretical proof that reusing past off-policy trajectories can substantially accelerate PG convergence. We introduce RT-PG (Reusing Trajectories - Policy Gradient), a new algorithm that merges on-policy and off-policy data from the most recent $\omega$ iterations by means of a power mean-corrected multiple importance weighting estimator. Our analysis shows that RT-PG attains a sample complexity of $\tilde{O}(\epsilon^{-2}\omega^{-1})$. When all available past trajectories are reused, this becomes $\tilde{O}(\epsilon^{-1})$, the best known rate in the literature for PG methods. Empirical evaluations further confirm that the method outperforms baselines that achieve state-of-the-art rates.
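To make the idea of merging data from several past policies concrete, the following is a minimal sketch of a multiple importance weighting gradient estimator. It is not RT-PG itself: the abstract does not specify the power mean correction, so this sketch falls back on the standard balance heuristic, in which each sample is weighted by the target density divided by the mixture of all behavior densities. It also assumes a one-step problem with a scalar Gaussian policy of fixed standard deviation. All function and variable names are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def score(x, mean, std):
    """Derivative of log N(x; mean, std^2) with respect to the mean."""
    return (x - mean) / (std * std)


def mis_policy_gradient(batches, target_mean, std, reward_fn):
    """Balance-heuristic estimate of dJ/d(target_mean).

    batches: list of (behavior_mean, actions) pairs, one per reused
    iteration (assumed layout, for illustration only).
    """
    total = sum(len(actions) for _, actions in batches)
    grad = 0.0
    for _, actions in batches:
        for a in actions:
            # Mixture of all behavior policies, weighted by batch size.
            mixture = sum(
                len(acts) / total * gaussian_pdf(a, m, std)
                for m, acts in batches
            )
            weight = gaussian_pdf(a, target_mean, std) / mixture
            grad += weight * score(a, target_mean, std) * reward_fn(a)
    return grad / total


# Usage: reuse samples from three earlier policies to estimate the
# gradient at a new policy mean.
random.seed(0)
past_means = (0.0, 0.2, 0.4)
batches = [(m, [random.gauss(m, 1.0) for _ in range(2000)]) for m in past_means]
estimate = mis_policy_gradient(batches, 0.5, 1.0, lambda a: a)
print(estimate)
```

With a single batch drawn from the target policy itself, every weight equals 1 and the estimator reduces to the ordinary on-policy score-function estimate. With more batches, samples collected under older policies contribute as well, which is the mechanism the $\omega$-window above relies on.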
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC






