Non-Uniform Noise-to-Signal Ratio in the REINFORCE Policy-Gradient Estimator
Abstract: Policy-gradient methods are a staple of reinforcement learning, yet practitioners frequently see training become unstable or stagnate as optimization advances. This study investigates the issue by analyzing the noise-to-signal ratio (NSR) of the policy-gradient estimator, defined as the estimator’s variance (noise) divided by the squared norm of the true gradient (signal). Our primary findings show that the NSR of the REINFORCE estimator can be characterized exactly, without approximation, in two settings: finite-horizon linear systems with Gaussian policies and linear state feedback, and finite-horizon polynomial systems with Gaussian policies and polynomial feedback. The characterization takes the form of either closed-form expressions or numerical moment-evaluation algorithms. For broader settings with general nonlinear dynamics and highly expressive policies, including those with neural-network components, we establish a general upper bound on the variance. These tools allow a direct assessment of how the NSR varies across policy parameters and along optimization trajectories such as those of SGD or Adam. Our experiments reveal that the NSR landscape is markedly non-uniform and typically rises as the policy nears an optimal solution. Under certain conditions the NSR diverges, which can induce training instability and lead to policy collapse.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
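
To make the NSR definition in the abstract concrete, the following is a minimal Monte Carlo sketch, not the paper's exact closed-form or moment-evaluation method. It assumes a scalar linear system x_{t+1} = a x_t + b u_t with quadratic cost and a Gaussian policy u_t ~ N(k x_t, sigma^2). It also assumes that "variance" means the trace of the covariance of the single-rollout REINFORCE estimator. All names and parameter values (`reinforce_samples`, `empirical_nsr`, `a`, `b`, `q`, `r`, `sigma`, `x0`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def reinforce_samples(k, n_rollouts=20000, horizon=5, a=0.9, b=0.5,
                      q=1.0, r=0.1, sigma=0.5, x0=1.0, seed=0):
    """Draw single-rollout REINFORCE gradient samples with respect to the gain k.

    Assumed setup (illustrative only): x_{t+1} = a*x_t + b*u_t,
    stage cost q*x_t^2 + r*u_t^2, policy u_t ~ N(k*x_t, sigma^2).
    """
    rng = np.random.default_rng(seed)
    x = np.full(n_rollouts, x0, dtype=float)
    score = np.zeros(n_rollouts)  # sum over t of d/dk log pi(u_t | x_t)
    cost = np.zeros(n_rollouts)   # total trajectory cost
    for _ in range(horizon):
        eps = sigma * rng.standard_normal(n_rollouts)
        u = k * x + eps
        # d/dk log N(u; k*x, sigma^2) = (u - k*x) * x / sigma^2
        score += eps * x / sigma**2
        cost += q * x**2 + r * u**2
        x = a * x + b * u
    # REINFORCE estimator: (sum of score terms) * (trajectory cost)
    return score * cost


def empirical_nsr(g):
    """Estimate NSR = trace(Cov[g]) / ||E[g]||^2 from samples of shape (N,) or (N, d)."""
    g = np.asarray(g, dtype=float)
    g = g.reshape(len(g), -1)
    noise = g.var(axis=0, ddof=1).sum()   # trace of the sample covariance
    signal = np.sum(g.mean(axis=0) ** 2)  # squared norm of the sample mean
    return noise / signal


if __name__ == "__main__":
    # Compare the estimated NSR at a few (arbitrary) feedback gains.
    for k in (-0.2, -0.8, -1.4):
        print(k, empirical_nsr(reinforce_samples(k)))
```

Because the signal term is itself estimated from noisy samples, this ratio is only a rough estimate: it becomes unreliable exactly where the true gradient is small, which is the regime the abstract highlights as problematic.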





