Trajectory Data Suffices for Statistically Efficient Policy Evaluation in Fixed-Horizon Offline RL with Linear $q^\pi$-Realizability and Concentrability
Abstract: This work studies finite-horizon offline reinforcement learning (RL) with function approximation, covering both policy optimization and policy evaluation. Foster et al. (2021) showed that statistically efficient learning is impossible for either task when the only assumptions are $q^\pi$-realizability (the state-action value function of every policy is linearly realizable) and data concentrability (the data provide sufficient coverage). However, Tkachuk et al. (2024) recently gave a statistically efficient algorithm for policy optimization under these assumptions, provided the data consist of trajectories. Building on this result, we propose a statistically efficient learner for policy evaluation that requires nothing beyond the same assumptions and trajectory data. We also show that a tighter analysis improves the sample complexity of the policy optimization algorithm of Tkachuk et al. (2024).
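For reference, below is one standard way to state the two assumptions. The notation is introduced here and is not taken from the paper: a known feature map $\phi_h : \mathcal{S}\times\mathcal{A} \to \mathbb{R}^d$ for each stage $h \in [H]$, data distributions $\mu_h$, and state-action occupancy measures $d^\pi_h$. The paper's exact definitions, for example norm bounds on the parameters or an allowed misspecification error, may differ.

$$
\forall \pi,\ \forall h \in [H],\ \exists\, \theta^\pi_h \in \mathbb{R}^d:\quad q^\pi_h(s,a) = \langle \phi_h(s,a), \theta^\pi_h \rangle \quad \text{for all } (s,a) \in \mathcal{S}\times\mathcal{A},
$$

$$
C := \sup_{\pi}\, \max_{h \in [H]} \left\| \frac{d^\pi_h}{\mu_h} \right\|_\infty < \infty .
$$

The first line is linear $q^\pi$-realizability: a single $d$-dimensional feature map must represent the action-value function of every policy, not only that of an optimal one. The second is a density-ratio form of concentrability: at every stage, the data distribution covers whatever any policy can reach, up to the factor $C$.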
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
