Off-Policy Learning in Large Action Spaces: Optimization Matters More Than Estimation
Title: In Large Action Spaces, Optimization is More Critical Than Estimation for Off-Policy Learning
Abstract:
Off-policy evaluation (OPE) and off-policy learning (OPL) are foundational for decision-making in offline contextual bandits. Recent progress in OPL has largely focused on refining OPE estimators with better statistical properties, on the assumption that better estimators automatically lead to better policies. This estimator-centric view has theoretical merit, but it overlooks a significant practical obstacle: difficult optimization landscapes.
This study presents theoretical analysis and empirical evidence that existing OPL methods suffer from severe optimization difficulties, which worsen as the action space grows. We show that policy parametrizations tailored to specific estimators can alleviate some of these issues but do not fully resolve them. We therefore investigate simpler weighted log-likelihood objectives. These objectives have markedly better optimization properties and yield policies that are competitive with, and often better than, those obtained from more complex estimators. These findings highlight the need to treat optimization explicitly when designing OPL algorithms for large action spaces.
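To make the contrast in the abstract concrete, the following minimal sketch compares an estimator-based objective with a weighted log-likelihood objective on synthetic logged data. It is an illustration under stated assumptions, not the authors' method: the linear softmax policy, the inverse propensity scoring (IPS) estimator as the representative OPE objective, the weights `rewards / propensities`, and all function and variable names are hypothetical choices made for this example.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def logged_action_log_probs(theta, X, actions):
    # log pi_theta(a_i | x_i) for a linear softmax policy with logits X @ theta.
    log_pi = log_softmax(X @ theta)
    return log_pi[np.arange(len(actions)), actions]

def ips_objective(theta, X, actions, rewards, propensities):
    # IPS estimate of the policy value: mean of r_i * pi_theta(a_i|x_i) / pi_0(a_i|x_i).
    pi_a = np.exp(logged_action_log_probs(theta, X, actions))
    return np.mean(rewards * pi_a / propensities)

def weighted_log_likelihood(theta, X, actions, weights):
    # Weighted log-likelihood of the logged actions: mean of w_i * log pi_theta(a_i|x_i).
    return np.mean(weights * logged_action_log_probs(theta, X, actions))

# Synthetic logged data (assumption: uniform logging policy, binary rewards).
rng = np.random.default_rng(0)
n, d, K = 200, 5, 50                      # samples, context dimension, actions
X = rng.normal(size=(n, d))
actions = rng.integers(K, size=n)
propensities = np.full(n, 1.0 / K)
rewards = rng.binomial(1, 0.3, size=n).astype(float)
weights = rewards / propensities          # assumption: one possible nonnegative weighting

# Two arbitrary parameter settings and their midpoint, to probe curvature.
theta_a = rng.normal(size=(d, K))
theta_b = rng.normal(size=(d, K))
theta_mid = 0.5 * (theta_a + theta_b)

wll_mid = weighted_log_likelihood(theta_mid, X, actions, weights)
wll_avg = 0.5 * (weighted_log_likelihood(theta_a, X, actions, weights)
                 + weighted_log_likelihood(theta_b, X, actions, weights))
ips_at_zero = ips_objective(np.zeros((d, K)), X, actions, rewards, propensities)
```

With nonnegative weights and logits that are linear in `theta`, the weighted log-likelihood is concave in `theta`, so `wll_mid` is never below `wll_avg`. The IPS objective is a weighted sum of softmax probabilities and is not concave in general, which is one simple way to see why the two objectives can differ in how easy they are to optimize.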
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





