Decision-Focused On-Policy Learning for Contextual Linear Optimization with Partial Feedback
Title: On-Policy Learning for Contextual Linear Optimization with Partial Feedback: A Decision-Focused Approach
Abstract
Decision-focused learning (DFL) prioritizes the quality of downstream decisions over the standalone accuracy of predictive models during training. Existing DFL approaches for contextual linear optimization typically rely on offline datasets and assume that the objective cost vector is fully observed. This work instead introduces an on-policy learning framework for sequential contextual linear optimization under partial feedback, a setting that extends the conventional bandit feedback paradigm. The proposed method employs a stochastic predict-then-optimize policy, which draws a prediction of the cost vector from a conditional distribution and then solves the downstream linear optimization problem using that prediction.
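As a minimal sketch of the stochastic predict-then-optimize policy described above, the following Python example samples a cost vector and then solves a top-$k$ selection problem. Everything concrete here is an assumption made for illustration and is not taken from the paper or its repository: the linear mean model, the isotropic Gaussian with fixed `sigma`, the cost-minimization convention, and the names `predict_mean`, `sample_cost`, `solve_top_k`, and `policy`.

```python
import random

def predict_mean(context, weights):
    """Hypothetical linear model: mean cost of each item given the context."""
    return [sum(w * x for w, x in zip(row, context)) for row in weights]

def sample_cost(mean, sigma, rng):
    """Draw a cost prediction from an isotropic Gaussian around the mean."""
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]

def solve_top_k(cost, k):
    """Downstream linear problem: pick the k items with the smallest cost.

    Minimizing cost . z over binary z with sum(z) == k is solved by sorting.
    """
    chosen = sorted(range(len(cost)), key=lambda i: cost[i])[:k]
    return [1 if i in chosen else 0 for i in range(len(cost))]

def policy(context, weights, sigma, k, rng):
    """Stochastic predict-then-optimize: sample a cost vector, then optimize."""
    sampled = sample_cost(predict_mean(context, weights), sigma, rng)
    return sampled, solve_top_k(sampled, k)

rng = random.Random(0)
# Illustrative parameters: 4 items, 2 context features.
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.2]]
sampled, decision = policy([1.0, 2.0], weights, sigma=0.1, k=2, rng=rng)
```

Because the decision depends on the sampled cost vector rather than on a point prediction, repeated calls with the same context can return different decisions, which is what makes the policy stochastic.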
To update the distributional model, we propose a hybrid gradient estimator with two components. The first is a score function estimator, which gives an unbiased policy gradient estimate that may exhibit high variance. The second is a decision-focused plug-in component that uses an auxiliary estimate of the latent cost vector to exploit the structure of the downstream optimization task; its usefulness increases as the accuracy of this auxiliary estimate improves. We establish an $\mathcal{O}(T^{-1/2})$ bound on the average squared norm of the policy gradient, matching the convergence rate typical of standard non-convex stochastic gradient descent (SGD). Empirical evaluations on top-$k$ selection, shortest path, combinatorial pricing, and a real-world energy-scheduling benchmark show that the hybrid gradient strategy yields lower cumulative regret than contextual-bandit baselines. This improvement holds for both Gaussian and more complex conditional generative models. The associated code is available at https://github.com/Joeyetinghan/on-policy-bandit-dfl.
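The score function component can be illustrated with a generic one-sample REINFORCE-style estimate. This is a hedged sketch under stated assumptions, not the paper's estimator: it assumes a Gaussian policy with a linear mean and fixed `sigma`, a top-$k$ minimization task, and feedback limited to the realized total cost of the chosen items. The function name `score_function_gradient` and all parameter values are hypothetical. The decision-focused plug-in component is not sketched, because the abstract does not specify its form.

```python
import random

def score_function_gradient(context, weights, sigma, k, true_cost, rng):
    """One-sample score-function (REINFORCE) estimate of the policy gradient."""
    mean = [sum(w * x for w, x in zip(row, context)) for row in weights]
    sampled = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    # Decision: the k items with the smallest sampled cost.
    chosen = sorted(range(len(sampled)), key=lambda i: sampled[i])[:k]
    # Assumed bandit-style feedback: only the realized total cost is observed.
    observed = sum(true_cost[i] for i in chosen)
    # Gradient of the Gaussian log-density with respect to its mean.
    score = [(c - m) / sigma ** 2 for c, m in zip(sampled, mean)]
    # Chain rule through the linear mean: d mean_i / d W[i][j] = context[j].
    grad = [[observed * s * x for x in context] for s in score]
    return grad, observed

rng = random.Random(0)
context = [1.0, 2.0]
true_cost = [3.0, 1.0, 2.0, 0.5]  # latent costs, never shown to the learner
weights = [[0.0, 0.0] for _ in true_cost]
step_size = 0.01
for _ in range(200):
    grad, observed = score_function_gradient(
        context, weights, 1.0, 2, true_cost, rng
    )
    # Plain SGD step on the expected cost of the sampled decision.
    weights = [
        [w - step_size * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grad)
    ]
```

The estimate uses only the scalar `observed`, so it never needs the full cost vector, but each update is scaled by a single noisy sample, which is the source of the high variance mentioned above.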
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





