arXiv

Adversarial Dual On-Policy Distillation from Expressive Teacher

June 2, 2026 · Zhenglin Wan, Jingxuan Wu, Xingrui Yu, Chubin Zhang, Mingcong Lei, Bo An, Ivor W. Tsang, Yang You

Abstract: Diffusion models and flow matching have improved behavioral cloning by capturing multi-modal expert actions, but these approaches remain offline supervised learners. The policy is trained only on expert states and receives no corrective feedback on the states it actually visits during execution. On-policy distillation (OPD) is a natural remedy; however, conventional OPD assumes a strong, fixed teacher, which is rarely available in demonstration-only control.

To address this, we introduce FA-OPD, an adversarial dual on-policy distillation framework that co-trains a lightweight MLP student with a Flow Matching (FM) teacher derived from demonstrations. The teacher provides two complementary signals on student rollouts. First, the reward channel optimizes expert-likeness over state-action pairs through long-horizon policy optimization, which drives online exploration. Second, the action channel supplies dense, local targets at the states the student visits, which stabilizes exploitation. Combined, reward distillation generalizes beyond the demonstration points, while action distillation keeps exploration close to expert-like behavior.
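
The two channels can be illustrated with a minimal Python sketch. It is not the authors' implementation: `dual_signals`, `teacher_sample`, and `expert_score` are hypothetical names, the squared-error action loss and the reward-to-go computation are assumed stand-ins, and the toy teacher below is a fixed function rather than a co-trained FM model. The sketch only shows how one student rollout could yield both a long-horizon reward signal and dense per-state action targets.

```python
import numpy as np


def discounted_returns(rewards, gamma=0.99):
    """Reward-to-go G_t = r_t + gamma * G_{t+1} for a single rollout."""
    rewards = np.asarray(rewards, dtype=float)
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def dual_signals(states, student_actions, teacher_sample, expert_score, gamma=0.99):
    """Compute both teacher signals on one student rollout.

    teacher_sample(state) -> action and expert_score(state, action) -> float
    are hypothetical callables standing in for the FM teacher.
    """
    states = np.asarray(states, dtype=float)
    student_actions = np.asarray(student_actions, dtype=float)

    # Reward channel (assumed form): score each visited state-action pair,
    # then accumulate rewards-to-go for a policy-optimization update.
    rewards = np.array([expert_score(s, a) for s, a in zip(states, student_actions)])
    returns = discounted_returns(rewards, gamma)

    # Action channel (assumed form): regress student actions onto teacher
    # actions queried at the states the student itself visited.
    targets = np.array([teacher_sample(s) for s in states])
    action_loss = float(np.mean(np.sum((student_actions - targets) ** 2, axis=-1)))
    return returns, action_loss


# Toy 1-D usage with made-up teacher functions (illustration only).
toy_teacher = lambda s: -s
toy_score = lambda s, a: float(np.exp(-np.sum((a + s) ** 2)))
states = np.array([[1.0], [2.0]])
actions = np.array([[-1.0], [0.0]])
returns, action_loss = dual_signals(states, actions, toy_teacher, toy_score, gamma=0.5)
```

In this toy rollout the first student action matches the teacher and the second does not, so the action loss is nonzero and the second step earns a lower reward. How the two signals are weighted, and how the teacher itself is updated adversarially, is not specified in the abstract and is left out of the sketch.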

Across six benchmarks spanning robot navigation, manipulation, and locomotion, FA-OPD outperforms strong baselines and is substantially more robust to noisy or sparse demonstrations.

Source code: https://github.com/vanzll/FA-OPD

Source: arXiv Generated at: 2026-06-02 00:00:00 UTC