Adversarial Dual On-Policy Distillation from Expressive Teacher

Abstract: While recent advancements in diffusion models and flow-matching techniques have enhanced behavioral cloning by capturing multi-modal expert actions, these approaches remain fundamentally offline supervised learners. Consequently, the policy is trained exclusively on expert states and lacks corrective feedback regarding the states it actually encounters during execution. On-policy distillation (OPD) presents a logical solution to this limitation; however, traditional OPD frameworks rely on the availability of a strong, fixed teacher, a condition that is rarely met in demonstration-only control scenarios.

To address this, we introduce FA-OPD, an adversarial dual on-policy distillation framework. This method involves co-training a lightweight MLP student alongside a Flow Matching (FM) teacher that is derived from demonstrations. The teacher delivers two distinct, complementary signals based on student rollouts. First, the reward channel optimizes an expert-likeness metric across state-action pairs, thereby facilitating online exploration via long-horizon policy optimization. Second, the action channel provides dense, local targets at states visited by the student, which serves to stabilize exploitation. FA-OPD integrates these components to ensure that reward distillation promotes generalization beyond specific demonstration points, while action distillation maintains exploration within the vicinity of expert-like behaviors.
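
As a rough illustration of how the two teacher signals could be combined into one student objective, the sketch below adds a reward term and an action term computed on a single student rollout. It is not the authors' implementation: the abstract does not specify the policy-optimization algorithm or the action loss, so a REINFORCE-style surrogate stands in for the reward channel and a mean squared error stands in for the action channel. The function names and the parameters `gamma` and `lam` are hypothetical.

```python
def discounted_returns(rewards, gamma=0.99):
    """Discounted return-to-go for each step of one rollout."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def action_distillation_loss(student_actions, teacher_actions):
    """Mean squared error between student and teacher actions.

    Both lists are indexed by the states the student visited; the teacher
    actions are assumed to come from querying the FM teacher at those states.
    """
    total = 0.0
    for a_s, a_t in zip(student_actions, teacher_actions):
        total += sum((x - y) ** 2 for x, y in zip(a_s, a_t))
    return total / len(student_actions)


def dual_objective(log_probs, rewards, student_actions, teacher_actions,
                   gamma=0.99, lam=1.0):
    """Scalar loss combining both channels (assumed form, for illustration).

    log_probs: student log-probabilities of the actions it took.
    rewards:   per-step expert-likeness scores assigned by the teacher.
    lam:       hypothetical weight on the action-distillation term.
    """
    returns = discounted_returns(rewards, gamma)
    # Reward channel: REINFORCE-style surrogate (assumption, not from the paper).
    reward_term = -sum(lp * g for lp, g in zip(log_probs, returns)) / len(returns)
    # Action channel: dense local targets at student-visited states.
    action_term = action_distillation_loss(student_actions, teacher_actions)
    return reward_term + lam * action_term
```

In this sketch, a larger `lam` pulls the student toward the teacher's actions at the states it visits, while a smaller `lam` gives more weight to the long-horizon reward signal.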

Across six benchmarks spanning robot navigation, manipulation, and locomotion, FA-OPD outperforms strong baselines and is significantly more robust to noisy or sparse demonstrations.

Source code: https://github.com/vanzll/FA-OPD


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
