Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding
Abstract:
Although on-policy distillation provides dense supervision that benefits the training of compact reasoning models, its optimization dynamics in the multimodal setting remain underexplored. This study challenges the conventional monolithic treatment of Vision-Language Model (VLM) distillation by mathematically decomposing the loss into two components: a language prior and visual grounding. Our analysis shows that the gradients of these two components are nearly orthogonal, which suggests that aligning with the teacher’s language distribution is geometrically distinct from replicating its visual perception. As a result, standard optimization passively follows a suboptimal compromise path that only implicitly balances the two objectives. Hypothesizing that visual grounding is the main bottleneck in vision-language reasoning, we propose Visual Gradient Steering (VGS), a technique that dynamically adjusts the update vector to emphasize the visual subspace. Extensive experiments across various distillation configurations and complex multimodal benchmarks show that VGS markedly outperforms the standard monolithic formulation of on-policy distillation, delivering superior grounding performance at negligible additional training cost.
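The geometric intuition in the abstract can be illustrated with a toy sketch. The abstract does not give the VGS update rule, so the code below is not the authors' method: the vectors `g_lang` and `g_vis`, the helper `steer`, and the weight `lam` are all hypothetical, and the steering rule is assumed to be a simple upweighting of the visual-grounding gradient. It only shows that when two gradient components are nearly orthogonal, their plain sum sits between them, while upweighting one component rotates the update toward it.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity between two vectors of equal length.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def steer(g_lang, g_vis, lam):
    # Hypothetical steering rule (an assumption, not taken from the paper):
    # keep the language-prior gradient and scale the visual gradient by lam.
    return [gl + lam * gv for gl, gv in zip(g_lang, g_vis)]

# Toy stand-ins for the two gradient components; nearly orthogonal by design.
g_lang = [1.0, 0.0, 0.1]
g_vis = [0.0, 1.0, 0.1]

monolithic = steer(g_lang, g_vis, lam=1.0)  # plain sum, as in a monolithic loss
steered = steer(g_lang, g_vis, lam=3.0)     # update biased toward the visual component

print("cos(g_lang, g_vis)     =", round(cosine(g_lang, g_vis), 4))
print("cos(monolithic, g_vis) =", round(cosine(monolithic, g_vis), 4))
print("cos(steered, g_vis)    =", round(cosine(steered, g_vis), 4))
```

In this toy setting the steered update has a higher cosine similarity with the visual component than the plain sum does. A real implementation would operate on per-parameter gradients of the two loss terms and, per the abstract, would adjust the weighting dynamically rather than with a fixed `lam`.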
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





