Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models
Enhancing Image-to-Video Motion by Adjusting Reference Frame Influence
Abstract
Compared to their text-to-video counterparts, image-to-video (I2V) models frequently produce content that lacks sufficient dynamism, appearing overly static. Although previous strategies have attempted to address this limitation by altering or dampening the conditioning signal derived from the input image, these methods typically demand extra training resources or compromise the visual fidelity to the source picture. This study identifies "reference-frame dominance" as the primary driver behind this suppression of movement. We find that during generation, non-reference frames in I2V architectures assign disproportionately high self-attention weights to key tokens associated with the reference frame. This behavior leads to the excessive propagation of reference-specific information across the temporal sequence, thereby stifling inter-frame dynamics. Leveraging this insight, we introduce DyMoS (Dynamic Motion Slider), a model-agnostic approach that requires no additional training. DyMoS works by recalibrating the attention flow from newly generated frames back to the reference frame during the early stages of denoising. The method preserves both the original model weights and the input image, relying instead on a single scalar parameter that allows for continuous adjustment of motion intensity. Evaluations across various state-of-the-art I2V backbones confirm that DyMoS reliably enhances motion dynamics without sacrificing visual quality or adherence to the reference image.
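The abstract describes the mechanism only at a high level, so the following is a minimal sketch of one way such a rebalancing could look, not the authors' implementation. It assumes tokens are ordered frame by frame with frame 0 as the reference, and that the scalar acts multiplicatively on the unnormalized attention weights that non-reference queries assign to reference-frame keys (done here by adding log(scale) to the logits before the softmax). The function name `rebalanced_attention` and the parameters `tokens_per_frame`, `scale`, `step`, and `early_steps` are hypothetical names chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def rebalanced_attention(q, k, v, tokens_per_frame, scale, step, early_steps):
    """Single-head self-attention with a damped path to the reference frame.

    q, k, v: arrays of shape (num_tokens, dim), tokens ordered frame by frame;
             frame 0 is assumed to be the reference frame.
    scale:   hypothetical motion scalar, must be > 0; values below 1 reduce
             attention paid to reference-frame keys.
    step / early_steps: the adjustment is applied only while step < early_steps
             (assumed stand-in for "early stages of denoising").
    """
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    if step < early_steps and scale != 1.0:
        bias = np.zeros((n, n))
        # Rows: queries from non-reference frames. Columns: reference-frame keys.
        bias[tokens_per_frame:, :tokens_per_frame] = np.log(scale)
        logits = logits + bias
    weights = softmax(logits, axis=-1)
    return weights @ v, weights


# Toy demonstration on random data (not real model activations).
rng = np.random.default_rng(0)
frames, tpf, dim = 4, 3, 8
q = rng.normal(size=(frames * tpf, dim))
k = rng.normal(size=(frames * tpf, dim))
v = rng.normal(size=(frames * tpf, dim))

_, w_base = rebalanced_attention(q, k, v, tpf, scale=1.0, step=0, early_steps=10)
_, w_damp = rebalanced_attention(q, k, v, tpf, scale=0.5, step=0, early_steps=10)
_, w_late = rebalanced_attention(q, k, v, tpf, scale=0.5, step=20, early_steps=10)

# Attention mass that each non-reference query places on the reference frame.
ref_mass_base = w_base[tpf:, :tpf].sum(axis=1)
ref_mass_damp = w_damp[tpf:, :tpf].sum(axis=1)
print("mean reference mass, baseline:", ref_mass_base.mean())
print("mean reference mass, damped:  ", ref_mass_damp.mean())
```

In this sketch the model weights and inputs are untouched; only the attention logits change, the reference frame's own queries are left as they were, and later denoising steps fall back to ordinary attention. How the actual method maps its scalar onto the attention weights may differ.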
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC



