Beyond Rigid: Benchmarking Non-Rigid Video Editing
Abstract: As video generation models face increasing pressure to accurately manipulate physical dynamics, the evaluation landscape must evolve beyond mere appearance fidelity and semantic alignment. Non-rigid video editing serves as a particularly insightful testbed for this purpose, as diverse materials impose unique physical constraints. In this study, we present NRVBench, a diagnostic benchmark designed for non-rigid video editing. The objective is to alter deformable motion while keeping irrelevant areas unchanged and ensuring material-specific plausibility. NRVBench comprises 180 carefully curated videos spanning six physics-based categories, accompanied by 2,340 detailed editing instructions, 360 multiple-choice questions, and pixel-precise masks. Additionally, we introduce NRVE-Acc, a structured protocol based on Vision-Language Models (VLMs) that breaks down editing success into three components: adherence to instructions, material-aware deformation plausibility, and temporal coherence with motion cues. Our experiments on several representative inference-time video editing methods highlight a significant discrepancy between traditional metrics and physics-aware perceptual success. Specifically, models that maintain high appearance fidelity or achieve strong global alignment can still fail when handling non-rigid dynamics. Finally, we present VM-Edit, a straightforward region-conditioned editing baseline that isolates the foreground while stabilizing the background, thereby revealing the inherent trade-off between stability and plasticity.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





