arXiv

Auteur: Language-Driven Cinematographic Framing for Human-Centric Video Generation

June 2, 2026 · Muhammed Burak Kizil, Enes Sanli, Niloy J. Mitra, Xuelin Chen, Erkut Erdem, Aykut Erdem, Duygu Ceylan

Abstract:

While generative video models have made significant strides in achieving high visual fidelity and temporal consistency, precise camera control remains a persistent challenge. Current frameworks typically treat camera movement as a secondary outcome of pixel generation, resulting in trajectories that are often random, spatially disjointed, and disconnected from the human subjects central to the scene. To address this, we introduce Auteur, a novel approach that enables language-driven, human-centric camera framing within generative video systems.

Our primary observation is that professional cinematographers do not conceptualize shots as trajectories through world space; rather, they define framing relative to the actor, specifying shot size, angle, and composition as variables dependent on human pose and motion. We translate this intuition into a human-centric camera parameterization and develop a Domain-Specific Language (DSL) that can be converted into standard 6-DoF camera parameters. In this pipeline, a fine-tuned multimodal large language model serves as a virtual director, translating natural language prompts and coarse human motion data into sparse DSL keyframes. These keyframes are deterministically interpolated to create continuous camera trajectories, which are subsequently fed into video generators as input.
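The keyframe-to-trajectory step described above can be illustrated with a minimal sketch. It is not the paper's DSL or implementation. The `Keyframe` fields (`distance` as a stand-in for shot size, `azimuth`, `elevation`), the linear interpolation scheme, the y-up coordinate convention, and the zero-roll look-at camera are all assumptions made here for illustration. The abstract says only that keyframes are interpolated deterministically and converted to 6-DoF camera parameters.

```python
import math
from dataclasses import dataclass


@dataclass
class Keyframe:
    """Hypothetical subject-relative framing keyframe (not the paper's DSL)."""
    t: float          # time in seconds
    distance: float   # camera-to-subject distance in metres (proxy for shot size)
    azimuth: float    # horizontal angle around the subject, in degrees
    elevation: float  # vertical angle above the subject's anchor point, in degrees


def lerp(a, b, w):
    """Linear blend between a and b with weight w in [0, 1]."""
    return a + (b - a) * w


def interpolate(keys, t):
    """Deterministically interpolate framing parameters at time t.

    Assumes keyframe times are strictly increasing after sorting;
    times outside the keyframe range clamp to the nearest keyframe.
    """
    keys = sorted(keys, key=lambda k: k.t)
    if t <= keys[0].t:
        k = keys[0]
        return Keyframe(t, k.distance, k.azimuth, k.elevation)
    if t >= keys[-1].t:
        k = keys[-1]
        return Keyframe(t, k.distance, k.azimuth, k.elevation)
    for a, b in zip(keys, keys[1:]):
        if a.t <= t <= b.t:
            w = (t - a.t) / (b.t - a.t)
            return Keyframe(
                t,
                lerp(a.distance, b.distance, w),
                lerp(a.azimuth, b.azimuth, w),
                lerp(a.elevation, b.elevation, w),
            )
    raise ValueError("time not covered by keyframes")


def camera_pose(subject, frame):
    """Convert subject-relative framing into a world-space camera pose.

    subject: (x, y, z) anchor on the person (for example the head), y pointing up.
    Returns (position, forward): the camera position and a unit vector
    pointing from the camera to the subject. Roll is assumed to be zero,
    so position plus forward fixes all six degrees of freedom.
    """
    az = math.radians(frame.azimuth)
    el = math.radians(frame.elevation)
    # Unit direction from the subject towards the camera.
    d = (math.cos(el) * math.sin(az), math.sin(el), math.cos(el) * math.cos(az))
    position = tuple(s + frame.distance * c for s, c in zip(subject, d))
    forward = tuple(-c for c in d)
    return position, forward


# Example: a slow push-in while orbiting 90 degrees around the subject.
keys = [Keyframe(0.0, 4.0, 0.0, 0.0), Keyframe(2.0, 2.0, 90.0, 0.0)]
head = (0.0, 1.6, 0.0)
position, forward = camera_pose(head, interpolate(keys, 1.0))
```

Because the camera is defined relative to the subject, the same keyframes yield a different world-space trajectory whenever the subject moves; evaluating `camera_pose` per frame with the subject's current position keeps the framing attached to the actor.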

We trained and evaluated Auteur on a newly constructed dataset of 34,000 instances of aligned text, human motion, and DSL-annotated camera trajectories, compiled from procedural synthesis and real-world movie footage sourced from the CondensedMovies dataset. Auteur introduces cinematographic framing capabilities for human-centric scenes, a feature largely missing from previous generative models. To evaluate this capability rigorously, we developed new metrics that focus specifically on framing quality. Our experimental results demonstrate that Auteur consistently surpasses existing methods.

Source: arXiv