Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing
Title: Chameleon: A Framework for Disentangling Style and Content in Cross-Domain Object Compositing
Abstract:
Image compositing involves the seamless integration of a foreground subject into a background scene. While recent progress in diffusion models has markedly improved quality, particularly when both elements originate from the same domain (such as natural imagery), cross-domain compositing remains a significant challenge. This task requires the model to maintain the identity of the foreground object while simultaneously adapting its style to align with the background domain. Currently, this area is under-researched, and most existing solutions depend on training-free blending and refinement techniques. This reliance stems largely from the scarcity of large-scale paired datasets for cross-domain scenarios, which has hindered the creation of training-based alternatives. Consequently, prior methods are often restricted to tone-level adjustments, leading to results that are either stylistically inconsistent or excessively stylized.
To address these issues, we introduce ChameleonDataset, the first large-scale training dataset designed for cross-domain compositing, accompanied by a comprehensive evaluation benchmark. This resource was developed using a scalable data construction pipeline. Leveraging this dataset, we present Chameleon, a novel two-stage training-based framework for cross-domain compositing. The first stage employs Joint Hard Contrastive Learning (JHCL) to train the ChameleonEncoder, which separates style and content representations. In the second stage, we integrate Spatio-Temporal Attention Gating (STAG) into a diffusion transformer to facilitate effective stylization. This mechanism adaptively controls the injection of style tokens from the first-stage ChameleonEncoder across both spatial and temporal dimensions. Our approach surpasses current state-of-the-art models for both in-domain and cross-domain compositing, as well as sequential pipelines and commercial tools, in both compositional plausibility and stylistic fidelity.
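The abstract does not specify how STAG computes its gates, so the following is only an illustrative sketch of the general idea of gating style injection per spatial position and per denoising step. Every name here (`gate_values`, `inject_style`, `spatial_logits`), the sigmoid spatial gate, and the linear temporal ramp are assumptions made for illustration, not the authors' formulation.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_values(spatial_logits, step, num_steps):
    """Combine a per-position spatial gate with a per-step temporal gate.

    Assumption: the temporal gate is a linear ramp from 1 (first step)
    to 0 (last step). The real schedule is not given in the abstract.
    """
    temporal = 1.0 - step / num_steps
    return [sigmoid(logit) * temporal for logit in spatial_logits]


def inject_style(content_tokens, style_token, spatial_logits, step, num_steps):
    """Add a gated copy of the style token to each content token."""
    gates = gate_values(spatial_logits, step, num_steps)
    return [
        [c + g * s for c, s in zip(token, style_token)]
        for token, g in zip(content_tokens, gates)
    ]


if __name__ == "__main__":
    content = [[1.0, 0.0], [0.0, 1.0]]   # two spatial positions, dim 2
    style = [0.5, -0.5]                  # one hypothetical style token
    logits = [0.0, -50.0]                # second position is gated off
    print(inject_style(content, style, logits, step=0, num_steps=10))
```

In this toy setup the first position receives half of the style token at the first step, the second position is left essentially unchanged, and at the final step the gate closes everywhere so the content tokens pass through untouched.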
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





