arXiv

COLLAR: Cascaded Object-Level Latent Refinement for High-Fidelity Conditional Generation

June 2, 2026 · Xinlong Zhang, Jia Wei, Xiaoyu Zhang, Teng Zhou, Chengyu Lin, Yongchuan Tang

Abstract:

Even with structural priors such as Canny edges and depth maps, achieving precise, high-fidelity control over individual objects in Diffusion Transformers remains difficult. Existing methods for object-level conditional generation often produce visual artifacts and fail to manipulate confined local regions accurately. To address these limitations, we introduce Cascaded Object-Level Latent Refinement (COLLAR), a training-free framework that incrementally refines object-level features through Field-of-View (FoV) expansion.

Our approach begins with the Cross-Scale Semantic Alignment (CSSA) module, which bridges spatial-semantic discrepancies by incorporating object-level data into extended-FoV branches using attention mechanisms. To further refine these features, the Cyclic Feature Injection (CFI) module employs a reciprocal background feedback system. This module utilizes a frequency-driven adaptive strategy to selectively update the global backbone with local information that aligns with the broader context. Ultimately, the extended-FoV branch acts as a central hub for feature optimization, facilitating the seamless integration of object-level details into the overall generation process while preserving the integrity of the final image. Comprehensive evaluations on the COCO-MIG and COCO-POS benchmarks reveal that our method consistently surpasses state-of-the-art techniques in terms of semantic alignment, spatial fidelity, and overall image quality.
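The abstract does not specify how the frequency-driven update in CFI is computed, so the following is only an illustrative sketch of one way a frequency-gated injection could work, not the authors' method. It assumes a 1-D row of feature values, a moving-average filter as the low-pass split, and a hypothetical agreement rule: local high-frequency detail replaces the global detail only where the local and global low-frequency components differ by less than `tol`. The names `box_lowpass`, `inject_local_detail`, `radius`, and `tol` are all invented for this example.

```python
def box_lowpass(signal, radius=1):
    """Moving-average low-pass filter; windows are truncated at the edges."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def inject_local_detail(global_feat, local_feat, start, radius=1, tol=1.0):
    """Blend a local (object-level) feature row into a global one.

    Hypothetical rule, assumed for illustration only: inside the object
    window, keep the global low-frequency component and swap in the local
    high-frequency residual, but only where the two low-frequency
    components agree to within `tol`.
    """
    end = start + len(local_feat)
    if start < 0 or end > len(global_feat):
        raise ValueError("local window does not fit inside the global row")

    global_low = box_lowpass(global_feat, radius)
    local_low = box_lowpass(local_feat, radius)

    out = list(global_feat)  # positions outside the window stay untouched
    for k, idx in enumerate(range(start, end)):
        if abs(local_low[k] - global_low[idx]) < tol:
            local_high = local_feat[k] - local_low[k]
            out[idx] = global_low[idx] + local_high
    return out


if __name__ == "__main__":
    background = [1.0] * 8
    # Local detail consistent with the background is injected.
    print(inject_local_detail(background, [1.0, 2.0, 1.0], start=2))
    # Local detail that contradicts the background is rejected.
    print(inject_local_detail(background, [10.0, 11.0, 10.0], start=2))
```

In this toy version the gate is a hard threshold; a soft weight that decays with the low-frequency mismatch would be an equally plausible reading of "adaptive".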

Source: arXiv · Generated at: 2026-06-02 00:00:00 UTC