FlowOVD: Learning Generative Latent Flows for Zero-shot Open-vocabulary Detection
Abstract:
While open-vocabulary object detection (OVD) has seen significant advancements driven by large-scale vision-language pre-training, current approaches generally treat OVD as a discriminative prediction task. In these existing frameworks, decoder queries are either fixed or derived from encoder features, which restricts their variability and adaptability. To address these limitations, this study presents a generative approach that conceptualizes the creation of decoder queries as a continuous transport process within latent space.
We introduce FlowOVD, a framework for generating queries conditioned on text, utilizing rectified flow to gradually convert text-agnostic queries into those guided by textual input. By integrating continuous latent query dynamics into a detector built on a vision-language model (VLM), our method eliminates the need for heuristic discrete query construction. This integration facilitates more nuanced semantic alignment, thereby enhancing performance in open-vocabulary detection.
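As a rough illustration of the rectified-flow idea mentioned above, the following minimal sketch transports a set of source queries to target queries along straight-line paths. It is not the FlowOVD implementation: the array shapes, the names `interpolate`, `velocity_target`, `euler_transport`, and `oracle_velocity`, and the use of random vectors as stand-ins for text-agnostic and text-guided queries are all assumptions made for illustration. In a real system, a learned network conditioned on text would presumably replace the oracle velocity field.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (t=0) to x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1


def velocity_target(x0, x1):
    """Rectified-flow regression target: the constant velocity x1 - x0."""
    return x1 - x0


def euler_transport(x0, velocity_fn, num_steps=8):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x


rng = np.random.default_rng(0)
num_queries, dim = 4, 8  # hypothetical sizes, chosen only for the demo

# Random stand-ins (assumption): source = text-agnostic queries,
# target = text-guided queries.
x0 = rng.normal(size=(num_queries, dim))
x1 = rng.normal(size=(num_queries, dim))


def oracle_velocity(x, t):
    """Ideal velocity field for this single pairing; a trained,
    text-conditioned network would approximate it (assumption)."""
    return velocity_target(x0, x1)


x_mid = interpolate(x0, x1, 0.5)  # a training-time sample on the path
x_final = euler_transport(x0, oracle_velocity, num_steps=8)
```

Because the oracle velocity is constant along each path, Euler integration recovers the target queries up to floating-point error regardless of the step count; with a learned velocity field the result would only approximate the target.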
Notably, FlowOVD requires no additional training data yet delivers strong results, achieving 49.5 AP on the COCO dataset and 31.5 AP on LVIS. These figures surpass those of GroundingDINO by +1.2 AP (+2.5%) and +4.1 AP (+15.0%), respectively. The more substantial improvement observed on the difficult, long-tailed LVIS benchmark underscores the efficacy of continuous query generation for improving generalization in open-vocabulary contexts.
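The relative gains quoted above can be checked with a few lines of arithmetic. The sketch below assumes that each percentage is measured against the baseline score implied by subtracting the absolute gain from the FlowOVD score; the helper names are hypothetical, and the baseline values are derived from the figures in the abstract rather than quoted from it.

```python
def implied_baseline(score, abs_gain):
    """Baseline AP implied by a reported score and its absolute gain."""
    return score - abs_gain


def relative_gain_pct(score, abs_gain):
    """Relative improvement over the implied baseline, in percent."""
    return 100.0 * abs_gain / implied_baseline(score, abs_gain)


# Figures taken from the abstract: (FlowOVD AP, absolute gain in AP).
coco_pct = relative_gain_pct(49.5, 1.2)   # implied baseline of about 48.3 AP
lvis_pct = relative_gain_pct(31.5, 4.1)   # implied baseline of about 27.4 AP

print(f"COCO: +{coco_pct:.1f}%  LVIS: +{lvis_pct:.1f}%")
```

Under this assumption, the computed values round to the +2.5% and +15.0% reported in the abstract.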
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





