arXiv

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

Abstract:

Vision-language tasks such as Visual Question Answering (VQA), SNLI-VE, and Visual Commonsense Reasoning (VCR) are challenging because they require strong reasoning to understand the semantics of both visual scenes and natural language. Supervised methods for these tasks have been studied extensively, but solving them in a zero-shot setting remains underexplored. Because Contrastive Language-Image Pre-training (CLIP) performs well at zero-shot image-text matching, earlier studies exploited this strength by recasting vision-language tasks as image-text matching problems, mostly considering global-level matching (for example, the whole image against the whole sentence). However, our analysis shows that fine-grained information, such as keywords in the sentence and objects in the image, carries important semantic cues. Building on this observation, we propose a unified framework that exploits fine-grained information for zero-shot vision-language learning, covering VQA, SNLI-VE, and VCR. Our experiments show that the framework outperforms previous zero-shot methods on VQA and achieves substantial improvements on SNLI-VE and VCR. Ablation studies further confirm the effectiveness and generalizability of the proposed method.
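The abstract does not say how global and fine-grained matching scores are computed or combined, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes that embeddings for the whole image, image regions (objects), candidate sentences, and their keywords are already available, and it uses small hand-written vectors in place of real CLIP features. The function names (`cosine`, `score_candidate`, `pick_answer`), the weight `alpha`, and the max-over-regions pooling are all hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_candidate(image_vec, text_vec, region_vecs, keyword_vecs, alpha=0.5):
    """Blend a global image-sentence score with a fine-grained keyword-region score.

    alpha is a hypothetical mixing weight: 0 uses only the global score,
    1 uses only the fine-grained score.
    """
    global_score = cosine(image_vec, text_vec)
    if region_vecs and keyword_vecs:
        # For each keyword, take its best-matching region, then average over keywords.
        fine_score = sum(
            max(cosine(region, keyword) for region in region_vecs)
            for keyword in keyword_vecs
        ) / len(keyword_vecs)
    else:
        # Fall back to the global score when no fine-grained features exist.
        fine_score = global_score
    return (1.0 - alpha) * global_score + alpha * fine_score


def pick_answer(image_vec, region_vecs, candidates, alpha=0.5):
    """Return the best-scoring candidate label and the full score table.

    candidates is a list of (label, text_vec, keyword_vecs) tuples.
    """
    scores = {
        label: score_candidate(image_vec, text_vec, region_vecs, keyword_vecs, alpha)
        for label, text_vec, keyword_vecs in candidates
    }
    return max(scores, key=scores.get), scores


# Toy 2-D vectors standing in for real embeddings (assumption for illustration).
image_vec = [1.0, 0.0]
region_vecs = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    ("yes", [1.0, 0.0], [[1.0, 0.0]]),
    ("no", [0.0, 1.0], [[0.6, 0.8]]),
]
best, scores = pick_answer(image_vec, region_vecs, candidates)
print(best, scores)
```

In this toy setup, zero-shot prediction reduces to picking the candidate whose combined score is highest; a real system would obtain the vectors from a pretrained image-text encoder and would need its own way of selecting keywords and regions.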


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
