arXiv

Distortion-Aware Fusion of Statistical and Vision-Language Features for Blind Image Quality Assessment

Abstract:

Blind Image Quality Assessment (BIQA) estimates the perceived quality of an image without access to a reference. Classical Natural Scene Statistics (NSS) descriptors and contemporary Vision-Language Model (VLM) embeddings approach this problem from distinct theoretical angles, but it remains unclear whether combining them yields complementary gains, or how best to balance their influence for a given image. To address this, we introduce a distortion-aware fusion framework that combines a 138-dimensional NSS descriptor with two complementary VLM embeddings, SigLIP and CLIP-H. The fusion relies on a multiplicative gating mechanism that learns input-specific stream weights conditioned on the image content.
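To make the gating idea concrete, the following is a minimal sketch of multiplicative, input-conditioned stream gating. The abstract does not specify the form of the gate, so the sigmoid gate, the single linear gating layer, the function name `gated_fusion`, and the 4-dimensional stand-ins for the SigLIP and CLIP-H embeddings are all assumptions made for illustration; only the 138-dimensional NSS descriptor comes from the text.

```python
import math


def sigmoid(x):
    """Logistic function; maps a real-valued score to a gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(streams, gate_weights, gate_bias):
    """Scale each feature stream by an input-dependent gate, then concatenate.

    streams:      list of feature vectors, one per stream (e.g. NSS, SigLIP, CLIP-H).
    gate_weights: one weight vector per stream, each as long as the
                  concatenation of all streams (hypothetical linear gate).
    gate_bias:    one bias per stream.
    Returns (fused_vector, gates).
    """
    # The gate sees all streams at once, so each weight depends on the whole input.
    concat = [v for s in streams for v in s]
    gates = [
        sigmoid(sum(w * v for w, v in zip(w_row, concat)) + b)
        for w_row, b in zip(gate_weights, gate_bias)
    ]
    # Multiplicative gating: every element of a stream shares that stream's gate.
    fused = [g * v for g, s in zip(gates, streams) for v in s]
    return fused, gates


# Toy input: a 138-dim NSS vector plus two small placeholder embeddings.
nss = [0.1] * 138
siglip = [0.2] * 4   # placeholder dimensionality, not the real embedding size
clip_h = [0.3] * 4   # placeholder dimensionality, not the real embedding size
n_in = len(nss) + len(siglip) + len(clip_h)
weights = [[0.01] * n_in, [0.0] * n_in, [-0.01] * n_in]
fused, gates = gated_fusion([nss, siglip, clip_h], weights, [0.0, 0.0, 0.0])
```

In a trained model the gate parameters would be learned jointly with the quality regressor; here they are fixed by hand only to show how a gate near 0 suppresses a stream and a gate near 1 passes it through.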

Unlike static concatenation, our gating network adaptively suppresses or amplifies each stream’s contribution for each input image. The resulting weights exhibit a positive Spearman rank correlation (rho = 0.33) with the per-distortion NSS contributions observed in independent ablation studies on the KADID-10k dataset. Notably, the framework does not require end-to-end fine-tuning of the VLM backbones. Training uses a hybrid loss that combines mean squared error, Pearson linear correlation, and pairwise ranking objectives.
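A hybrid loss of this kind can be sketched as a weighted sum of three terms. The abstract names the components but not their exact form or weighting, so the `1 - PLCC` correlation term, the hinge-style pairwise ranking term, the `margin` parameter, and the unit weights below are assumptions rather than the authors' formulation.

```python
import math


def mse_loss(pred, target):
    """Mean squared error between predicted and ground-truth scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pearson(x, y):
    """Pearson linear correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denom = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / denom if denom > 0 else 0.0


def pairwise_rank_loss(pred, target, margin=0.0):
    """Hinge loss over all pairs whose ground-truth scores differ.

    A pair is penalized when the predicted ordering disagrees with the
    ground-truth ordering, or agrees by less than `margin`.
    """
    total, count = 0.0, 0
    for i in range(len(pred)):
        for j in range(i + 1, len(pred)):
            if target[i] == target[j]:
                continue  # tied ground truth carries no ordering information
            sign = 1.0 if target[i] > target[j] else -1.0
            total += max(0.0, margin - sign * (pred[i] - pred[j]))
            count += 1
    return total / count if count else 0.0


def hybrid_loss(pred, target, w_mse=1.0, w_plcc=1.0, w_rank=1.0, margin=0.0):
    """Weighted sum of MSE, (1 - PLCC), and pairwise ranking terms.

    The weights are illustrative placeholders, not values from the paper.
    """
    return (
        w_mse * mse_loss(pred, target)
        + w_plcc * (1.0 - pearson(pred, target))
        + w_rank * pairwise_rank_loss(pred, target, margin)
    )
```

The three terms are complementary: MSE anchors predictions to the score scale, the correlation term rewards linear agreement across a batch, and the ranking term rewards correct ordering even when absolute values are off.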

We benchmarked the proposed method on three standard datasets. On KonIQ-10k, it achieved an SROCC of 0.9142 and a PLCC of 0.9279. On KADID-10k, it reached an SROCC of 0.9715 and a PLCC of 0.9733, outperforming recent state-of-the-art approaches. On the LIVE Challenge in-the-Wild dataset, it achieved an SROCC of 0.8527 and a PLCC of 0.8802 after cross-dataset pretraining and fine-tuning. A detailed per-distortion analysis on KADID-10k indicates that NSS features are most effective for noise and color-shift distortions, where pixel statistics are directly affected, and least effective for perceptual changes such as color saturation. The learned gate values corroborate these results: the model identifies distortion-stream affinity patterns consistent with manual analysis without explicit supervision.
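For readers unfamiliar with the two reported metrics, the sketch below computes them from scratch: PLCC is the Pearson correlation between predicted and subjective scores, and SROCC is the Pearson correlation of their ranks, with tied values given averaged ranks. The function names are illustrative, and the sketch omits the nonlinear logistic mapping that IQA evaluations often apply before PLCC; whether the authors apply one is not stated in the abstract.

```python
import math


def plcc(x, y):
    """Pearson linear correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denom = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / denom if denom > 0 else 0.0


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srocc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


# A monotone but nonlinear relationship: ranking is perfect, linearity is not.
predicted = [1.0, 2.0, 3.0, 4.0]
subjective = [1.0, 4.0, 9.0, 16.0]
rank_agreement = srocc(predicted, subjective)
linear_agreement = plcc(predicted, subjective)
```

Because SROCC depends only on ordering, it is insensitive to any monotone rescaling of the predictions, whereas PLCC also penalizes departures from a linear relationship. The same rank correlation is what the reported rho of 0.33 between gate weights and per-distortion NSS contributions measures.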


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
