Distortion-Aware Fusion of Statistical and Vision-Language Features for Blind Image Quality Assessment
Abstract:
Blind Image Quality Assessment (BIQA) seeks to estimate the perceived quality of an image without access to a reference. Classical Natural Scene Statistics (NSS) descriptors and contemporary Vision-Language Model (VLM) embeddings approach this problem from distinct theoretical angles, but it remains unclear whether combining them yields complementary gains and how their contributions should be balanced for a given image. To address this, we introduce a distortion-aware fusion framework that combines a 138-dimensional NSS descriptor with two complementary VLM embeddings, SigLIP and CLIP-H. The fusion relies on a multiplicative gating mechanism that learns input-specific stream weights conditioned on the image content.
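As a rough illustration of multiplicative gating over feature streams, the following sketch scales each stream by an input-dependent scalar in (0, 1). The abstract does not specify the gate architecture, so everything here is an assumption made for illustration: the single linear gate layer, the sigmoid activation, the one-scalar-per-stream design, the toy 8-dimensional stand-ins for the VLM embeddings, and all names (`gated_fusion`, `gate_w`, `gate_b`).

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def linear(weights, bias, x):
    """Compute y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def gated_fusion(streams, gate_w, gate_b):
    """Scale each feature stream by an input-dependent gate, then concatenate.

    streams: list of feature vectors (e.g. NSS, SigLIP, CLIP-H features).
    gate_w, gate_b: parameters of a hypothetical linear gate with one
        output per stream (an assumption, not the paper's architecture).
    """
    gate_input = [v for s in streams for v in s]          # concatenate streams
    gates = [sigmoid(z) for z in linear(gate_w, gate_b, gate_input)]
    fused = [g * v for g, s in zip(gates, streams) for v in s]
    return fused, gates


# Toy usage: a 138-d NSS vector and two 8-d placeholders for VLM embeddings.
random.seed(0)
nss = [random.gauss(0.0, 1.0) for _ in range(138)]
siglip = [random.gauss(0.0, 1.0) for _ in range(8)]   # placeholder dimension
clip_h = [random.gauss(0.0, 1.0) for _ in range(8)]   # placeholder dimension
in_dim = len(nss) + len(siglip) + len(clip_h)
gate_w = [[random.gauss(0.0, 0.05) for _ in range(in_dim)] for _ in range(3)]
gate_b = [0.0, 0.0, 0.0]
fused, gates = gated_fusion([nss, siglip, clip_h], gate_w, gate_b)
```

With random, untrained gate parameters the three gate values carry no meaning; in a trained model they would be the per-image stream weights that are compared against the per-distortion ablation results.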
In contrast to static concatenation methods, our gating network adaptively suppresses or amplifies each stream’s contribution based on the input image. The resulting weights exhibit a positive Spearman rank correlation (rho=0.33) with the per-distortion NSS contributions observed in independent ablation studies on the KADID-10k dataset. Notably, the framework operates without requiring end-to-end fine-tuning of the VLM backbones. Training employs a hybrid loss function that combines mean squared error, Pearson linear correlation, and pairwise ranking objectives.
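One plausible form of such a hybrid objective is a weighted sum of an MSE term, a correlation term (1 - PLCC), and a pairwise hinge ranking term. The sketch below is an assumption about how the three parts could be combined: the abstract gives neither the weights nor the exact ranking formulation, so `w_mse`, `w_plcc`, `w_rank`, and `margin` are hypothetical parameters. It computes the forward value only; actual training would use an automatic differentiation framework.

```python
import math


def mse(pred, target):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pearson(x, y):
    """Pearson linear correlation; a small epsilon guards against zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy + 1e-12)


def pairwise_rank_loss(pred, target, margin=0.0):
    """Mean hinge penalty over all pairs whose predicted order is violated."""
    total, count = 0.0, 0
    for i in range(len(pred)):
        for j in range(i + 1, len(pred)):
            if target[i] == target[j]:
                continue  # ties carry no ordering information
            sign = 1.0 if target[i] > target[j] else -1.0
            total += max(0.0, margin - sign * (pred[i] - pred[j]))
            count += 1
    return total / count if count else 0.0


def hybrid_loss(pred, target, w_mse=1.0, w_plcc=1.0, w_rank=1.0, margin=0.0):
    """Weighted sum of MSE, (1 - PLCC), and pairwise ranking terms."""
    return (w_mse * mse(pred, target)
            + w_plcc * (1.0 - pearson(pred, target))
            + w_rank * pairwise_rank_loss(pred, target, margin))
```

The three terms target different properties: MSE anchors the predictions to the score scale, the correlation term rewards linear agreement across a batch, and the ranking term penalizes pairs of images whose predicted order contradicts the ground truth.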
We benchmarked the proposed method on three standard datasets. On KonIQ-10k, it achieved an SROCC of 0.9142 and a PLCC of 0.9279. On KADID-10k, it reached an SROCC of 0.9715 and a PLCC of 0.9733, outperforming recent state-of-the-art approaches. On the LIVE Challenge in-the-Wild dataset, where we used cross-dataset pretraining and fine-tuning, it reached an SROCC of 0.8527 and a PLCC of 0.8802. A detailed per-distortion analysis on KADID-10k indicates that NSS features are most effective for noise and color-shift distortions, where pixel statistics are directly affected, and least effective for perceptual changes such as color saturation. The learned gate values corroborate these findings: the model autonomously identifies distortion-stream affinity patterns consistent with the manual analysis.
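For reference, the two reported metrics can be computed as follows. This is a minimal standard-library sketch, not the evaluation code behind the numbers above: PLCC is the Pearson correlation between predicted and subjective scores, and SROCC is the Pearson correlation of their ranks, with tied values sharing an average rank. IQA evaluations sometimes apply a nonlinear mapping to predictions before computing PLCC; the abstract does not say whether one is used, so none is applied here. Returning 0.0 for constant inputs is a convention chosen for this sketch.

```python
import math


def plcc(x, y):
    """Pearson linear correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # correlation is undefined for constant inputs
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def srocc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))
```

SROCC depends only on the ordering of the predictions, so any monotonic rescaling of the model output leaves it unchanged, whereas PLCC also reflects how linearly the predictions track the subjective scores.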
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





