LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models
Title: LL-Bench: Reevaluating Low-Level Vision Assessment Amidst the Rise of Large-Scale Generative Models
Abstract:
While large-scale generative models have achieved significant success in image generation and editing, their proficiency in low-level vision tasks, which demand precise pixel-level control, has not been thoroughly examined. To fill this gap, we present LL-Bench, a Benchmark designed to assess the capabilities of large-scale generative models on Low-Level vision tasks.
LL-Bench features a dataset of 2,469 real-world images exhibiting various degradations across 16 distinct low-level tasks. It also includes 28,919 restored images generated by 10 leading large-scale generative models alongside 21 traditional restoration algorithms. This dataset is enriched with 152,020 expert-annotated pairwise human preferences and 28,334 quality scores.
Leveraging LL-Bench, we conduct a systematic analysis to delineate the performance limits and specific failure patterns of large-scale generative models in low-level vision scenarios, contrasting them with standard restoration techniques. Our investigation into current quality evaluation metrics on LL-Bench highlights a substantial misalignment between these automated metrics and human judgments.
To bridge this gap and better align quality assessment with human preference, we introduce LL-Score, an evaluator based on Multimodal Large Language Models (MLLMs). LL-Score is designed to account for both the fidelity of the restoration and the presence of hallucinations. Comprehensive experiments confirm that LL-Score surpasses existing image quality assessment metrics and emerges as a viable reward model for training generative models on low-level vision tasks.
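As an illustration of what alignment between an automated metric and pairwise human preferences can mean, the following minimal sketch computes the fraction of human-judged pairs in which a metric scores the preferred image higher. The abstract does not specify the evaluation protocol, so this is an assumed, commonly used measure and not necessarily the one used for LL-Bench or LL-Score. The function name, image identifiers, and scores are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def pairwise_agreement(
    scores: Dict[str, float],
    preferences: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of human preference pairs that a metric ranks the same way.

    scores: hypothetical mapping from image id to metric score (higher = better).
    preferences: (preferred_id, rejected_id) pairs from human annotators.
    Ties in metric score count as disagreement in this sketch.
    """
    agree = 0
    total = 0
    for preferred, rejected in preferences:
        total += 1
        if scores[preferred] > scores[rejected]:
            agree += 1
    if total == 0:
        raise ValueError("no preference pairs given")
    return agree / total


# Hypothetical example: three restored images and three human judgments.
example_scores = {"a": 0.9, "b": 0.4, "c": 0.7}
example_prefs = [("a", "b"), ("c", "a"), ("c", "b")]
agreement = pairwise_agreement(example_scores, example_prefs)
print(f"pairwise agreement: {agreement:.3f}")
```

A value near 0.5 would indicate that the metric does little better than chance on these pairs, while a value near 1.0 would indicate close agreement with the annotators.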
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





