WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts
Title: WebRISE: Evaluating MLLM-Generated Web Artifacts Through Requirement-Induced States
Abstract
Current benchmarks for evaluating web artifacts generated by Multimodal Large Language Models (MLLMs) rely heavily on local evidence to assess interaction, often overlooking the requirement-induced states and transitions that are critical for determining a page’s functionality. To address this gap, we present WebRISE, a framework that transforms task requirements into Interaction Contract Graphs (ICGs). These graphs map observable states, user-intent transitions, and DOM/visual assertions, enabling implementation-agnostic execution in browsers.
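To make the ICG idea concrete, the following is a minimal Python sketch, not the WebRISE implementation. It assumes an ICG can be reduced to named states plus transitions that each carry a user intent, an action, and post-action assertions. The names `Transition`, `InteractionContractGraph`, `run_transition`, and `build_cart_icg`, and the dictionary used as a stand-in for a browser page, are all hypothetical; in a real harness the action and assertions would presumably be bound to a live DOM through a browser driver.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Assumption: a plain dict stands in for the observable DOM/visual state of a page.
Page = Dict[str, Any]


@dataclass
class Transition:
    source: str                       # observable state before the action
    target: str                       # observable state expected afterwards
    intent: str                       # user intent, stated without selectors
    act: Callable[[Page], None]       # implementation-specific way to perform the intent
    assertions: List[Callable[[Page], bool]] = field(default_factory=list)


@dataclass
class InteractionContractGraph:
    states: List[str]
    transitions: List[Transition]


def run_transition(page: Page, t: Transition) -> bool:
    """Perform the action, then require every assertion to hold on the page."""
    t.act(page)
    return all(check(page) for check in t.assertions)


def build_cart_icg(add_item: Callable[[Page], None]) -> InteractionContractGraph:
    """Hypothetical one-transition contract: adding an item fills the cart."""
    return InteractionContractGraph(
        states=["cart_empty", "cart_has_item"],
        transitions=[
            Transition(
                source="cart_empty",
                target="cart_has_item",
                intent="add the first product to the cart",
                act=add_item,
                assertions=[
                    lambda p: p["cart_count"] == 1,   # DOM-style assertion
                    lambda p: p["badge_visible"],     # visual-style assertion
                ],
            )
        ],
    )
```

Because the contract names only the intent and the expected observations, the same graph can be run against different implementations by swapping the `add_item` callable; an implementation that updates the count but never shows the badge would fail the transition.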
WebRISE comprises 442 tasks spanning five input modalities: Text, Markdown, Sketch, Image, and Video. The dataset includes 5,495 transitions and 5,271 requirement checks designed to distinguish user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest models reach only 65.6% transition validity and 66.3% requirement coverage. Visual quality also does not guarantee behavioral correctness: Qwen3.6-35B-A3B scores 80.8 for visual quality on Markdown inputs but only 15.5 for transition validity.
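The two headline metrics can be read as pass rates. The sketch below is one plausible reading, assuming transition validity is the share of ICG transitions whose assertions hold and requirement coverage is the share of requirement checks satisfied, optionally restricted to explicit (user-stated) or implicit (product-level) checks. The exact definitions, and the names `CheckResult`, `rate`, `transition_validity`, and `requirement_coverage`, are assumptions for illustration rather than the paper's formulas.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class CheckResult:
    kind: str      # assumed labels: "explicit" (user-stated) or "implicit" (product-level)
    passed: bool


def rate(flags: Iterable[bool]) -> float:
    """Percentage of True values; 0.0 when there is nothing to score."""
    values = list(flags)
    return 100.0 * sum(values) / len(values) if values else 0.0


def transition_validity(transition_passed: List[bool]) -> float:
    """Assumed definition: share of transitions whose assertions all held."""
    return rate(transition_passed)


def requirement_coverage(checks: List[CheckResult], kind: Optional[str] = None) -> float:
    """Assumed definition: share of requirement checks satisfied, optionally by kind."""
    return rate(c.passed for c in checks if kind is None or c.kind == kind)
```

Scoring the explicit and implicit checks separately is what lets a comparison such as "Video versus Text on implicit coverage" be stated as a difference in percentage points.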
Among the modalities, Video provided the strongest interaction signal, improving implicit coverage by 10.6 percentage points over Text. However, implicit constraints remain a persistent challenge. Defect injection experiments demonstrate that ICG-based scoring identifies state errors at a rate 2 to 16 times higher than traditional checkpoint-style evaluation methods.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



