Global News Digest

arXiv

Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation

Abstract:

As front-end web code emerges as a primary showcase for every new large language model (LLM) release, assessing these interactive applications during development proves prohibitively expensive. Traditional human-judged leaderboards, such as Arena, lack the scalability required for this task. Meanwhile, current automated evaluation methods often rely on reference implementations, predefined test suites, or inflexible checklists, failing to capture the nuanced, reasoned synthesis that human reviewers apply during live sessions.

To address this gap, we introduce a novel evaluation framework that is reference-free, autonomous, and capable of holistic reasoning. We implement this framework through two key contributions:

  1. Benchmark: 1,000 WebDev queries spanning 11 domains and 54 leaf nodes. The dataset covers both static presentation and interactive application tasks, balanced across three difficulty levels and three target language groups. To keep models from relying on memorized responses, the task briefs have been rewritten to resist recall of previously circulated prompts.

  2. Evaluation framework: Inspired by Flavell’s metacognitive monitoring, the framework decouples evidence gathering from final judgment across three distinct phases:

    • Static Perception: Forms an initial impression through passive observation.
    • Agent-Driven Interaction: Autonomously explores the application while recording continuous screen video, audio, and per-step screenshots.
    • Dynamic Scoring: Once the entire evidence chain is established, issues holistic verdicts on functionality and aesthetics, with structured attribution for any failures.
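
The three phases above can be sketched in Python as a pipeline in which all evidence is collected before any score is issued. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (Evidence, Verdict, static_perception, agent_interaction, dynamic_scoring, evaluate, toy_judge) is hypothetical, and the browser agent and the judging model are replaced by trivial stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    """Everything gathered before a verdict is issued (hypothetical structure)."""
    first_impression: str = ""
    interaction_log: List[str] = field(default_factory=list)


@dataclass
class Verdict:
    """Holistic scores plus failure attribution (hypothetical structure)."""
    functionality: float
    aesthetics: float
    failure_notes: List[str]


def static_perception(page: str) -> str:
    # Stand-in for passive observation; a real system would inspect a rendering.
    return f"static view of {len(page)} characters of markup"


def agent_interaction(page: str, actions: List[str]) -> List[str]:
    # Stand-in for autonomous exploration; a real agent would drive a browser
    # and record video, audio, and per-step screenshots.
    return [f"step {i}: {action}" for i, action in enumerate(actions, start=1)]


def dynamic_scoring(evidence: Evidence, judge: Callable[[Evidence], Verdict]) -> Verdict:
    # Judgment runs only after the full evidence chain exists.
    return judge(evidence)


def evaluate(page: str, actions: List[str], judge: Callable[[Evidence], Verdict]) -> Verdict:
    evidence = Evidence()
    evidence.first_impression = static_perception(page)
    evidence.interaction_log = agent_interaction(page, actions)
    return dynamic_scoring(evidence, judge)


def toy_judge(evidence: Evidence) -> Verdict:
    # Toy rule for illustration only: a step counts as failed if it says "fail".
    failures = [step for step in evidence.interaction_log if "fail" in step]
    total = len(evidence.interaction_log)
    functionality = 1.0 if total == 0 else 1 - len(failures) / total
    return Verdict(functionality=functionality, aesthetics=0.5, failure_notes=failures)


verdict = evaluate("<button>ok</button>", ["click button", "submit fail"], toy_judge)
print(verdict)
```

Passing the judge in as a callable keeps evidence gathering independent of how the final verdict is produced, which is the decoupling the abstract describes.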

Evaluations on the benchmark show that the framework's scores correlate strongly with expert human ratings. They also reveal significant performance gaps among 13 leading LLMs in interactive web generation.
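
The abstract does not say which statistic is used to measure agreement with human ratings. As one common choice, and purely as an assumption, the sketch below computes Spearman rank correlation between automated scores and human scores in plain Python. The function names and the sample scores are illustrative and do not come from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative scores only; not data from the paper.
automated_scores = [0.2, 0.5, 0.9, 0.7]
human_scores = [1.0, 2.0, 5.0, 4.0]
print(spearman(automated_scores, human_scores))
```

A rank-based statistic is a natural fit when automated and human scores use different scales, since only the ordering of the generated applications matters.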

https://anonymous.4open.science/r/Cookie-3CE/


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
