arXiv

Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation

June 2, 2026 · Haoyue Yang, Zhangxiao Shen, Fan Ding, Hangting Lou, Yifeng Kou, Haoqing Yu, Jingyao Li, Zhengfan Wu, Siqi Bao, Jing Liu, Hua Wu

Abstract:

Front-end web code has become a primary showcase for every new large language model (LLM) release, yet assessing these interactive applications during development remains prohibitively expensive. Human-judged leaderboards, such as Arena, do not scale to this task. Meanwhile, current automated evaluation methods often rely on reference implementations, predefined test suites, or inflexible checklists, and so fail to capture the nuanced, reasoned synthesis that human reviewers apply during live sessions.

To address this gap, we introduce a novel evaluation framework that is reference-free, autonomous, and capable of holistic reasoning. We implement this framework through two key contributions:

\textbf{\dataname}: A WebDev benchmark comprising 1,000 queries across 54 leaf nodes and 11 domains. The dataset covers both static presentation and interactive application tasks, balanced across three difficulty levels and three target language groups. To keep models from answering with memorized responses, the task briefs have been rewritten to resist recall of previously circulated prompts.
\textbf{\framename}: Inspired by Flavell’s metacognitive monitoring, this system decouples evidence gathering from final judgment across three distinct phases:
- Static Perception: Generates an initial impression through passive observation.
- Agent-Driven Interaction: Autonomously explores the application while recording continuous screen video, audio, and per-step screenshots.
- Dynamic Scoring: Once the entire evidence chain is established, this stage issues holistic verdicts on functionality and aesthetics, providing structured attribution for any failures.
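
The separation of evidence gathering from judgment described in the three phases above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Evidence`, `Verdict`, `evaluate`, and `toy_judge` are hypothetical, the interaction log is a plain list of strings standing in for the recorded video, audio, and screenshots, and the scoring rule is a placeholder chosen only to make the example runnable.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    """Everything collected before any verdict is issued (hypothetical container)."""
    first_impression: str = ""
    interaction_log: List[str] = field(default_factory=list)


@dataclass
class Verdict:
    """Holistic scores plus structured attribution of failures (hypothetical shape)."""
    functionality: float
    aesthetics: float
    failure_attribution: List[str]


def static_perception(page_source: str, evidence: Evidence) -> None:
    # Phase 1: passive observation only; nothing is clicked or typed.
    evidence.first_impression = f"page source with {len(page_source)} characters"


def agent_interaction(actions: List[str], evidence: Evidence) -> None:
    # Phase 2: stand-in for the agent loop. Each step is appended to the log;
    # the described system would also record video, audio, and screenshots.
    for step, action in enumerate(actions, start=1):
        evidence.interaction_log.append(f"step {step}: {action}")


def dynamic_scoring(evidence: Evidence, judge: Callable[[Evidence], Verdict]) -> Verdict:
    # Phase 3: the judge runs once, after the whole evidence chain exists.
    return judge(evidence)


def evaluate(page_source: str, actions: List[str],
             judge: Callable[[Evidence], Verdict]) -> Verdict:
    evidence = Evidence()
    static_perception(page_source, evidence)
    agent_interaction(actions, evidence)
    return dynamic_scoring(evidence, judge)


def toy_judge(evidence: Evidence) -> Verdict:
    # Placeholder rule (an assumption): any logged step containing "error"
    # counts as a failure; aesthetics is fixed because nothing visual is modeled.
    failures = [entry for entry in evidence.interaction_log if "error" in entry]
    total = max(len(evidence.interaction_log), 1)
    return Verdict(
        functionality=1.0 - len(failures) / total,
        aesthetics=0.5,
        failure_attribution=failures,
    )
```

Calling `evaluate("<html></html>", ["click button", "error: modal did not open"], toy_judge)` runs the three phases in order. The design point is that `toy_judge` never sees partial evidence: scoring is a single call over the completed `Evidence` object.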

Evaluations on \textbf{\dataname} demonstrate that \textbf{\framename} correlates strongly with expert human ratings. Furthermore, it reveals significant performance gaps among 13 leading LLMs on interactive web generation.
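
The abstract does not state which statistic is used to measure agreement with human ratings. As one common choice for comparing an automated scorer against expert scores, the sketch below computes a Spearman rank correlation in plain Python. The function names are hypothetical, and the ratings used with it should be treated as illustrative inputs, not results from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Passing the automated scores and the expert scores for the same set of generated pages to `spearman` yields a value in [-1, 1], where values near 1 mean the two rankings largely agree.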

\noindent https://anonymous.4open.science/r/Cookie-3CE/
