arXiv

Benchmarking Multimodal LLMs on Code Generation for Complex Interactive Webpages

June 2, 2026 · Fan Wu, Lishuai Dong, Cuiyun Gao, Yujia Chen, Yiming Huang, Yang Xiao, Qing Liao

Title: Evaluating Multimodal LLMs on Code Generation for Complex Interactive Webpages

Abstract:

The rapid evolution of multimodal large language models (MLLMs) has driven significant advances in multimodal reasoning and code synthesis, with substantial implications for front-end engineering. Notably, these models can convert visual layouts directly into functional code, improving both the speed and flexibility of web development workflows. However, modern web applications are dynamic and involve intricate user-page interactions, a complexity that current evaluation frameworks fail to capture adequately. Most existing benchmarks focus on static page generation and neglect the sophisticated interactive behaviors of real-world applications. Furthermore, standard evaluation metrics are typically restricted to visual accuracy and code architecture, disregarding whether interactions behave consistently between generated outputs and reference designs.

To bridge these gaps, we present WebIGBench, the first benchmark designed specifically to assess code generation for interactive webpages with complex user interactions. By combining manually curated interaction paths with UI automation, we compiled a dataset of 103 complex webpages sourced from live websites. The benchmark covers five common categories of interactive actions, such as clicking and inputting, totaling 871 distinct interactive events. We also introduce a novel evaluation pipeline that automates the assessment of these interactive behaviors. Comprehensive experiments with several representative MLLMs reveal the current performance limits of these models in generating code for interactive webpages. The WebIGBench benchmark is publicly accessible at https://github.com/anoa12159-hue/WebIGBench_eval.
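
As an illustration of what "interaction consistency" could mean in practice, the following is a minimal sketch and not the WebIGBench pipeline. It assumes that an interaction path is a list of events and that each event's outcome can be summarized by a state label. The names InteractionEvent and interaction_consistency, the element selectors, and the state labels are all hypothetical; only the "click" and "input" action types come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractionEvent:
    """One step of a hypothetical interaction path (illustrative only)."""
    action: str    # "click" or "input", the two actions the abstract names
    target: str    # hypothetical element selector
    expected: str  # hypothetical label for the state seen on the reference page


def interaction_consistency(events, observed):
    """Return the fraction of events whose observed state matches the reference.

    `observed` maps an event index to the state label seen on the generated
    page; an index that is missing counts as a mismatch.
    """
    if not events:
        return 0.0
    matches = sum(
        1 for i, event in enumerate(events) if observed.get(i) == event.expected
    )
    return matches / len(events)


# Hypothetical path of four events on a reference page.
path = [
    InteractionEvent("click", "#menu-button", "menu-open"),
    InteractionEvent("input", "#search-box", "suggestions-shown"),
    InteractionEvent("click", "#submit", "results-shown"),
    InteractionEvent("click", "#close", "menu-closed"),
]

# Hypothetical states observed when replaying the path on a generated page:
# two events match, one differs, and one could not be replayed at all.
observed_states = {0: "menu-open", 1: "suggestions-shown", 2: "error"}

score = interaction_consistency(path, observed_states)
print(f"{score:.2f}")  # prints 0.50
```

In this sketch a score of 1.0 would mean every replayed event produced the same state as the reference page. A real pipeline would need some way to derive comparable states from a live browser session, which this example deliberately leaves out.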
