The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue
Title: The Image Reconstruction Game: Establishing Shared Understanding via Iterative Multimodal Conversation
Abstract: This study presents the Image Reconstruction Game, an automated benchmarking framework in which vision-language models provide corrective feedback to image generators over several interaction rounds. This process makes the development of shared context directly visible in the final output. By evaluating two Describer models against two Generator models across seven distinct image categories, we find that the Describer's capabilities are the primary driver of reconstruction fidelity, whereas the Generator determines whether iterative refinement helps or harms the result. Geometric and mathematical images are the hardest categories to reconstruct. The Describer's token budget critically influences convergence: limited budgets produce initial renderings with more room for visible improvement, whereas higher budgets improve initial quality but leave less scope for further correction. High-performing Describers employ a diverse vocabulary covering spatial, numerical, and structural elements, while less capable models focus on superficial traits and typically end the interaction early. Furthermore, human evaluation reveals that the top-performing automated judge shows only slight-to-fair agreement with human preferences, indicating that automated metrics require human calibration before they can be applied reliably.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




