arXiv

Business Utility of Large Language Models as Exploratory Data Analysis Agents

June 2, 2026 · Rafał Łabędzki, Patryk Miziuła, Hubert Rutkowski, Szymon Betlewski, Cezary Depta, Szymon Janowski, Jarosław Kochanowicz, Jan Kanty Milczek

Title: Leveraging Large Language Models as Exploratory Data Analysis Agents for Business Applications

Abstract

While the integration of Large Language Models (LLMs) into analytical workflows is on the rise, their effectiveness as exploratory data analysis (EDA) agents within commercial environments has yet to be fully established. For an EDA agent to be viable in practice, it must deliver not just competent average performance, but also a high degree of repeatability to ensure stakeholders can trust its findings. To test this premise, we utilized a controlled benchmark grounded in a business-oriented supply chain simulation. The primary objective was to pinpoint supplier-product pairs linked to poor quality and subsequent sales declines, requiring the models to deduce these issues from indirect operational data rather than relying on explicit labels.

We assessed fifteen configurations spanning eight distinct model families across four experimental scenarios. These scenarios manipulated data representation, the clarity of prompts, and signal strength, with five trajectories recorded for each condition. Model outputs were evaluated against deterministic ground truth using the Jaccard index. Furthermore, we applied a comprehensive assessment framework that integrates mean score (ms), coefficient of variation (CV), exploratory cross-condition significance tests, and a novel metric called "Business utility." This proposed metric consolidates both quality and repeatability into a single operational measure adjusted for risk.
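
As a rough illustration of the scoring described above, the following Python sketch computes the Jaccard index of each trajectory against a ground-truth set of supplier-product pairs, then the mean score (ms) and coefficient of variation (CV) across five trajectories. The pair identifiers, the example runs, and the use of the sample standard deviation for CV are assumptions made for illustration only; they are not taken from the paper. The Business utility metric is not sketched because its formula is not given in this abstract.

```python
from statistics import mean, stdev


def jaccard(predicted, truth):
    """Jaccard index |A ∩ B| / |A ∪ B| between two collections of pairs."""
    predicted, truth = set(predicted), set(truth)
    union = predicted | truth
    if not union:
        # Both sets empty: treat as perfect agreement (a convention, assumed here).
        return 1.0
    return len(predicted & truth) / len(union)


def summarize(scores):
    """Return (mean score, coefficient of variation) for one condition.

    CV is computed as sample standard deviation divided by the mean;
    whether the paper uses the sample or population form is not stated here.
    """
    ms = mean(scores)
    cv = stdev(scores) / ms if ms > 0 else float("inf")
    return ms, cv


# Hypothetical ground truth: (supplier, product) pairs with poor quality.
truth = {("S1", "P1"), ("S2", "P3"), ("S4", "P2")}

# Hypothetical outputs of five trajectories for a single configuration.
runs = [
    {("S1", "P1"), ("S2", "P3"), ("S4", "P2")},                # exact match
    {("S1", "P1"), ("S2", "P3")},                              # one pair missed
    {("S1", "P1"), ("S2", "P3"), ("S4", "P2"), ("S5", "P5")},  # one false positive
    {("S1", "P1")},                                            # two pairs missed
    {("S1", "P1"), ("S2", "P3"), ("S4", "P2")},                # exact match
]

scores = [jaccard(run, truth) for run in runs]
ms, cv = summarize(scores)
print(f"scores={scores}")
print(f"ms={ms:.4f}, CV={cv:.4f}")
```

Two of the five hypothetical runs match the ground truth exactly, yet the weakest run pulls the mean down and raises the CV. This is the kind of gap between best-case output and repeatability that the assessment framework is meant to expose.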

The study reveals that the majority of configurations lack the reliability necessary for autonomous EDA tasks, despite some showing acceptable average scores. The top performer was GPT-5.4 operating with extra-high reasoning effort, which recorded an experiment-averaged ms of 0.8748 and a Business utility score of 0.6952. In contrast, the next-best configurations suffered significant utility reductions when variability was accounted for. These results indicate that assessing the trustworthiness of EDA agents requires viewing average quality, repeatability, and sensitivity to conditions as interdependent dimensions of operational reliability.
