arXiv

CauTion: Knowing When to Trust LLMs for Ensemble Causal Discovery

June 3, 2026 · Bo Peng, Kaiwen Wu, Sirui Chen, Zhiheng Wang, Yu Qiao, Chaochao Lu

Title: CauTion: Navigating Trust in Large Language Models for Ensemble Causal Discovery

Abstract

Recovering causal structure from observational data is inherently difficult because purely statistical approaches face two fundamental limits: they cannot distinguish among graphs within the same equivalence class, and they are highly sensitive to small sample sizes. Large Language Models (LLMs) offer a way to bring domain knowledge to bear on statistical inference, but current LLM-integrated methods are prone to errors introduced by the models themselves and incur significant token costs. In addition, relying on a single data-driven algorithm leaves the result exposed to that algorithm's particular biases.

To overcome these challenges, we introduce CauTion, a novel framework designed to robustly embed LLM-derived domain knowledge into an ensemble of statistical causal discovery methods. This integration is achieved through consensus filtering and the estimation of LLM reliability. The CauTion process unfolds across three distinct phases:

Consensus Filtering: An ensemble of algorithms uses consensus voting to resolve the edges on which its members agree, which covers up to 96% of edges. The edges retained at this step are near-perfectly accurate.
Trust-Calibrated Arbitration: An annotation-free trust calibration procedure estimates the relative reliability of the LLM and of the statistical algorithms. These estimates drive a trust-weighted vote that restricts LLM intervention to edges where the algorithmic evidence is deemed unreliable.
Cycle Repair: A final repair step guarantees that the resulting causal graph is acyclic and therefore structurally valid.
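
The abstract does not say how each phase is implemented, so the following Python sketch only illustrates the general shape of such a pipeline and is not the authors' code. It rests on several assumptions: consensus means unanimity among the algorithms, the trust weights are supplied as inputs rather than estimated by the annotation-free calibration procedure, and cycles are repaired by deleting the lowest-confidence edge on each cycle. All function names, algorithm names, and numbers are hypothetical.

```python
from collections import defaultdict

NONE, FWD, BWD = "none", "->", "<-"  # possible labels for a node pair (a, b)


def consensus_filter(votes):
    """Phase 1 (assumed rule: unanimity). Split pairs into resolved and contested."""
    resolved, contested = {}, {}
    for pair, by_algo in votes.items():
        labels = set(by_algo.values())
        if len(labels) == 1:
            resolved[pair] = labels.pop()
        else:
            contested[pair] = by_algo
    return resolved, contested


def arbitrate(by_algo, llm_label, trust):
    """Phase 2. Trust-weighted vote over the algorithms plus the LLM.

    Returns the winning label and its share of the total trust mass.
    The trust weights are assumed to be given; estimating them is out of scope.
    """
    scores = defaultdict(float)
    for name, label in by_algo.items():
        scores[label] += trust[name]
    scores[llm_label] += trust["llm"]
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())


def to_edges(decisions):
    """Turn {pair: (label, confidence)} into {(src, dst): confidence}."""
    edges = {}
    for (a, b), (label, conf) in decisions.items():
        if label == FWD:
            edges[(a, b)] = conf
        elif label == BWD:
            edges[(b, a)] = conf
    return edges


def find_cycle(edges):
    """Return the edges of one directed cycle, or None if the graph is acyclic."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    state, path = {}, []  # state: 1 = on current DFS path, 2 = fully explored

    def dfs(node):
        state[node] = 1
        path.append(node)
        for nxt in graph[node]:
            if state.get(nxt) == 1:  # back edge closes a cycle
                cyc = path[path.index(nxt):] + [nxt]
                return list(zip(cyc, cyc[1:]))
            if nxt not in state:
                found = dfs(nxt)
                if found:
                    return found
        path.pop()
        state[node] = 2
        return None

    for node in list(graph):
        if node not in state:
            found = dfs(node)
            if found:
                return found
    return None


def repair_cycles(edges):
    """Phase 3 (assumed rule). Drop the weakest edge on each cycle until acyclic."""
    edges = dict(edges)
    while True:
        cycle = find_cycle(edges)
        if cycle is None:
            return edges
        del edges[min(cycle, key=lambda e: edges[e])]


def run_pipeline(votes, llm_labels, trust):
    resolved, contested = consensus_filter(votes)
    decisions = {pair: (label, 1.0) for pair, label in resolved.items()}
    for pair, by_algo in contested.items():  # the LLM is consulted only here
        decisions[pair] = arbitrate(by_algo, llm_labels[pair], trust)
    return repair_cycles(to_edges(decisions))


# Toy input: three hypothetical algorithms vote on three node pairs.
votes = {
    ("A", "B"): {"algo1": FWD, "algo2": FWD, "algo3": FWD},
    ("B", "C"): {"algo1": FWD, "algo2": FWD, "algo3": FWD},
    ("A", "C"): {"algo1": BWD, "algo2": FWD, "algo3": NONE},
}
llm_labels = {("A", "C"): BWD}
trust = {"algo1": 0.5, "algo2": 0.6, "algo3": 0.4, "llm": 0.7}

dag = run_pipeline(votes, llm_labels, trust)
print(sorted(dag))  # → [('A', 'B'), ('B', 'C')]
```

In this toy run the two unanimous pairs are settled without consulting the LLM. The contested pair is arbitrated to C -> A, which closes a cycle with the consensus edges, so the repair step removes it as the lowest-confidence edge on that cycle.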

Empirical evaluations across six datasets show that CauTion consistently surpasses both data-centric and LLM-augmented baseline methods. The performance improvements are particularly notable on larger graphs, and the framework demonstrates strong resilience against LLM inaccuracies. The source code for this framework is publicly accessible at https://github.com/OpenCausaLab/CauTion.

Source: arXiv · Generated at: 2026-06-03 00:00:00 UTC