Plausibility Is Not Prediction: Contrastive Evidence for LLM-Based Cellular Perturbation Reasoning
Title: Plausibility Does Not Equal Prediction: Evidence Against LLM-Based Cellular Perturbation Reasoning
Abstract:
While perturbation experiments are fundamental to deciphering cellular mechanisms, their high cost and scarcity have spurred interest in predicting gene expression responses for unobserved conditions. Recently, large language models (LLMs) have emerged as a promising approach, functioning as "virtual cell" simulators. By employing stepwise, knowledge-grounded mechanistic reasoning to infer differential expression, these methods aim to establish an interpretable, knowledge-driven paradigm that moves beyond purely data-driven techniques.
However, our analysis shows that plausibility is not prediction. Although the LLM-generated explanations are biologically plausible, they fail to capture perturbation-specific effects. The models systematically overestimate differential expression, often perform worse than a simple gene-frequency baseline in aggregate evaluations, and fall to chance-level performance when assessed at the individual-gene level. This indicates that the models rely more on each gene's intrinsic tendency to respond than on genuine perturbation reasoning.
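The gap between aggregate and per-gene evaluation can be illustrated with a minimal sketch. The abstract does not define the baseline or the metrics precisely, so everything here is an assumption: the function names, the pairwise AUROC implementation, and the toy labels are all hypothetical. The sketch scores each pair only by how often its gene was differentially expressed in training data, then compares pooled AUROC with AUROC averaged over genes.

```python
from collections import defaultdict


def auroc(labels, scores):
    """Pairwise (Mann-Whitney) AUROC; ties count 0.5. None if only one class is present."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gene_frequency_scores(train, test):
    """Assumed baseline: score a pair by its gene's training DE frequency, ignoring the perturbation."""
    counts = defaultdict(lambda: [0, 0])  # gene -> [times DE, times observed]
    for _pert, gene, label in train:
        counts[gene][0] += label
        counts[gene][1] += 1
    scores = []
    for _pert, gene, _label in test:
        n_de, n_total = counts[gene]
        scores.append(n_de / n_total if n_total else 0.5)
    return scores


def macro_per_gene_auroc(test, scores):
    """Compute AUROC separately for each gene, then average over genes with both classes."""
    by_gene = defaultdict(lambda: ([], []))
    for (_pert, gene, label), score in zip(test, scores):
        by_gene[gene][0].append(label)
        by_gene[gene][1].append(score)
    values = [auroc(labels, s) for labels, s in by_gene.values()]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None


# Hypothetical toy data: (perturbation, gene, is_differentially_expressed).
train = [("P1", "G1", 1), ("P2", "G1", 1), ("P3", "G1", 1), ("P4", "G1", 0),
         ("P1", "G2", 0), ("P2", "G2", 0), ("P3", "G2", 0), ("P4", "G2", 1)]
test = [("P5", "G1", 1), ("P6", "G1", 1), ("P7", "G1", 0),
        ("P5", "G2", 0), ("P6", "G2", 0), ("P7", "G2", 1)]

scores = gene_frequency_scores(train, test)
aggregate = auroc([label for _p, _g, label in test], scores)
per_gene = macro_per_gene_auroc(test, scores)
print(f"aggregate AUROC = {aggregate:.3f}, macro-per-gene AUROC = {per_gene:.3f}")
```

On this toy data the pooled AUROC is about 0.667 while the macro-per-gene AUROC is exactly 0.5. Within a single gene the baseline assigns every perturbation the same score, so any predictor that mostly tracks gene-level tendencies can look competitive in aggregate and still be at chance per gene.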
We attribute this failure to the way evidence is currently presented: existing methods assess each perturbation-gene pair in isolation, so the model never sees how related perturbations exert different effects on the same gene. To overcome this limitation, we introduce CORE (Contrastive Organization of Relational Evidence). CORE reframes prediction as a comparison task by organizing evidence into positive and negative outcomes observed for related perturbations. Using a biomedical knowledge graph for evidence retrieval, CORE improves calibration and significantly improves perturbation-specific prediction in both LLM-based and non-LLM settings. For instance, on drug-perturbation datasets, CORE-Reasoning boosts Qwen3.5-9B aggregate metrics by up to 28.6%. Similarly, on generic perturbation data, CORE-Voting raises the macro-per-gene AUROC from chance level to an average of 0.703 across four cell lines. These findings underscore the importance of organizing evidence contrastively for reliable LLM-based perturbation reasoning.
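As a rough illustration of what contrastive organization of evidence could look like, the sketch below splits the observed outcomes of related perturbations on the same gene into positive and negative groups and turns them into a similarity-weighted vote. This is not the authors' implementation: the abstract does not describe the retrieval procedure or the voting rule, so the `neighbors` mapping (a stand-in for knowledge-graph retrieval), the similarity weights, the function names, and all data are hypothetical.

```python
def organize_contrastive_evidence(target_pert, gene, neighbors, observed):
    """Split related perturbations with a known outcome on `gene` into positive and negative evidence."""
    positive, negative = [], []
    for related, similarity in neighbors.get(target_pert, []):
        label = observed.get((related, gene))
        if label is None:
            continue  # no measurement for this related perturbation on this gene
        (positive if label == 1 else negative).append((related, similarity))
    return positive, negative


def vote(positive, negative):
    """Assumed voting rule: similarity-weighted share of positive evidence; 0.5 when there is none."""
    pos_weight = sum(sim for _pert, sim in positive)
    neg_weight = sum(sim for _pert, sim in negative)
    total = pos_weight + neg_weight
    return pos_weight / total if total else 0.5


# Hypothetical stand-in for knowledge-graph retrieval: perturbation -> [(related, similarity)].
neighbors = {"pert_X": [("pert_A", 0.9), ("pert_B", 0.8), ("pert_C", 0.2)]}

# Hypothetical observed outcomes: (perturbation, gene) -> 1 if differentially expressed, else 0.
observed = {
    ("pert_A", "gene_1"): 1, ("pert_B", "gene_1"): 1, ("pert_C", "gene_1"): 0,
    ("pert_A", "gene_2"): 0, ("pert_C", "gene_2"): 1,
}

score_gene_1 = vote(*organize_contrastive_evidence("pert_X", "gene_1", neighbors, observed))
score_gene_2 = vote(*organize_contrastive_evidence("pert_X", "gene_2", neighbors, observed))
score_gene_3 = vote(*organize_contrastive_evidence("pert_X", "gene_3", neighbors, observed))
print(score_gene_1, score_gene_2, score_gene_3)
```

In this sketch the score for a gene depends on which related perturbations did and did not change it, rather than on how often the gene responds overall. The same positive and negative groups could also be rendered as text for an LLM prompt, but the prompt format is not specified in the abstract.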
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




