Global News Digest

arXiv

Plausibility Is Not Prediction: Contrastive Evidence for LLM-Based Cellular Perturbation Reasoning

Abstract:

While perturbation experiments are fundamental to deciphering cellular mechanisms, their high cost and scarcity have spurred interest in predicting gene expression responses for unobserved conditions. Recently, large language models (LLMs) have emerged as a promising approach, functioning as "virtual cell" simulators. By employing stepwise, knowledge-grounded mechanistic reasoning to infer differential expression, these methods aim to establish an interpretable, knowledge-driven paradigm that moves beyond purely data-driven techniques.

However, our analysis demonstrates that plausibility is not synonymous with prediction. Although these LLM-generated explanations are biologically plausible, they fail to capture perturbation-specific effects. Specifically, the models systematically overestimate differential expression, often perform worse than a simple gene-frequency baseline in aggregate evaluations, and fall to chance-level performance when assessed at the individual gene level. This indicates that the models rely more on intrinsic gene response tendencies than on genuine perturbation reasoning.
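To make the gap between aggregate and per-gene evaluation concrete, here is a minimal Python sketch on toy data. It is an illustration under stated assumptions, not the paper's evaluation code: the abstract does not define the gene-frequency baseline or the exact metric computation, so the function names, the tuple layout `(perturbation, gene, label)`, and the toy labels are all hypothetical. The sketch assumes the baseline scores each pair by how often that gene was differentially expressed in training, regardless of the perturbation.

```python
from collections import defaultdict

def auroc(labels, scores):
    """Pairwise AUROC with ties counted as 0.5; None if only one class is present."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gene_frequency_scores(train, test):
    """Assumed baseline: score = fraction of training perturbations where the gene was DE."""
    counts = defaultdict(lambda: [0, 0])  # gene -> [DE count, total count]
    for _pert, gene, label in train:
        counts[gene][0] += label
        counts[gene][1] += 1
    # Genes never seen in training get an uninformative 0.5.
    return [counts[g][0] / counts[g][1] if counts[g][1] else 0.5 for _p, g, _y in test]

def macro_per_gene_auroc(test, scores):
    """Compute AUROC separately within each gene, then average over genes."""
    by_gene = defaultdict(lambda: ([], []))
    for (_p, g, y), s in zip(test, scores):
        by_gene[g][0].append(y)
        by_gene[g][1].append(s)
    vals = [auroc(ys, ss) for ys, ss in by_gene.values()]
    vals = [v for v in vals if v is not None]
    return sum(vals) / len(vals)

# Hypothetical toy data: gene A responds to most perturbations, gene B to few.
train = [("p1", "A", 1), ("p2", "A", 1), ("p3", "A", 1), ("p4", "A", 0),
         ("p1", "B", 0), ("p2", "B", 0), ("p3", "B", 1), ("p4", "B", 0)]
test = [("p5", "A", 1), ("p6", "A", 1), ("p7", "A", 0),
        ("p5", "B", 0), ("p6", "B", 0), ("p7", "B", 1)]

scores = gene_frequency_scores(train, test)
pooled = auroc([y for _p, _g, y in test], scores)  # all pairs pooled together
macro = macro_per_gene_auroc(test, scores)         # perturbation-specific view
print(f"pooled AUROC = {pooled:.3f}, macro-per-gene AUROC = {macro:.3f}")
```

On this toy data the pooled score is above chance while the per-gene score is exactly 0.5, because a score that depends only on the gene cannot rank perturbations within that gene. This is the pattern the abstract describes: a method can look good in aggregate while carrying no perturbation-specific signal.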

We attribute this failure to the manner in which evidence is currently presented: existing methods assess perturbation-gene pairs in isolation, lacking exposure to how related perturbations exert different effects on the same gene. To overcome this limitation, we introduce CORE (Contrastive Organization of Relational Evidence). CORE reframes the prediction task as a comparison exercise by structuring evidence into positive and negative outcomes derived from related perturbations. Leveraging a biomedical knowledge graph for evidence retrieval, CORE enhances calibration and significantly improves perturbation-specific prediction capabilities in both LLM-based and non-LLM contexts. For instance, on drug-perturbation datasets, CORE-Reasoning boosts Qwen3.5-9B aggregate metrics by up to 28.6%. Similarly, on generic perturbation data, CORE-Voting elevates the macro-per-gene AUROC from chance levels to an average of 0.703 across four cell lines. These findings underscore the critical importance of organizing evidence contrastively for reliable LLM-based perturbation reasoning.
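The abstract does not say how CORE retrieves or weighs evidence, so the following Python sketch is only one possible reading of "structuring evidence into positive and negative outcomes derived from related perturbations". Everything in it is an assumption made for illustration: the Jaccard similarity over knowledge-graph neighbor sets, the similarity-weighted vote, the parameter `k`, and the toy drug and gene names. It should not be taken as the CORE-Voting algorithm itself.

```python
def jaccard(a, b):
    """Overlap between two neighbor sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def contrastive_evidence(query_pert, gene, kg_neighbors, observed, k=3):
    """Split outcomes of the k most related perturbations on `gene` into
    positive (gene was DE) and negative (gene was not DE) evidence."""
    q = kg_neighbors.get(query_pert, set())
    related = []
    for pert, nbrs in kg_neighbors.items():
        # Only perturbations with a measured outcome on this gene count as evidence.
        if pert != query_pert and (pert, gene) in observed:
            related.append((jaccard(q, nbrs), pert))
    related.sort(reverse=True)
    top = related[:k]
    positive = [(p, w) for w, p in top if observed[(p, gene)] == 1]
    negative = [(p, w) for w, p in top if observed[(p, gene)] == 0]
    return positive, negative

def vote(positive, negative):
    """Similarity-weighted share of positive evidence; 0.5 if no usable evidence."""
    pos_w = sum(w for _p, w in positive)
    neg_w = sum(w for _p, w in negative)
    total = pos_w + neg_w
    return pos_w / total if total else 0.5

# Hypothetical knowledge-graph neighborhoods (e.g. annotated targets) per drug.
kg_neighbors = {
    "drugX": {"EGFR", "MAPK1", "AKT1"},
    "drugA": {"EGFR", "MAPK1"},
    "drugB": {"EGFR", "AKT1", "MTOR"},
    "drugC": {"AKT1", "TUBB"},
}
# Hypothetical observed outcomes: 1 = gene differentially expressed, 0 = not.
observed = {("drugA", "FOS"): 1, ("drugB", "FOS"): 1, ("drugC", "FOS"): 0}

positive, negative = contrastive_evidence("drugX", "FOS", kg_neighbors, observed)
score = vote(positive, negative)
print("positive:", positive)
print("negative:", negative)
print(f"P(FOS is DE under drugX) ~ {score:.3f}")
```

The two evidence lists could equally be formatted into a prompt for an LLM rather than voted on directly. The point of the sketch is only that the query pair is judged against related perturbations with differing outcomes on the same gene, instead of in isolation.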


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
