arXiv

When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs

June 2, 2026 · Bogdan Zagribelnyy, Ivan Ilin, Maksim Kuznetsov, Nikita Bondarev, Mathieu Reymond, Roman Schutski, Thomas MacDougall, Rim Shayakhmetov, Zulfat Miftakhutdinov, Mikolaj Mizera, Vladimir Aladinskiy, Alex Aliper, Alex Zhavoronkov

Title: Beyond the Single Solution: Reevaluating Single-Step Retrosynthesis Benchmarks for Large Language Models

Abstract: While large language models (LLMs) are increasingly applied to drug discovery tasks such as synthesis planning, objective assessment of their retrosynthetic capabilities remains underdeveloped. Current evaluation standards rely mainly on published synthetic routes and Top-K accuracy metrics anchored to a single ground-truth answer. This methodology fails to reflect the inherently open-ended nature of practical synthesis planning, where several distinct disconnections of the same target can be valid. To address this gap, we present a new benchmarking framework for single-step retrosynthesis designed to assess both general-purpose and chemistry-specialized LLMs. Central to this framework is ChemCensor, a novel metric that prioritizes chemical plausibility over exact string matching, thereby offering an evaluation method more closely aligned with human decision-making. Additionally, we introduce CREED, a comprehensive dataset containing millions of reaction records validated by ChemCensor, which we used to train a model that outperforms existing LLM baselines on this benchmark.
