arXiv

Rethinking Molecular Text Representations for LLMs: An Empirical Study

Large language models (LLMs) are seeing growing adoption in molecular science, yet there is still no consensus on which text representation of molecules works best. This study introduces a benchmark of LLM chemistry capabilities that evaluates nine representations across eight chemical tasks. The evaluation covers 16 LLMs from five model families, spanning reasoning and non-reasoning models, chemistry-specialized models, and leading closed-source frontier models.

The findings indicate that model performance is heavily influenced by the chosen representation. No single format emerged as universally superior; however, the ranking generally placed CML at the top, followed by MolJSON, InChI, and canonical SMILES. The study reveals a clear division in utility: explicit structured text formats, specifically CML and MolJSON, are most effective for structural tasks. In contrast, IUPAC nomenclature excelled in semantic tasks, achieving the highest success rate in molecule retrieval across all 16 tested LLMs. Furthermore, when employing LLM-as-a-judge metrics, IUPAC yielded the highest proportion of correctly generated molecules.
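To make the contrast between compact and explicit formats concrete, the sketch below writes ethanol in several of the formats named above and reads the atom and bond lists straight out of a CML fragment. It is an illustration only and is not taken from the study: the CML fragment is a simplified, hand-written example (heavy atoms with hydrogen counts, no namespace), and `read_cml_graph` is a hypothetical helper. MolJSON is omitted because its schema is not described here.

```python
import xml.etree.ElementTree as ET

# Ethanol in three compact formats discussed above.
ETHANOL = {
    "SMILES": "CCO",
    "InChI": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "IUPAC": "ethanol",
}

# Simplified CML fragment (assumption: heavy atoms only, hydrogens as counts).
ETHANOL_CML = """
<molecule id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C" hydrogenCount="3"/>
    <atom id="a2" elementType="C" hydrogenCount="2"/>
    <atom id="a3" elementType="O" hydrogenCount="1"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
</molecule>
"""


def read_cml_graph(cml_text):
    """Return (atoms, bonds) listed explicitly in a CML fragment.

    atoms maps atom id -> element symbol; bonds is a list of (id, id) pairs.
    """
    root = ET.fromstring(cml_text.strip())
    atoms = {a.get("id"): a.get("elementType") for a in root.iter("atom")}
    bonds = [tuple(b.get("atomRefs2").split()) for b in root.iter("bond")]
    return atoms, bonds


atoms, bonds = read_cml_graph(ETHANOL_CML)
print(atoms)
print(bonds)
for name, text in ETHANOL.items():
    print(f"{name}: {len(text)} characters")
```

In the CML form every atom and bond is a separate, labeled token sequence, whereas SMILES encodes the same graph implicitly through character order. That difference is one plausible reason the structured formats help on structural tasks, though the sketch itself demonstrates nothing about model behavior.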

Although SMILES is widely used during pretraining, SMILES variants were rarely the best choice for these tasks. The research also highlights a potential bias in current evaluation methods: chemistry-specialized models perform strongly with SMILES but suffer significant performance drops with structured text representations. This suggests that evaluating models exclusively with SMILES may reward specialization that does not generalize.

To understand these performance differences, a mechanistic investigation was conducted using tokenization audits, linear probes, and attention analysis. The results demonstrate that models encode different representations in distinct ways; for instance, processing structured representations requires higher attention levels across the entire molecular span. Ultimately, these findings challenge the validity of representation-invariant evaluations and advocate for task-aware representation routing in LLM-driven chemistry applications.
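Task-aware representation routing can be pictured as a lookup from task type to preferred format, with a fallback order when that format is unavailable. The sketch below is a minimal illustration under stated assumptions, not the study's method: the two task categories, the `pick_representation` function, and the choice of CML for structural tasks and IUPAC for semantic tasks are simplifications of the findings summarized above, and the fallback order reuses the overall ranking reported there.

```python
from enum import Enum


class TaskKind(Enum):
    """Hypothetical two-way split of tasks, following the summary above."""
    STRUCTURAL = "structural"  # e.g. questions about atoms, bonds, rings
    SEMANTIC = "semantic"      # e.g. molecule retrieval or generation


# Preferred format per task kind (assumption drawn from the reported findings).
PREFERRED = {
    TaskKind.STRUCTURAL: "CML",
    TaskKind.SEMANTIC: "IUPAC",
}

# Fallback order, taken from the overall ranking reported in the summary.
OVERALL_RANKING = ["CML", "MolJSON", "InChI", "canonical SMILES"]


def pick_representation(kind, available):
    """Choose a format for a task from the formats available for a molecule."""
    preferred = PREFERRED[kind]
    if preferred in available:
        return preferred
    for fmt in OVERALL_RANKING:
        if fmt in available:
            return fmt
    raise ValueError("no supported representation available")


print(pick_representation(TaskKind.STRUCTURAL, {"CML", "IUPAC"}))
print(pick_representation(TaskKind.SEMANTIC, {"InChI", "canonical SMILES"}))
```

A real router would likely need finer task categories and per-model preferences, since the summary reports that the best format varies with both task and model.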


Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
