arXiv

Rethinking Molecular Text Representations for LLMs: An Empirical Study

June 3, 2026 · Arun Raja, Garrett M. Morris, Kian Ming A. Chai

Title: Reevaluating Molecular Text Formats for Large Language Models: A Comprehensive Empirical Analysis

Large language models (LLMs) are seeing growing adoption in molecular science, yet there is still no consensus on the most effective molecular text representation. This study introduces a rigorous benchmark designed to assess the capabilities of LLMs in handling chemistry, evaluating nine distinct representations across eight different chemical tasks. The evaluation encompasses 16 LLMs drawn from five different model families, incorporating both reasoning and non-reasoning architectures, models specialized in chemistry, and leading closed-source frontier models.

The findings indicate that model performance is heavily influenced by the chosen representation. No single format emerged as universally superior; however, the ranking generally placed CML at the top, followed by MolJSON, InChI, and canonical SMILES. The study reveals a clear division in utility: explicit structured text formats, specifically CML and MolJSON, are most effective for structural tasks. In contrast, IUPAC nomenclature excelled in semantic tasks, achieving the highest success rate in molecule retrieval across all 16 tested LLMs. Furthermore, when employing LLM-as-a-judge metrics, IUPAC yielded the highest proportion of correctly generated molecules.
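To make the contrast between line notations and explicit structured text concrete, the sketch below writes one molecule (ethanol) in several of the formats named above. It is a minimal illustration, not the study's benchmark code. The CML output is a stripped-down fragment without namespaces, and the JSON field names (`atoms`, `bonds`, `begin`, `end`, `order`) are hypothetical stand-ins, since the MolJSON schema is not described in this summary.

```python
import json
import xml.etree.ElementTree as ET

# Ethanol as a heavy-atom graph: (id, element) atoms and (begin, end, order) bonds.
ATOMS = [("a1", "C"), ("a2", "C"), ("a3", "O")]
BONDS = [("a1", "a2", 1), ("a2", "a3", 1)]


def to_cml(atoms, bonds):
    """Serialize the graph as a minimal CML-style XML string (no namespace)."""
    molecule = ET.Element("molecule")
    atom_array = ET.SubElement(molecule, "atomArray")
    for atom_id, element in atoms:
        ET.SubElement(atom_array, "atom", id=atom_id, elementType=element)
    bond_array = ET.SubElement(molecule, "bondArray")
    for begin, end, order in bonds:
        ET.SubElement(bond_array, "bond", atomRefs2=f"{begin} {end}", order=str(order))
    return ET.tostring(molecule, encoding="unicode")


def to_json_graph(atoms, bonds):
    """Serialize the graph as JSON. Field names are hypothetical,
    not the MolJSON schema evaluated in the study."""
    payload = {
        "atoms": [{"id": atom_id, "element": element} for atom_id, element in atoms],
        "bonds": [{"begin": b, "end": e, "order": o} for b, e, o in bonds],
    }
    return json.dumps(payload)


# The same molecule in compact line notations and in explicit structured text.
representations = {
    "smiles": "CCO",
    "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "iupac": "ethanol",
    "cml": to_cml(ATOMS, BONDS),
    "json_graph": to_json_graph(ATOMS, BONDS),
}

if __name__ == "__main__":
    for name, text in representations.items():
        print(f"{name:>10} ({len(text)} chars): {text}")
```

The structured forms spell out every atom and bond, so they are much longer than the SMILES string, while the IUPAC name carries the identity of the molecule in a single natural-language token sequence.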

Interestingly, despite the widespread use of SMILES during pretraining, variants of SMILES rarely proved to be the optimal choice for these tasks. The research highlights a potential bias in current evaluation methods: chemistry-specialized models perform strongly with SMILES but suffer significant performance drops when using structured text representations. This suggests that evaluating models exclusively with SMILES may reward specialization that lacks generalizability.

To understand these performance differences, a mechanistic investigation was conducted using tokenization audits, linear probes, and attention analysis. The results demonstrate that models encode different representations in distinct ways; for instance, models allocate more attention across the entire molecular span when processing structured representations. Ultimately, these findings challenge the validity of representation-invariant evaluations and advocate for task-aware representation routing in LLM-driven chemistry applications.
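One way to picture task-aware representation routing is a small lookup that picks a format by task type and falls back to the overall ranking reported above. The sketch below is an assumption-laden illustration rather than the authors' method: the task names in `TASK_KIND` and the function `route_representation` are hypothetical, and only the structural/semantic preferences and the overall ordering come from the findings summarized here.

```python
# Overall ordering reported in the summary, best first.
OVERALL_RANKING = ["cml", "moljson", "inchi", "canonical_smiles"]

# Preferences by task type, following the structural/semantic split in the findings.
PREFERRED_BY_KIND = {
    "structural": ["cml", "moljson"],
    "semantic": ["iupac"],
}

# Hypothetical task labels, chosen for illustration only.
TASK_KIND = {
    "ring_counting": "structural",
    "molecule_retrieval": "semantic",
}


def route_representation(task, available):
    """Return the first preferred format that the caller can actually supply.

    Unknown tasks skip the per-type preferences and use the overall ranking.
    """
    kind = TASK_KIND.get(task)
    candidates = PREFERRED_BY_KIND.get(kind, []) + OVERALL_RANKING
    for fmt in candidates:
        if fmt in available:
            return fmt
    raise ValueError(f"no supported representation available for task {task!r}")


if __name__ == "__main__":
    formats = {"canonical_smiles", "moljson", "iupac"}
    for task in ("ring_counting", "molecule_retrieval", "unlisted_task"):
        print(task, "->", route_representation(task, formats))
```

A real router would likely also condition on the model, since the summary notes that chemistry-specialized models behave differently on SMILES than on structured text.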
