arXiv

Evaluating Relational Reasoning in LLMs with REL

June 3, 2026 · Lukas Fesser, Yasha Ektefaie, Ada Fang, Sham M. Kakade, Marinka Zitnik

Title: Assessing Relational Reasoning Capabilities in Large Language Models via REL

Abstract: Relational reasoning is the capacity to infer relations that jointly bind multiple entities, attributes, or variables. Although this skill is fundamental to scientific inquiry, existing evaluations of relational reasoning in large language models (LLMs) often rely on structured inputs such as tables or graphs, as well as on synthetic tasks. As a result, they often fail to isolate the difficulty that arises specifically from binding higher-arity relations. We study this gap through the lens of Relational Complexity (RC), defined as the minimum number of independent entities or operands that must be bound simultaneously to apply a relation. RC offers a principled way to scale reasoning difficulty while controlling for confounding factors such as input size, vocabulary size, and representational design. Building on RC, we present REL, a generative benchmark suite covering algebra, biology, and chemistry that varies RC within each domain. Across state-of-the-art LLMs, performance declines consistently and monotonically as RC increases, even when the total number of entities is held constant. This failure persists with additional test-time compute and with in-context learning, which points to a limitation tied to the arity of the required relational binding rather than to too few inference steps or too few examples. These findings identify a regime of high-arity reasoning in which current models falter and motivate re-examining benchmarks in terms of relational complexity.
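To make the RC definition concrete, the following is a minimal toy sketch in Python. It is not the REL generator, whose construction the abstract does not describe. The function name `make_toy_item`, the binary entity values, and the random tuple-table relation are all assumptions chosen for illustration. The sketch holds the number of listed entities fixed while varying how many of them a single relation binds at once. An explicit tuple table is used because membership in an arbitrary table cannot, in general, be decided from a proper subset of the operands.

```python
import itertools
import random


def make_toy_item(num_entities, arity, seed=0):
    """Build one toy item (hypothetical format, not REL's).

    num_entities: how many entities appear in the item (held fixed across arities).
    arity: how many entities the relation binds simultaneously.
    """
    if not 1 <= arity <= num_entities:
        raise ValueError("arity must be between 1 and num_entities")
    rng = random.Random(seed)

    # Every item lists the same number of entities, each with a binary value.
    names = [f"e{i}" for i in range(num_entities)]
    values = {name: rng.randint(0, 1) for name in names}

    # Assumed relation: a random half of all binary tuples of length `arity`.
    all_tuples = list(itertools.product((0, 1), repeat=arity))
    relation = set(rng.sample(all_tuples, len(all_tuples) // 2))

    # The query binds `arity` distinct entities in a fixed order.
    operands = rng.sample(names, arity)
    label = tuple(values[name] for name in operands) in relation

    return {
        "entities": values,
        "operands": operands,
        "relation": relation,
        "label": label,
    }


if __name__ == "__main__":
    # Same entity count, increasing arity of the queried relation.
    for k in (2, 3, 4):
        item = make_toy_item(num_entities=6, arity=k, seed=k)
        print(k, len(item["entities"]), item["operands"], item["label"])
```

In this toy setting the input always contains six entities, so any change in difficulty across `arity` values comes from the width of the binding rather than from the amount of input.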
