arXiv

BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

June 2, 2026 · Sebastian Nagl, Ann-Kristin Mayrhofer, Martin Heidebach, Aleyna Koçak, Anne Zettelmeier, Elly Breu, Angelina Greiner, Sofija Milijas, Matthias Grabmair

Title: BenGER: A Benchmark for Evaluating LLM Systems on Subsumption-Based Legal Reasoning within the German Legal Framework

Abstract:

This paper presents BenGER (Benchmark for German Law), a novel dataset designed to assess Large Language Model (LLM) systems on subsumption-based legal reasoning tasks specific to German law. The BenGER collection comprises two primary elements: 531 short doctrinal reasoning exercises and 596 free-text legal case problems structured in an exam format, spanning various tiers of legal education.

We evaluated 12 current LLM systems (open-weight models, efficiency-focused systems, and closed flagship models) using both automatic metrics and judge-based assessments. To contextualize model performance, we analyzed a controlled validation subset of timed, human-written solutions produced under two conditions: unaided and human-AI co-creation.

Furthermore, we developed a rubric-aligned LLM-as-a-Judge framework and validated it against a multi-rater human grading protocol in which each solution received three blind reviews and one author-informed creator review. Substituting a blind human reviewer with the LLM judge reduces agreement with the full human pool by no more than removing that reviewer entirely (Calderon r=0.96 compared to r=0.96, with a matched sample size of n=30). The results also show that closed flagship systems dominate the leaderboard across all datasets, and that collaborative human-AI workflows significantly surpass unaided human performance.
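The substitution check described above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: agreement is measured here as a plain Pearson correlation between per-solution mean scores, the "full human pool" is taken to be the mean of all four human scores, and the function names and toy scores are hypothetical. The paper's Calderon statistic may be defined differently.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def row_means(rows):
    """Mean score per solution (one row per solution)."""
    return [sum(r) / len(r) for r in rows]


def compare_substitution(human, judge, idx):
    """Compare dropping reviewer `idx` with replacing them by the LLM judge.

    human: one row per solution, one column per human reviewer
    judge: one LLM-judge score per solution
    Returns (r_drop, r_substitute), each measured against the full human pool.
    ASSUMPTION: Pearson r on pooled means stands in for the paper's statistic.
    """
    full = row_means(human)
    dropped = row_means(
        [[s for j, s in enumerate(row) if j != idx] for row in human]
    )
    substituted = row_means(
        [[judge[i] if j == idx else s for j, s in enumerate(row)]
         for i, row in enumerate(human)]
    )
    return pearson(full, dropped), pearson(full, substituted)


# Invented toy scores: 5 solutions, 4 human reviewers (columns 0-2 blind).
human_scores = [
    [10, 11, 9, 12],
    [6, 7, 5, 6],
    [14, 13, 15, 14],
    [8, 9, 8, 10],
    [12, 12, 11, 13],
]
judge_scores = [10, 6, 14, 9, 12]

r_drop, r_sub = compare_substitution(human_scores, judge_scores, idx=0)
print(f"drop reviewer 0: r={r_drop:.3f}; substitute judge: r={r_sub:.3f}")
```

Under this reading, the judge counts as an acceptable stand-in for a blind reviewer when the substituted pool tracks the full pool at least as well as the reduced pool does, which is the comparison the reported pair of values expresses.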
