arXiv

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

June 2, 2026 · Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su, Yujia Zhou, Min Zhang, Yiqun Liu, Qinyao Ai

Title: Assessing the Efficacy of LLM-as-a-Judge in Evaluating Long-Form Content

Abstract

The growing use of large language models (LLMs) to generate long-form content has made reliable assessment of such outputs a pressing issue. While LLM-as-a-judge offers a scalable alternative to traditional human evaluation, its trustworthiness in this specific context has received limited attention. Existing meta-evaluation benchmarks predominantly target short-form outputs, overlooking the unique complexities of longer texts. Evaluating long-form generation differs from short-form assessment in more than length: judges must also handle intricate, document-level requirements.

To address this gap, we present LongJudgeBench, a robust benchmark designed to assess the performance of LLM judges on long-form outputs across a variety of real-world applications and judging protocols. We systematically analyze a wide array of LLM judges, covering diverse base models and evaluation settings. Our findings highlight a significant reliability deficit: current LLM judges are unstable across different scenarios. Including rubrics or reference materials improves performance but does not guarantee consistency. We anticipate that LongJudgeBench will facilitate the development of more resilient, context-sensitive, and human-aligned LLM-as-a-judge methodologies. The code for this study is available at https://anonymous.4open.science/r/LongJudgeBench-F782.
