LLM Judges Inconsistently Disagree Across Safety Criteria and Harm Categories
Title: Large Language Model Evaluators Show Significant Disagreement Regarding Safety Standards and Harm Types
Abstract: This study assesses the reliability of automated evaluators performing multi-dimensional safety assessments in a reference-free setting. We find that Large Language Models (LLMs) are inconsistent judges when detecting safety concerns in machine-generated guidance for regulated sectors such as finance, but are more reliable when identifying more explicit forms of harmful content, such as violence. The degree of inconsistency in a model’s evaluations varies considerably with the specific safety criterion applied, and is also influenced by the content’s language and linguistic style. We also observe substantial disagreement among different evaluators on the same output, across domains, safety metrics, and languages. These findings inform the use of LLMs as evaluators and yield practical recommendations for deploying automated judges in real-world applications.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





