Decomposing and Measuring Evaluation Awareness
Title: Dissecting and Quantifying Evaluation Awareness
Abstract:
Frontier language models occasionally detect that they are being assessed, leading to behavioral adjustments that compromise the validity of benchmark outcomes. However, current research lacks a unified framework, often conflating model traits with evaluation characteristics and confusing detection with the subsequent behavioral response. This study anchors the concept of evaluation awareness in social psychology, breaking it down into two distinct parts: an environmental aspect (defined by how identifiable the task is) and a model aspect (which distinguishes between recognizing the evaluation and the likelihood of acting on that recognition). We quantify the environmental component by identifying eight specific trigger factors, including the use of placeholder entities and grading-oriented output structures, while monitoring recognition and behavior via chain-of-thought analysis. Our experiments across four benchmarks and nine frontier models reveal that recognition rates are determined by the unique combination of model and benchmark, rather than by either element alone. Although recognition seldom results in behavioral shifts, when changes do occur, their direction is influenced by the perceived nature of the evaluation. Furthermore, models exhibit heightened sensitivity to safety assessments compared to capability tests, thereby posing a greater threat to the reliability of safety benchmarks. To investigate the specific factors to which each model responds and how these factors interact, we introduce EvalAwareBench. This controlled benchmark consists of 100 paired safety and capability tasks, allowing each of the eight factors to be toggled independently while keeping the core request constant. Our findings indicate that no single factor impacts all models uniformly; however, combining multiple factors consistently increases evaluation awareness across the board. Our proposed framework and EvalAwareBench offer essential tools for measuring, attributing, and mitigating evaluation awareness, suggesting that maintaining behavioral consistency despite recognition is a viable direction for future research.
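The factor-toggling design described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Variant` class, the `build_variants` helper, the task identifier, and the `factor_3` through `factor_8` names are hypothetical. Only the first two factor names are taken from the abstract; the other six are not listed there and are left as placeholders.

```python
from dataclasses import dataclass

# Only the first two factors are named in the abstract. The remaining six
# are hypothetical placeholders, since the abstract does not list them.
FACTORS = (
    "placeholder_entities",
    "grading_oriented_output",
    "factor_3",
    "factor_4",
    "factor_5",
    "factor_6",
    "factor_7",
    "factor_8",
)


@dataclass(frozen=True)
class Variant:
    """One rendering of a task: same core request, different active factors."""
    task_id: str
    domain: str            # "safety" or "capability"
    core_request: str      # held constant across all variants of a task
    active_factors: frozenset


def build_variants(task_id: str, domain: str, core_request: str) -> list:
    """Return a baseline, one variant per single factor, and an all-factors variant."""
    configs = [frozenset()]                          # baseline: no factor active
    configs += [frozenset({f}) for f in FACTORS]     # each factor toggled alone
    configs.append(frozenset(FACTORS))               # all factors combined
    return [Variant(task_id, domain, core_request, c) for c in configs]


variants = build_variants("task-001", "safety", "<core request text>")
```

Under these assumptions each task yields ten variants that differ only in which factors are active, so a change in recognition rate between the baseline and a single-factor variant can be attributed to that factor, and the all-factors variant probes the combined effect.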
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



