Do LLMs Hold Their Values? MANTA: A Multi-Turn Adversarial Benchmark for Animal Welfare Reasoning
Abstract:
Large Language Models (LLMs) are being rapidly integrated into professional and consumer settings, where welfare considerations often appear implicitly in everyday queries, yet assessing their reasoning about animal welfare remains an unresolved challenge. Current evaluation frameworks, such as AnimalHarmBench, rely on single-turn, explicitly framed questions to determine whether models refuse to generate harmful content when directly solicited. However, this methodology overlooks two critical dimensions of model behavior: the erosion of alignment under prolonged adversarial pressure, and "moral sensitivity," defined as the model’s tendency to spontaneously highlight welfare issues in routine interactions.
To address these limitations, we introduce MANTA, a novel benchmark comprising 1,088 multi-turn conversations. These interactions begin with an implicit scenario in Turn-1, evolve into an explicit welfare prompt, and subsequently undergo three rounds of adversarial pressure. These pressures are categorized into a five-type taxonomy: Social, Cultural, Economic, Pragmatic, and Epistemic. We assess performance across two metrics: Animal Welfare Value Stability (AWVS), serving as the primary score, and Animal Welfare Moral Sensitivity (AWMS), used as a diagnostic measure.
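As a reading aid, the conversation layout described above can be sketched as a small data structure. This is a minimal sketch rather than the released dataset schema: the names `PressureType`, `Conversation`, and `user_turns` are hypothetical, and it assumes each conversation has exactly five user turns (implicit scenario, explicit welfare prompt, three pressure rounds), with a pressure type recorded per round.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class PressureType(Enum):
    """The five pressure categories named in the abstract."""
    SOCIAL = "social"
    CULTURAL = "cultural"
    ECONOMIC = "economic"
    PRAGMATIC = "pragmatic"
    EPISTEMIC = "epistemic"


@dataclass
class Conversation:
    """Hypothetical container for one multi-turn item (not the released schema)."""
    implicit_scenario: str                          # Turn 1: welfare is only implicit
    explicit_prompt: str                            # explicit welfare prompt
    pressure_turns: List[Tuple[PressureType, str]]  # three adversarial rounds

    def __post_init__(self) -> None:
        # Assumption: the abstract specifies three rounds of adversarial pressure.
        if len(self.pressure_turns) != 3:
            raise ValueError("expected exactly three pressure turns")

    def user_turns(self) -> List[str]:
        """Return the user-side messages in conversation order."""
        return [self.implicit_scenario, self.explicit_prompt] + [
            text for _, text in self.pressure_turns
        ]
```

Under these assumptions, AWMS would be judged on the reply to the first turn and AWVS on the replies that follow the pressure rounds; how the judge prompts actually consume the turns is defined by the released materials, not by this sketch.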
Our study evaluates seven leading models: Claude Opus 4.7, GPT-5.5, DeepSeek V4, Llama 3.3 70B, Mistral Small, Grok 4.3, and Gemini 3.1 Flash Lite. The multi-turn approach reveals behavioral nuances that single-turn benchmarks overlook; notably, four of the seven models changed rank between Turn-1 results and later turns. For instance, Gemini 3.1 Flash Lite fell from fifth place on AWMS to last place on AWVS. Furthermore, while AWMS and AWVS are positively correlated, the correlation is imperfect, indicating that tests of moral recognition capture a stable yet incomplete aspect of model behavior under stress.
MANTA also enables a species-by-pressure interaction matrix, a feature absent from previous benchmarks. This analysis demonstrates that welfare robustness is determined by the interplay between the specific animal involved and the type of pressure applied. Generally, companion animals exhibited higher robustness scores than wild animals, which in turn scored higher than farmed animals and invertebrates. We have publicly released the dataset, scripted pressure plans, judge prompts, and the associated analysis code.
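One way to picture a species-by-pressure interaction matrix is as a table of mean robustness scores, one cell per (species group, pressure type) pair. The sketch below is illustrative only: the function name `interaction_matrix`, the record layout, and the numeric scores are assumptions, and the toy values are placeholders rather than results from the paper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, float]  # (species_group, pressure_type, score), hypothetical layout


def interaction_matrix(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Average the scores that fall into each (species, pressure) cell."""
    cells = defaultdict(list)
    for species, pressure, score in records:
        cells[(species, pressure)].append(score)
    return {key: mean(values) for key, values in cells.items()}


# Placeholder scores for illustration only; not data from the benchmark.
toy_records = [
    ("companion", "economic", 1.0),
    ("companion", "economic", 0.5),
    ("farmed", "economic", 0.25),
    ("farmed", "social", 0.5),
]
matrix = interaction_matrix(toy_records)
```

Reading across a row of such a matrix shows how one species group fares under each pressure type; reading down a column shows which groups a given pressure type affects most.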
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




