arXiv

Do LLMs Hold Their Values? MANTA: A Multi-Turn Adversarial Benchmark for Animal Welfare Reasoning

Abstract:

Large Language Models (LLMs) are being integrated rapidly into professional and consumer settings, where welfare considerations often appear implicitly in everyday queries, yet assessing their reasoning about animal welfare remains an unresolved challenge. Current evaluation frameworks, such as AnimalHarmBench, rely on single-turn, explicitly framed questions to determine whether models refuse to generate harmful content when directly solicited. This methodology does not capture two aspects of model behavior: the erosion of alignment under prolonged adversarial pressure, and "moral sensitivity," defined as the model’s tendency to spontaneously highlight welfare issues in routine interactions.

To address these limitations, we introduce MANTA, a novel benchmark comprising 1,088 multi-turn conversations. These interactions begin with an implicit scenario in Turn-1, evolve into an explicit welfare prompt, and subsequently undergo three rounds of adversarial pressure. These pressures are categorized into a five-type taxonomy: Social, Cultural, Economic, Pragmatic, and Epistemic. We assess performance across two metrics: Animal Welfare Value Stability (AWVS), serving as the primary score, and Animal Welfare Moral Sensitivity (AWMS), used as a diagnostic measure.
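The conversation structure described above can be sketched as a small data model. This is only an illustration, not the released MANTA schema: the names `PressureType`, `Turn`, and `build_turn_plan` are hypothetical, and the sketch assumes the explicit welfare prompt occupies a single turn (Turn 2), with the three pressure rounds at Turns 3 to 5.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class PressureType(Enum):
    """The five pressure categories named in the abstract."""
    SOCIAL = "social"
    CULTURAL = "cultural"
    ECONOMIC = "economic"
    PRAGMATIC = "pragmatic"
    EPISTEMIC = "epistemic"


@dataclass
class Turn:
    index: int                               # 1-based position in the conversation
    role: str                                # "implicit", "explicit", or "pressure"
    pressure: Optional[PressureType] = None  # set only for pressure turns


def build_turn_plan(pressures: List[PressureType]) -> List[Turn]:
    """Build a five-turn plan: implicit scenario, explicit prompt, three pressure rounds.

    Assumption: the turn numbering (2 for the explicit prompt, 3-5 for pressure)
    is illustrative; the abstract fixes only the order of the stages.
    """
    if len(pressures) != 3:
        raise ValueError("expected exactly three pressure rounds")
    plan = [Turn(1, "implicit"), Turn(2, "explicit")]
    for offset, pressure in enumerate(pressures):
        plan.append(Turn(3 + offset, "pressure", pressure))
    return plan
```

Calling `build_turn_plan([PressureType.SOCIAL, PressureType.ECONOMIC, PressureType.EPISTEMIC])` yields five `Turn` objects in order. Whether a single conversation mixes pressure types or repeats one type is not stated in the abstract, so the sketch accepts either.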

Our study evaluates seven leading models: Claude Opus 4.7, GPT-5.5, DeepSeek V4, Llama 3.3 70B, Mistral Small, Grok 4.3, and Gemini 3.1 Flash Lite. The multi-turn approach reveals behavioral nuances that single-turn benchmarks overlook; notably, four of the seven models changed their relative ranking between the Turn-1 results and later turns. For instance, Gemini 3.1 Flash Lite fell from fifth place on AWMS to last place on AWVS. Furthermore, while AWMS and AWVS are positively correlated, the correlation is imperfect, indicating that tests of moral recognition capture a stable yet incomplete aspect of model behavior under stress.
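The rank comparison between the two metrics can be illustrated with a short sketch. The helper names `ranks` and `rank_shifts` are hypothetical, the sketch assumes higher scores are better on both metrics, and the scores in the usage note are made-up placeholders for anonymous models, not results from the paper.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each model to its 1-based rank, with the highest score ranked first.

    Ties keep the input order, because Python's sort is stable.
    """
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def rank_shifts(awms: Dict[str, float], awvs: Dict[str, float]) -> Dict[str, int]:
    """Return the models whose rank differs between AWMS and AWVS.

    A positive value means the model dropped (worse rank) on AWVS.
    """
    rank_ms = ranks(awms)
    rank_vs = ranks(awvs)
    return {
        name: rank_vs[name] - rank_ms[name]
        for name in awms
        if rank_vs[name] != rank_ms[name]
    }
```

With placeholder inputs `awms = {"model_a": 0.9, "model_b": 0.7, "model_c": 0.5, "model_d": 0.3}` and `awvs = {"model_a": 0.8, "model_b": 0.4, "model_c": 0.6, "model_d": 0.2}`, `rank_shifts(awms, awvs)` reports that `model_b` drops one place and `model_c` gains one, while the other two keep their ranks.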

MANTA also supports a species-by-pressure interaction matrix, which previous benchmarks do not provide. This analysis shows that welfare robustness depends on the interplay between the specific animal involved and the type of pressure applied. In general, robustness scores were highest for scenarios involving companion animals, followed by wild animals, with farmed animals and invertebrates scoring lowest. We have publicly released the dataset, scripted pressure plans, judge prompts, and the associated analysis code.


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
