arXiv

BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format

Abstract:

Discussions of "runaway optimization" in AI alignment frequently center on reinforcement learning (RL) agents: systems defined by unbounded utility maximization that over-optimize proxy goals, as in the "paperclip maximizer" scenario or specification gaming, often at the cost of other critical factors. It is commonly presumed that Large Language Model (LLM)-based systems present a lower safety risk because they operate as next-token predictors rather than persistent optimizers. To challenge this assumption, we empirically evaluate LLMs within simplified, long-horizon control environments that demand the maintenance of state or the balancing of objectives over extended periods. These benchmarks include single- and multi-objective homeostasis, the management of unbounded objectives subject to diminishing returns, and the sustainability of renewable resources.

Our results indicate that although LLMs frequently follow instructions and demonstrate a clear grasp of the stated goals for many steps, they often lose context in structured ways. This leads to runaway behaviors, such as disregarding homeostatic targets or collapsing multi-objective trade-offs into single-objective maximization, thereby violating concave utility structures. These failures appear reliably after initial phases of competent performance and display distinct patterns, including self-imitative oscillations, unbounded maximization, and a regression to single-objective optimization, even when the context window remains largely unused.

Crucially, the issue is not merely a loss of context resulting in incoherence. While LLMs may appear to handle multiple objectives and remain bounded superficially, their behavior under sustained interaction involving complex objectives is systematically skewed toward that of single-objective, unbounded, and poorly aligned optimizers. We propose a hypothesis involving a token-level pattern reinforcement attractor, suggesting that LLMs increasingly base their actions on the token patterns of their recent action history rather than adhering to original instructions. The specific reasons why these failures are predominantly triggered in multi-objective settings remain an open question.
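
To make the terms "homeostatic target" and "concave utility structure" concrete, the following is a minimal sketch, not the benchmark's actual scoring. The function names, the quadratic penalty, and the logarithmic utility are all assumptions chosen for illustration; the abstract does not specify the scoring formulas.

```python
import math


def homeostasis_score(level: float, target: float) -> float:
    """Hypothetical inverted-U score: best at the target, worse on either side.

    Assumption: a quadratic penalty. The abstract does not give the formula.
    """
    return -((level - target) ** 2)


def diminishing_returns_utility(amounts: list[float]) -> float:
    """Hypothetical concave utility: each objective contributes log(1 + amount).

    Assumption: a logarithmic form, used only to illustrate diminishing returns.
    """
    return sum(math.log1p(a) for a in amounts)


# Overshooting a homeostatic target scores worse than holding it.
at_target = homeostasis_score(level=10.0, target=10.0)
overshoot = homeostasis_score(level=30.0, target=10.0)

# With concave utilities, splitting the same total across two objectives
# beats pouring everything into one of them.
balanced = diminishing_returns_utility([50.0, 50.0])
single_objective = diminishing_returns_utility([100.0, 0.0])

print(f"homeostasis: at target {at_target:.1f}, overshoot {overshoot:.1f}")
print(f"utility: balanced {balanced:.2f}, single objective {single_objective:.2f}")
```

Under these assumed scoring rules, an agent that drifts into unbounded maximization of one quantity loses score in both settings, which is the kind of failure the abstract describes.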


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
