Safety Game: Inference-Time Alignment of Black-Box LLMs via Constrained Optimization
Abstract:
Achieving compliance with safety standards is a critical hurdle in deploying large language models (LLMs). Current alignment strategies largely depend on training-phase interventions, such as reinforcement learning from human feedback or fine-tuning, but these techniques are often expensive and rigid, necessitating complete retraining whenever safety protocols change. Recent inference-time alignment methods aim to address these constraints, but they typically rely on access to internal model mechanisms, which makes them unsuitable for third-party stakeholders who lack such access.
To overcome these barriers, this study introduces a model-independent framework for safety alignment that functions as a black box, eliminating the need for retraining or knowledge of the underlying LLM architecture. As a proof of concept, we tackle the challenge of balancing the generation of safe but potentially uninformative responses against helpful answers that may carry risks. We conceptualize this trade-off as a two-player zero-sum game, where the minimax equilibrium represents the ideal compromise between utility and safety. Within this framework, LLM agents utilize a linear programming solver during inference to calculate equilibrium strategies. Our findings confirm the viability of black-box safety alignment, presenting a scalable and inclusive solution that enables diverse stakeholders—including smaller entities and those with limited resources—to maintain safety standards within the fast-changing LLM landscape.
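To make the minimax idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical 2-by-n payoff matrix in which the row player picks between two candidate responses (say, a safe refusal and a helpful answer) and the column player picks the scenario least favorable to that choice. The payoff numbers, the function name `solve_two_action_game`, and the interpretation of rows and columns are all illustrative assumptions. For such a small game the maximin linear program has a single decision variable, so the sketch enumerates its candidate vertices exactly instead of calling a general-purpose LP solver.

```python
from fractions import Fraction


def solve_two_action_game(payoff):
    """Solve a 2-by-n zero-sum game for the row player (the maximizer).

    payoff[i][j] is the row player's payoff when it plays row i (0 or 1)
    and the column player plays column j. Returns (p, value), where p is
    the equilibrium probability of playing row 0 and value is the
    guaranteed worst-case expected payoff.
    """
    row0 = [Fraction(x) for x in payoff[0]]
    row1 = [Fraction(x) for x in payoff[1]]
    n = len(row0)

    # For a mixed strategy (p, 1 - p), the expected payoff against column j
    # is the line p * row0[j] + (1 - p) * row1[j]. The maximin objective is
    # the lower envelope of these lines, so its maximum over [0, 1] lies at
    # an endpoint or where two lines cross.
    candidates = {Fraction(0), Fraction(1)}
    for j in range(n):
        for k in range(j + 1, n):
            denom = (row0[j] - row1[j]) - (row0[k] - row1[k])
            if denom != 0:  # parallel lines never cross
                p = (row1[k] - row1[j]) / denom
                if 0 <= p <= 1:
                    candidates.add(p)

    def worst_case(p):
        # Payoff the row player is guaranteed against a best-responding opponent.
        return min(p * row0[j] + (1 - p) * row1[j] for j in range(n))

    best_p = max(candidates, key=worst_case)
    return best_p, worst_case(best_p)


# Hypothetical payoffs (assumed for illustration only):
# rows = [safe refusal, helpful answer], columns = [benign query, harmful query].
example = [[1, 4],
           [5, 0]]
p_safe, game_value = solve_two_action_game(example)
print(p_safe, game_value)  # 5/8 5/2
```

With these assumed payoffs neither pure response is optimal: always refusing guarantees only 1 and always answering guarantees only 0, while mixing (refusing with probability 5/8) guarantees 5/2. A game with more than two row actions would need a general LP solver, as the abstract describes.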
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





