Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation
Abstract: As large language models (LLMs) are deployed in a widening range of real-world applications, they must adhere to bespoke behavioral and safety guidelines, collectively called specifications (specs), that are tailored by users or organizations. These specs, divided into safety-specs and behavioral-specs, vary substantially across scenarios and evolve as needs and preferences change. This paper formalizes the problem as "specification alignment": the ability of LLMs to follow dynamic, scenario-specific specs from both the safety and the behavioral perspective. To address it, we introduce Align3, a lightweight method that applies Test-Time Deliberation (TTD) with hierarchical reflection and revision to reason over specification boundaries. We also release SpecBench, a unified benchmark for measuring specification alignment that covers 5 scenarios, 103 specs, and 1,500 prompts. Experiments on 15 reasoning models and 18 instruction-tuned models, evaluated with several TTD methods including Self-Refine, TPO, and MoreThink, yield three key findings: (i) test-time deliberation improves specification alignment; (ii) Align3 advances the safety-helpfulness trade-off frontier with negligible computational overhead; and (iii) SpecBench effectively reveals specific alignment gaps. These findings highlight test-time deliberation as an effective strategy for reasoning over real-world specification boundaries. Code and related resources are available at https://github.com/zzzhr97/SpecBench.
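To make the idea of test-time deliberation with hierarchical reflection and revision more concrete, here is a toy sketch under stated assumptions. It is not Align3's implementation: the `Spec` class, the `deliberate` and `toy_revise` functions, the string-matching checkers, and the choice to check safety specs before behavioral specs are all hypothetical stand-ins for what would, in practice, be LLM calls for reflection and revision.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Spec:
    """Hypothetical specification: a name, a category, and a compliance check."""
    name: str
    category: str                 # "safety" or "behavioral" (assumed labels)
    check: Callable[[str], bool]  # True if the response complies with this spec


def violated(specs: List[Spec], response: str, category: str) -> List[Spec]:
    """Return the specs of the given category that the response fails."""
    return [s for s in specs if s.category == category and not s.check(response)]


def deliberate(draft: str, specs: List[Spec],
               revise: Callable[[str, List[Spec]], str],
               max_rounds: int = 3) -> str:
    """Toy hierarchical reflect-and-revise loop (illustrative assumption).

    Each round reflects on safety specs first; behavioral specs are only
    examined once no safety spec is violated. The response is revised against
    whichever failures were found. If the round budget runs out, the latest
    revision is returned without a further check.
    """
    response = draft
    for _ in range(max_rounds):
        failures = violated(specs, response, "safety")
        if not failures:
            failures = violated(specs, response, "behavioral")
        if not failures:
            return response  # compliant with every spec
        response = revise(response, failures)
    return response


# Hypothetical specs with simple string checks standing in for an LLM judge.
specs = [
    Spec("no-credentials", "safety", lambda r: "password" not in r),
    Spec("ends-with-period", "behavioral", lambda r: r.endswith(".")),
]


def toy_revise(response: str, failures: List[Spec]) -> str:
    """Stand-in for an LLM revision call: fix each failed spec mechanically."""
    for spec in failures:
        if spec.name == "no-credentials":
            response = response.replace("password", "[redacted]")
        elif spec.name == "ends-with-period":
            response = response.rstrip() + "."
    return response


print(deliberate("Your password is hunter2", specs, toy_revise))
# prints "Your [redacted] is hunter2."
```

In this run, the first round fixes the safety violation, the second fixes the behavioral one, and the third confirms compliance. Ordering the checks this way keeps a behavioral fix from being applied to a response that is still unsafe; whether a real system should order them like this is a design assumption of the sketch.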
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC



