arXiv

Robust Asynchronous Planning via Auto-Formalization

June 2, 2026 · Jiayi Zhang, Jianing Yin, Ben Zhou, Li Zhang

Title: Scalable Asynchronous Planning Through Auto-Formalization

Original: arXiv:2606.00981v1. Abstract: LLMs can plan by either generating action sequences directly as a Planner or translating tasks into domain specific language for an external solver as a Formalizer. While most real-world tasks are asynchronous with non-uniform durations, concurrency, and execution-time constraints, existing benchmarks hardly cover them. We unify these asynchronous planning challenges under a single formulation and introduce the first three benchmarks that address each at scale. We conclude that the choice of formal representation primarily determines whether planning scales: as dependency graphs grow from 5 to 100 actions, Planner collapses from 96% to 5% plan accuracy and PDDL2.1 Formalizer from 13% to 0%, while CP-SAT Formalizer averages 94% and still achieves 83% at 100 actions. Faithfulness diagnostics show that PDDL2.1's predicate-based planning representation becomes brittle compared to general constraint satisfaction programs, when LLMs must keep predicates, effects, and goals consistent. Execution-time updates of planning constraints further degrade performance sharply (Planner 23.9%, PDDL2.1 0.7%, CP-SAT 46.1%), but a state-aware repair strategy that updates only event-induced constraints recovers CP-SAT Formalizer to 84.5%.

Rewrite:

Large language models (LLMs) can plan in two ways: as a "Planner" that generates action sequences directly, or as a "Formalizer" that translates a task into a domain-specific language for an external solver. Most real-world tasks are asynchronous, with non-uniform durations, concurrency, and execution-time constraints, yet existing benchmarks hardly cover them. This work unifies these asynchronous planning challenges under a single formulation and introduces the first three benchmarks that address each of them at scale.

The results indicate that the choice of formal representation primarily determines whether planning scales. As dependency graphs grow from 5 to 100 actions, plan accuracy for the Planner collapses from 96% to 5%, and for the PDDL2.1 Formalizer it falls from 13% to 0%. In contrast, the CP-SAT Formalizer averages 94% and still reaches 83% at 100 actions. Faithfulness diagnostics show that PDDL2.1's predicate-based planning representation becomes brittle compared with general constraint satisfaction programs when LLMs must keep predicates, effects, and goals consistent.
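To make the notion of a dependency graph with durations and concurrency concrete, here is a minimal sketch in plain Python. It is not the paper's formulation and does not call a CP-SAT solver: the toy task, the action names, and the helpers `earliest_start_schedule`, `makespan`, and `violates` are all assumptions made for illustration. It shows only the kind of constraints involved, namely that each action has a duration and may not start before its prerequisites finish.

```python
from graphlib import TopologicalSorter

# Hypothetical toy task: action -> duration in minutes.
durations = {"boil_water": 10, "chop_vegetables": 5, "cook_soup": 15}
# Hypothetical dependencies: action -> actions that must finish first.
depends_on = {"cook_soup": {"boil_water", "chop_vegetables"}}


def earliest_start_schedule(durations, depends_on):
    """Start every action as soon as all of its prerequisites have finished."""
    graph = {action: depends_on.get(action, set()) for action in durations}
    start = {}
    # static_order() yields each action after all of its prerequisites.
    for action in TopologicalSorter(graph).static_order():
        start[action] = max(
            (start[dep] + durations[dep] for dep in graph[action]), default=0
        )
    return start


def makespan(start, durations):
    """Time at which the last action finishes."""
    return max(start[a] + durations[a] for a in durations)


def violates(start, durations, depends_on):
    """Return the (prerequisite, action) constraints a candidate plan breaks."""
    return [
        (dep, action)
        for action, deps in depends_on.items()
        for dep in deps
        if start[action] < start[dep] + durations[dep]
    ]


schedule = earliest_start_schedule(durations, depends_on)
```

In this toy case the two independent actions run concurrently, so the plan finishes in 25 minutes rather than the 30 a strictly sequential plan would need. A constraint solver would accept the same durations and precedence relations as declarative constraints instead of computing start times procedurally.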

Updating planning constraints at execution time degrades performance sharply, to 23.9% for the Planner, 0.7% for the PDDL2.1 Formalizer, and 46.1% for the CP-SAT Formalizer. However, a state-aware repair strategy that updates only the event-induced constraints recovers the CP-SAT Formalizer to 84.5%.
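The idea of a state-aware repair can be illustrated with a small plain-Python sketch. This is an assumption about what such a repair might look like in the simplest setting, not the strategy evaluated in the paper: the function `repair_schedule`, its parameters, and the toy event are hypothetical. Actions that have already started keep their start times, and only the remaining actions are re-timed against the constraints changed by the event.

```python
from graphlib import TopologicalSorter


def repair_schedule(start, durations, depends_on, now, new_durations):
    """Re-time only actions that have not started by `now`; keep the rest fixed."""
    updated = {**durations, **new_durations}  # apply the event-induced changes
    graph = {a: depends_on.get(a, set()) for a in updated}
    repaired = {}
    for action in TopologicalSorter(graph).static_order():
        if start[action] < now:
            # Already started: the execution history cannot be changed.
            repaired[action] = start[action]
        else:
            # Not started yet: wait for prerequisites under the new durations.
            ready = max(
                (repaired[d] + updated[d] for d in graph[action]), default=0
            )
            repaired[action] = max(now, ready)
    return repaired


# Hypothetical scenario: at minute 8 we learn that boiling takes 20 minutes.
durations = {"boil_water": 10, "chop_vegetables": 5, "cook_soup": 15}
depends_on = {"cook_soup": {"boil_water", "chop_vegetables"}}
original = {"boil_water": 0, "chop_vegetables": 0, "cook_soup": 10}
repaired = repair_schedule(
    original, durations, depends_on, now=8, new_durations={"boil_water": 20}
)
```

Here only `cook_soup` moves, from minute 10 to minute 20, while the two actions already underway are left untouched. In a solver-based setting, the analogous step would be to fix the variables of past actions and replace only the constraints the event affects, rather than regenerating the whole model.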

Source: arXiv Generated at: 2026-06-02 00:00:00 UTC