Can Large Language Models Generalize Procedures Across Representations?
Abstract: Although large language models (LLMs) are rigorously trained and evaluated on symbolic formats such as code and graphs, practical user requirements are usually expressed in natural language. This raises the question of how well LLMs transfer skills between these representations. To investigate, we study isomorphic tasks in which the same procedure is encoded as code, drawn as a graph, or described in natural language (for example, step scheduling in planning scenarios). We find that post-training on graph or code data alone does not ensure robust generalization to the natural language equivalents, while training on natural language data alone yields inefficient performance gains. To bridge this gap, we introduce a two-stage reinforcement learning curriculum that trains on symbolic data first and then on natural language data. This approach significantly boosts performance across model architectures and task types. Notably, a 1.5B-parameter Qwen model trained with our method performs comparably to zero-shot GPT-4o on naturalistic planning tasks. Furthermore, our analysis interprets successful cross-representation generalization as a form of generative analogy, a capability that our curriculum actively fosters. The dataset and code used in this paper are available at https://github.com/fangru-lin/procedure_generalization_llm.
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC




