A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners
Title: Investigating World Model Reconstruction in Supervised Fine-Tuned LLM Planners
Original: arXiv:2606.03685v1 (announce type: cross)
Abstract: Supervised fine-tuning (SFT) improves end-to-end classical planning in large language models (LLMs), but do these models also learn to represent and reason about the planning problems they are solving? Due to the relative complexity of classical planning problems and the challenge that end-to-end plan generation poses for LLMs, it has been difficult to explore this question. In our work, we devise and perform a series of interpretability experiments that holistically interrogate world model recovery by examining both internal representations and generative capabilities of fine-tuned LLMs. We find that: a) Supervised fine-tuning on valid action sequences enables LLMs to linearly encode action validity and some state predicates. b) Models that struggle to use output probabilities for classifying action validity may still learn internal representations that separate valid from invalid actions. c) Broader state space coverage during fine-tuning, such as from random walk data, yields more accurate recovery of the underlying world model. In summary, this work contributes a recipe for applying interpretability techniques to planning LLMs and generates insights that shed light on open questions about how knowledge is represented in LLMs.
Rewritten:
Title: An In-Depth Analysis of World Model Reconstruction in Supervised Fine-Tuned LLM Planners
Abstract: Supervised fine-tuning (SFT) improves the ability of large language models (LLMs) to perform end-to-end classical planning, but it remains unclear whether these models also learn to represent and reason about the planning problems they solve. The relative complexity of classical planning problems, combined with the difficulty LLMs have in generating complete plans end-to-end, has made this question hard to explore. To address it, our study devises and performs a series of interpretability experiments that assess world model recovery by examining both the internal representations and the generative capabilities of fine-tuned LLMs. We report three findings. First, SFT on valid action sequences enables LLMs to linearly encode action validity and some state predicates. Second, models that struggle to use output probabilities to classify action validity may still learn internal representations that separate valid from invalid actions. Third, broader state space coverage during fine-tuning, such as from random walk data, yields more accurate recovery of the underlying world model. In summary, this work contributes a recipe for applying interpretability techniques to planning LLMs and offers insights into open questions about how knowledge is represented in LLMs.
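The claim that action validity is "linearly encoded" is usually tested with a linear probe: a linear classifier trained on frozen hidden states to predict a label. The sketch below illustrates that general idea only. It is not the authors' experimental setup, and the abstract does not say which probe, layer, or planning domain they use. The activations are synthetic stand-ins generated with NumPy, and every name (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`) and every setting (dimension, separation strength, learning rate) is a hypothetical choice made for illustration.

```python
import numpy as np

def make_synthetic_activations(n=2000, dim=64, seed=0):
    """Hypothetical stand-in for LLM hidden states (assumption, not real model data).

    Validity is planted along a single random direction plus Gaussian noise,
    so a linear probe should be able to recover it.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    labels = rng.integers(0, 2, size=n)  # 1 = valid action, 0 = invalid action
    noise = rng.normal(size=(n, dim))
    # Shift each sample by +3 or -3 along the planted direction.
    acts = noise + 3.0 * np.outer(2.0 * labels - 1.0, direction)
    return acts, labels

def train_linear_probe(x, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = probs - y  # gradient of the mean log loss w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(x, y, w, b):
    """Fraction of samples whose predicted label (logit > 0) matches y."""
    preds = (x @ w + b > 0).astype(int)
    return float((preds == y).mean())

acts, labels = make_synthetic_activations()
split = 1500  # first 1500 samples for training, the rest held out
w, b = train_linear_probe(acts[:split], labels[:split])
test_acc = probe_accuracy(acts[split:], labels[split:], w, b)
print(f"held-out probe accuracy: {test_acc:.3f}")
```

With real models, `acts` would be replaced by hidden states extracted at a chosen layer and token position, and `labels` by ground-truth action validity from a planner or simulator. High held-out accuracy from a linear probe is evidence that the information is linearly decodable, not proof that the model uses it when generating plans.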
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



