Test-Time Deep Thinking to Explore Implicit Rules
Title: Leveraging Test-Time Deep Thinking to Uncover Implicit Rules
Abstract: As Large Language Models (LLMs) continue to evolve, intelligent agents are gaining increasing prominence. Nevertheless, these agents frequently struggle in environments defined by implicit rules—unseen constraints that cannot be directly observed and must instead be deduced through interaction. Such difficulties often trap agents in cyclical trial-and-error patterns, resulting in task failure. To tackle this issue, we present TTExplore, a framework in which a "thinker" module examines interaction histories to deduce these hidden rules and direct an "actor." Success in this context relies heavily on the thinker's reasoning capabilities. However, assessing deep reasoning paths is inherently unstable and challenging, creating a significant barrier to effective training. We address this problem by introducing a new, stable reinforcement learning pipeline. The fundamental concept involves utilizing precise task-level scores as indirect rewards, thereby sidestepping the complexity of evaluating intermediate reasoning steps. Additionally, we limit each trajectory to a single thinking node to mitigate reward sparsity. Leveraging this approach, we trained a dedicated 7B model, Exp-Thinker. Evaluations across five text-based embodied tasks reveal that TTExplore, when paired with Exp-Thinker, enhances baseline agent performance by an average of 14–19 points, highlighting the efficacy of explicitly reasoning about implicit rules.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
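
The control flow described in the abstract (an actor that interacts, a thinker that reads the interaction history once per trajectory, and a task-level score used as the indirect reward for that single thought) can be illustrated with a minimal toy sketch. Everything here is a hypothetical stand-in and not the authors' implementation: `ToyEnv`, `toy_thinker`, and `run_episode` are invented names, the thinker is a hand-written rule rather than an LLM, and no reinforcement learning update is performed. The sketch only shows where the reward for the thinking node would come from.

```python
from typing import List, Optional, Tuple

History = List[Tuple[str, str]]  # (action, observation) pairs


class ToyEnv:
    """Hypothetical environment with one implicit rule:
    the door only opens after the lamp has been turned on."""

    def __init__(self) -> None:
        self.lamp_on = False
        self.solved = False

    def step(self, action: str) -> str:
        if action == "turn on lamp":
            self.lamp_on = True
            return "The lamp is on."
        if action == "open door":
            if self.lamp_on:
                self.solved = True
                return "The door opens."
            return "Nothing happens."
        return "Unknown action."


def toy_thinker(history: History) -> str:
    """Stand-in for the thinker: read the history and guess the hidden rule.
    A real system would prompt a language model here (assumption)."""
    failed = [action for action, obs in history if obs == "Nothing happens."]
    if "open door" in failed:
        return "turn on lamp"  # inferred precondition, passed to the actor
    return ""


def run_episode(
    use_thinker: bool, think_at: int = 2, max_steps: int = 6
) -> Tuple[float, History, Optional[str]]:
    """Run one trajectory with at most one thinking node.
    Returns the task-level score, the history, and the single thought."""
    env = ToyEnv()
    history: History = []
    pending: List[str] = []  # actions suggested by the thinker
    thought: Optional[str] = None

    for t in range(max_steps):
        # Single thinking node: the thinker is called at most once.
        if use_thinker and t == think_at and thought is None:
            thought = toy_thinker(history)
            if thought:
                pending.append(thought)

        # Naive actor: follow the thinker's suggestion, else retry the goal.
        action = pending.pop(0) if pending else "open door"
        history.append((action, env.step(action)))
        if env.solved:
            break

    # The task-level score doubles as the indirect reward for the thought,
    # so no intermediate reasoning step has to be graded.
    score = 1.0 if env.solved else 0.0
    return score, history, thought


if __name__ == "__main__":
    for flag in (False, True):
        score, history, thought = run_episode(use_thinker=flag)
        print(f"use_thinker={flag} score={score} steps={len(history)} thought={thought!r}")
```

Without the thinker, the toy actor repeats the same failing action until the step budget runs out, which mirrors the cyclical trial-and-error pattern the abstract describes. With a single thinking node, the whole trajectory yields exactly one (thought, score) pair, which is the kind of sample a task-level reward could be attached to.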




