Step-Level Sparse Autoencoder for Reasoning Process Interpretation
Title: Disentangling Reasoning Steps with Sparse Autoencoders
Abstract: Although Large Language Models (LLMs) have demonstrated strong complex reasoning abilities via Chain-of-Thought (CoT) strategies, their internal reasoning patterns remain difficult to decipher. While Sparse Autoencoders (SAEs) have become a potent instrument for interpretability, current methods primarily function at the token level. This creates a granularity mismatch when attempting to capture crucial step-level information, such as semantic transitions and reasoning direction. To address this, we introduce the Step-level Sparse Autoencoder (SSAE), an analytical framework designed to separate various facets of an LLM’s reasoning steps into distinct sparse features. By precisely regulating the sparsity of step features relative to their context, we establish an information bottleneck during step reconstruction. This mechanism isolates the incremental information in each step from background noise and distributes it across several sparsely activated dimensions. Our experiments, conducted across various base models and reasoning tasks, validate the utility of the extracted features. Through linear probing, we successfully predict both surface-level metrics, including generation length and the distribution of the first token, and more complex attributes, such as the logical validity and correctness of the step. These findings suggest that LLMs possess at least a partial awareness of these properties during the generation process, thereby laying the groundwork for their self-verification capabilities. The code is accessible at https://github.com/Miaow-Lab/SSAE.
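To make the bottleneck idea concrete, the sketch below shows one way a context-conditioned sparse reconstruction objective could look. It is an illustration only and is not taken from the SSAE codebase: the function name `step_sae_loss`, the weight matrices `W_enc`, `W_dec`, and `W_ctx`, the ReLU encoder, and the L1 penalty weight `l1_coeff` are all assumptions chosen for simplicity. The context supplies what it can of the reconstruction, and the penalized sparse features have to carry the rest.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU, which keeps features non-negative."""
    return [max(0.0, x) for x in v]


def step_sae_loss(
    step: Vector,
    context: Vector,
    W_enc: Matrix,
    W_dec: Matrix,
    W_ctx: Matrix,
    l1_coeff: float,
) -> Tuple[Vector, float]:
    """Hypothetical objective: reconstruct a step vector from its context
    plus a sparse feature vector, with an L1 penalty on the features."""
    # Encode the step representation into non-negative features.
    features = relu(matvec(W_enc, step))
    # Reconstruct the step from the context term plus the feature term.
    ctx_part = matvec(W_ctx, context)
    feat_part = matvec(W_dec, features)
    recon = [c + f for c, f in zip(ctx_part, feat_part)]
    # Mean squared reconstruction error over the step dimensions.
    mse = sum((r - s) ** 2 for r, s in zip(recon, step)) / len(step)
    # L1 sparsity penalty (features are already non-negative).
    l1 = sum(features)
    return features, mse + l1_coeff * l1


# Toy example with 2-dimensional vectors and 3 candidate features.
step = [1.0, -1.0]
context = [0.5, 0.5]
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_ctx = [[0.0, 0.0], [-1.0, -1.0]]

features, loss = step_sae_loss(step, context, W_enc, W_dec, W_ctx, l1_coeff=0.1)
print(features, loss)
```

In this toy setting the context already accounts for the second coordinate of the step, so a single active feature is enough for an exact reconstruction and the loss reduces to the sparsity penalty alone. Raising `l1_coeff` tightens the bottleneck; how SSAE itself regulates sparsity is specified in the paper and repository, not here.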
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




