Not All Explanations Simulate Equally: Comparing Verbalized Feature Attributions and Self-Generated Rationales
Title: Divergent Simulation Capabilities: A Comparative Study of Verbalized Feature Attributions and Self-Generated Rationales
Abstract: Natural-language explanations are often treated as a uniform medium for interpreting model behavior, but how well an explanation supports simulation may depend heavily on its source. This study compares two kinds of explanations used with question-answering systems: verbalized feature attributions and self-generated rationales. We evaluate both within a single counterfactual simulation framework, in which an LLM-based judge acts as the predictor, to determine which explanation type yields more accurate forecasts of a model’s responses to subsequent queries. Across several instruction-tuned models, we examine how explanation source, verbalization method, and feature granularity affect simulatability. We find that both explanation format and granularity influence simulatability: attribution-based explanations and self-generated rationales improve counterfactual prediction to different degrees, and the outcomes vary across models and formats.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





