Data Attribution in Large Language Models via Bidirectional Gradient Optimization
Title: Leveraging Bidirectional Gradient Optimization for Data Attribution in Large Language Models
Abstract:
As Large Language Models (LLMs) become integral to a wide array of applications, questions of governance, accountability, and data provenance have come to the forefront. A core unresolved challenge is identifying which specific training examples exerted the greatest influence on a model’s generated output. To address this, we propose a novel approach to Training Data Attribution (TDA) for auto-regressive LLMs. Our method takes an inverse perspective and asks: how would the training data have been impacted if the model had encountered the generated output during its training phase?
Our technique perturbs a base model through bidirectional gradient optimization, applying both gradient ascent and gradient descent on a specific text sample. The resulting shifts in loss across training samples indicate each sample's influence. The framework supports attribution at any level of data granularity and enables both factual and stylistic analysis. We benchmarked our approach against existing baselines using pretrained models whose training datasets have known provenance. Our results show better performance on influence metrics than prior methods, advancing model interpretability, a crucial component for ensuring the accountability of AI systems.
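The perturb-and-measure idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: a tiny bigram model with a table of logits stands in for the LLM, a single gradient step stands in for the optimization, and the scoring rule (loss after ascent minus loss after descent) is an assumption chosen for illustration. All function names and the learning rate are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]

def loss(logits, seq):
    """Mean next-token negative log-likelihood of seq under a bigram model."""
    pairs = list(zip(seq, seq[1:]))
    return -sum(math.log(softmax(logits[a])[b]) for a, b in pairs) / len(pairs)

def grad(logits, seq):
    """Gradient of loss(logits, seq) with respect to the logit table."""
    v = len(logits)
    g = [[0.0] * v for _ in range(v)]
    pairs = list(zip(seq, seq[1:]))
    for a, b in pairs:
        p = softmax(logits[a])
        for k in range(v):
            g[a][k] += (p[k] - (1.0 if k == b else 0.0)) / len(pairs)
    return g

def step(logits, g, lr):
    """Return a perturbed copy of logits: lr > 0 is descent, lr < 0 is ascent."""
    return [[w - lr * gw for w, gw in zip(row, grow)]
            for row, grow in zip(logits, g)]

def attribution_scores(logits, query, train_set, lr=0.5):
    """Score each training sequence by how its loss reacts to the query.

    Assumed scoring rule: loss under the ascended model minus loss under
    the descended model. A larger score means the sample's loss moves
    together with the query's loss.
    """
    g = grad(logits, query)
    descended = step(logits, g, lr)    # model nudged toward the query
    ascended = step(logits, g, -lr)    # model nudged away from the query
    return [loss(ascended, s) - loss(descended, s) for s in train_set]

if __name__ == "__main__":
    base = [[0.0] * 4 for _ in range(4)]        # untrained toy model, vocab of 4
    generated = [0, 1, 0, 1]                    # stand-in for a model output
    training = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 2]]
    for seq, score in zip(training, attribution_scores(base, generated, training)):
        print(seq, score)
```

In this toy setting, a training sequence that shares token transitions with the generated text receives a positive score, while one that shares none is unaffected by either perturbation. Because scores are computed per sequence, the same loop could in principle be run over documents, passages, or whole sources, which is one way to read the abstract's claim about granularity.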
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC






