Sequential Data Poisoning in LLM Post-Training
Abstract: Large language model (LLM) post-training typically proceeds in multiple stages, such as supervised fine-tuning (SFT) followed by reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO). Each stage relies on data from distinct and potentially untrusted sources, yet existing research has focused on data poisoning attacks at individual stages and has overlooked scenarios involving multiple attackers. To assess the trustworthiness of the entire post-training pipeline, we introduce the threat model of sequential data poisoning, in which separate adversaries poison the SFT and preference datasets. Our analysis reveals a "single-attacker illusion": when evaluated in isolation, each adversary appears to pose a minimal threat. However, collaboration across stages exposes significant vulnerabilities. In the SFT $\to$ DPO pipeline, the attackers' effects are additive: distributing a fixed poison budget across both stages yields a stronger attack than concentrating it in one. In the SFT $\to$ PPO pipeline, by contrast, the contributions are complementary: poisoning only the SFT data or only the reward model fails, but poisoning both succeeds. These results demonstrate that security assessments of isolated post-training stages systematically underestimate the compound vulnerabilities arising from their interaction. Code is available at https://github.com/jcksanderson/sequential-poisoning.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
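The abstract compares concentrating a fixed poison budget in one stage against distributing it across the SFT and preference stages. The following is a minimal sketch of that bookkeeping only, not the authors' implementation: the names `PoisonPlan`, `split_poison_budget`, and `make_plan`, the `sft_fraction` parameter, and the uniform random choice of which examples to poison are all assumptions made for illustration. The linked repository may organize this differently.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PoisonPlan:
    """Indices each (hypothetical) adversary would modify in its own dataset."""
    sft_indices: list[int]
    pref_indices: list[int]


def split_poison_budget(total_budget: int, sft_fraction: float) -> tuple[int, int]:
    """Split a fixed number of poisoned examples between the two stages.

    sft_fraction = 1.0 or 0.0 concentrates the budget in a single stage;
    values in between distribute it. (Assumed parameterization.)
    """
    if total_budget < 0:
        raise ValueError("total_budget must be non-negative")
    if not 0.0 <= sft_fraction <= 1.0:
        raise ValueError("sft_fraction must lie in [0, 1]")
    sft_budget = round(total_budget * sft_fraction)
    return sft_budget, total_budget - sft_budget


def make_plan(n_sft: int, n_pref: int, total_budget: int,
              sft_fraction: float, seed: int = 0) -> PoisonPlan:
    """Pick which examples to poison in each dataset, uniformly at random."""
    sft_budget, pref_budget = split_poison_budget(total_budget, sft_fraction)
    if sft_budget > n_sft or pref_budget > n_pref:
        raise ValueError("budget exceeds dataset size")
    rng = random.Random(seed)  # seeded so the plan is reproducible
    return PoisonPlan(
        sft_indices=sorted(rng.sample(range(n_sft), sft_budget)),
        pref_indices=sorted(rng.sample(range(n_pref), pref_budget)),
    )
```

Under these assumptions, `make_plan(10_000, 5_000, 200, 0.5)` spends 100 poisoned examples in each stage, while `sft_fraction=1.0` reproduces the single-stage SFT attacker with the same total budget, so the two settings can be compared at equal cost.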




