RogueMerge: Robust and Unified Attacks against LLM Model Merging
Abstract:
Model merging integrates specialized capabilities into a single Large Language Model (LLM) by aggregating task vectors obtained from unverified public repositories, which creates a significant supply-chain vulnerability. Because task vectors can embed malicious behaviors, merging effectively grants third-party inputs direct write access to model weights, allowing attackers to trigger or exacerbate various downstream threats. Previous research has primarily focused on backdoor attacks that target classifiers through static arithmetic heuristics. These methods are ill-suited to generative LLMs because of three fundamental limitations: (i) LLMs decode autoregressively, so the slight parameter shifts caused by merging compound across tokens and rapidly diminish the attack’s efficacy; (ii) attackers have no knowledge of the victim’s specific merging configuration, so isolated, static attack vectors are easily diluted or nullified; and (iii) practical threats must generalize to unseen attack prompts, a requirement that static vectors cannot meet.
We introduce RogueMerge, the first comprehensive framework designed to overcome these three challenges. To counter the compounding effects of autoregressive generation, we replace static arithmetic with a joint optimization process that explicitly enforces attack success after merging. To handle unknown merging parameters, we formulate attack injection as a stochastic min-max problem and solve it through meta-learning-style simulations. To ensure robustness across diverse attack prompts, we apply distributionally robust optimization and derive a tractable first-order Taylor approximation, with a provable error bound, that is suitable for LLMs. Evaluations across four threat types, six merging algorithms, and more than 170 merged LLMs show that RogueMerge consistently outperforms existing attack methods. It also remains stable under varying merging conditions and resists conventional defensive measures.
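For readers unfamiliar with the merging step the abstract refers to, the following is a minimal sketch of plain task-arithmetic merging, in which each task vector is the difference between fine-tuned and base weights and the merged model is the base plus a weighted sum of those vectors. It is an illustration under stated assumptions, not the paper's method: the abstract does not name the six merging algorithms it evaluates, and the function names, the `Weights` type, and the toy values here are all hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical toy representation: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Task vector = fine-tuned weights minus base weights, per parameter."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge(base: Weights, vectors: Sequence[Weights], coeffs: Sequence[float]) -> Weights:
    """Task-arithmetic merge: base + sum_i coeffs[i] * vectors[i]."""
    if len(vectors) != len(coeffs):
        raise ValueError("need one coefficient per task vector")
    merged = {name: list(values) for name, values in base.items()}
    for vec, lam in zip(vectors, coeffs):
        for name, delta in vec.items():
            merged[name] = [m + lam * d for m, d in zip(merged[name], delta)]
    return merged


# Toy example with a single three-element parameter (assumed values).
base = {"layer.weight": [1.0, 2.0, 3.0]}
math_model = {"layer.weight": [1.5, 2.0, 2.0]}
code_model = {"layer.weight": [1.0, 4.0, 3.0]}

vectors = [task_vector(math_model, base), task_vector(code_model, base)]
merged = merge(base, vectors, coeffs=[0.5, 0.5])
print(merged)  # {'layer.weight': [1.25, 3.0, 2.5]}
```

Every contributed vector is added directly to the base weights, scaled by coefficients that the party performing the merge chooses. This is the sense in which third-party inputs gain write access to the model, and also why a contributor cannot know in advance how strongly their vector will be weighted.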
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



