Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs
Title: Learning from Errors: Tree-like Self-Play for Enhancing Security in Code LLMs
Abstract
Although Large Language Models (LLMs) have achieved remarkable proficiency in generating code, they frequently reproduce subtle but critical security vulnerabilities inherent in their training datasets. Existing alignment strategies, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), generally rely on coarse-grained optimization at the sequence level. Consequently, these methods often struggle to resolve the localized nature of security defects, where a single erroneous token selection can jeopardize the integrity of an entire program.
To address this limitation, we propose Tree-like Self-Play (TSP), a novel framework that reimagines secure code generation as a fine-grained sequential decision-making process. Rather than simply maximizing likelihood as conventional approaches do, TSP builds a decision tree that enables the model to explore diverse branching trajectories, generating both secure "golden paths" and vulnerable alternatives. By framing code generation as a self-play game, TSP trains the model to identify and penalize its own localized mistakes. This mechanism yields a dense, on-policy learning signal that drives self-correction specifically at the critical decision nodes where vulnerabilities are most likely to arise.
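To make the idea of branching trajectories concrete, here is a minimal sketch of how sampled completions could be merged into a prefix tree, with preference pairs read off at the nodes where a secure branch and a vulnerable branch diverge. It is an illustration under stated assumptions, not the authors' implementation: the names `Node`, `build_tree`, and `branch_pairs`, the coarse token splitting, and the security labels are all hypothetical, and the abstract does not specify how TSP constructs its tree or its training objective.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """One decision node in a (hypothetical) trajectory tree."""
    children: Dict[str, "Node"] = field(default_factory=dict)
    secure_leaves: int = 0      # secure completions passing through this node
    vulnerable_leaves: int = 0  # vulnerable completions passing through this node


def build_tree(trajectories: List[Tuple[List[str], bool]]) -> Node:
    """Merge (tokens, is_secure) trajectories into a shared prefix tree."""
    root = Node()
    for tokens, is_secure in trajectories:
        node = root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())
            if is_secure:
                node.secure_leaves += 1
            else:
                node.vulnerable_leaves += 1
    return root


def branch_pairs(
    node: Node, prefix: Tuple[str, ...] = ()
) -> List[Tuple[Tuple[str, ...], str, str]]:
    """Collect (prefix, preferred_token, rejected_token) triples at every node
    where a purely secure child and a purely vulnerable child diverge."""
    pairs = []
    good = [t for t, c in node.children.items()
            if c.secure_leaves and not c.vulnerable_leaves]
    bad = [t for t, c in node.children.items()
           if c.vulnerable_leaves and not c.secure_leaves]
    for g in good:
        for b in bad:
            pairs.append((prefix, g, b))
    for tok, child in node.children.items():
        pairs.extend(branch_pairs(child, prefix + (tok,)))
    return pairs


# Two toy trajectories that share a prefix and then diverge (assumed labels):
# a parameterized SQL call (secure) and string interpolation (vulnerable).
trajectories = [
    (["cursor.execute(", "query,", "params)"], True),
    (["cursor.execute(", "query", "%", "user_input)"], False),
]
pairs = branch_pairs(build_tree(trajectories))
print(pairs)
```

In this toy case the only pair is emitted at the node after `cursor.execute(`, which is where the two completions first differ. That is the sense in which a tree localizes the learning signal to a single decision rather than to the whole sequence.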
Experimental results indicate that TSP significantly improves model reliability. On Python security benchmarks, TSP elevates the pass rate (SPR@1) for CodeLlama-7B to 75.8%, a substantial improvement over both SFT (57.0%) and unstructured self-play baselines. Moreover, TSP fosters robust out-of-distribution generalization. The model reduces vulnerabilities in previously unseen Common Weakness Enumeration (CWE) categories by 24.5% and successfully transfers security principles acquired from C/C++ to other languages such as Python, Go, and JavaScript. These findings suggest that TSP goes beyond simple memorization of patches, instead enabling the model to internalize abstract, language-agnostic security logic.
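As a reading aid for the headline numbers, the sketch below computes a pass-rate-at-1 style metric and the gap between the two reported scores. The abstract does not define SPR@1, so the function `pass_rate_at_1` encodes an assumption: one sampled completion per problem, scored as a boolean "passes the security check", with the metric being the percentage of passes. Only the values 75.8 and 57.0 come from the text.

```python
from typing import List


def pass_rate_at_1(passed: List[bool]) -> float:
    """Percentage of problems whose single sampled completion passed.

    Assumed definition; the abstract does not spell out SPR@1.
    """
    if not passed:
        raise ValueError("need at least one result")
    return 100.0 * sum(passed) / len(passed)


# Toy usage with made-up outcomes for four problems.
toy_rate = pass_rate_at_1([True, True, False, True])

# Reported SPR@1 values for CodeLlama-7B (from the abstract).
tsp_spr, sft_spr = 75.8, 57.0
absolute_gain = round(tsp_spr - sft_spr, 1)           # percentage points
relative_gain = round(100.0 * (tsp_spr - sft_spr) / sft_spr, 1)  # percent of SFT score

print(toy_rate, absolute_gain, relative_gain)
```

Under these assumptions, the reported improvement over SFT is 18.8 percentage points, or roughly a one-third relative increase.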
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



