Synthetic Hallucinations, Real Gains: Hard Negatives from Frontier Models for FIM Hallucination Mitigation
Abstract
Small open-source code models used for IDE autocomplete frequently generate hallucinated Fill-in-the-Middle (FIM) completions. These outputs often look syntactically correct yet invoke methods, parameters, variables, and imports that are absent from the surrounding project context. Existing mitigation strategies face significant limitations: they either rely on per-language execution sandboxes, which are impractical for mid-keystroke suggestions, or depend on preference-optimization pipelines that require extensive human-labeled datasets.
To address these challenges, we introduce an execution-free approach that uses frontier code models to generate plausible yet incorrect completions that serve as hard negatives. Contrasting these synthetic hallucinations with the actual developer edits yields a supervised fine-tuning signal. We scrape multilingual FIM contexts from public GitHub repositories across eight languages, then prompt a panel of three frontier models to generate one hard negative per context for four specific hallucination types, based on the taxonomy of Delulu, a Docker-verified multilingual FIM hallucination benchmark. The result is a paired dataset of chosen (correct) and rejected (hallucinated) examples.
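As a rough illustration of what one paired record might look like, the sketch below builds a FIM prompt and two training-row layouts from a single example. The field names, the `hallucination_type` label, and both row layouts are hypothetical and not taken from the released pipeline. The special tokens are the FIM markers used by Qwen2.5-Coder; other model families use different ones. How the rejected completions enter the fine-tuning objective is not specified in this abstract, so the sketch only shows the data shapes.

```python
from dataclasses import dataclass

# FIM special tokens used by Qwen2.5-Coder (prefix-suffix-middle order).
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


@dataclass
class PairedExample:
    """One FIM context with a correct and a hallucinated middle (hypothetical schema)."""
    prefix: str
    suffix: str
    chosen: str              # the actual developer edit
    rejected: str            # synthetic hard negative from a frontier model
    hallucination_type: str  # illustrative label, e.g. "nonexistent_method"


def to_fim_prompt(ex: PairedExample) -> str:
    """Wrap the surrounding code in FIM markers; the model completes the middle."""
    return f"{FIM_PREFIX}{ex.prefix}{FIM_SUFFIX}{ex.suffix}{FIM_MIDDLE}"


def to_sft_row(ex: PairedExample) -> dict:
    """Plain supervised row: the prompt paired with the correct completion only."""
    return {"prompt": to_fim_prompt(ex), "completion": ex.chosen}


def to_preference_row(ex: PairedExample) -> dict:
    """Preference row in the common prompt/chosen/rejected layout used for DPO-style training."""
    return {"prompt": to_fim_prompt(ex), "chosen": ex.chosen, "rejected": ex.rejected}


example = PairedExample(
    prefix="user = repo.",
    suffix="(user_id)\n",
    chosen="get_by_id",
    rejected="fetch_user",  # plausible, but assumed absent from the project
    hallucination_type="nonexistent_method",
)
```

Keeping both completions in one record means the same scraped context can feed either a supervised or a preference-based recipe without re-running generation.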
Fine-tuning Qwen2.5-Coder-7B-Instruct on a curated 100K-row subset of this data raises Delulu exact-match accuracy by 18.8 points and edit similarity by 0.22 across all languages and hallucination types. It also improves performance on every HumanEval-Infilling split and every SAFIM subset. Applying the same strategy to the 3B model yields a 12.8-point gain in Delulu exact match, accompanied by a minor, well-characterized trade-off in general FIM capabilities.
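For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them. It assumes exact match compares whitespace-stripped strings and that edit similarity is one minus the character-level Levenshtein distance normalized by the longer string, giving a score between 0 and 1. The benchmark's actual normalization may differ; these definitions are assumptions, not taken from the paper.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """True when the completions are identical after stripping outer whitespace."""
    return prediction.strip() == reference.strip()


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def edit_similarity(prediction: str, reference: str) -> float:
    """1.0 for identical strings, 0.0 when every character differs."""
    prediction, reference = prediction.strip(), reference.strip()
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / longest
```

Exact match rewards only fully correct completions, while edit similarity gives partial credit, which is why the two are usually reported together.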
We run ablations along five axes (model size, type mix, language coverage, base-model family, and a difficulty-aware fool rate) and directly compare supervised fine-tuning (SFT) against DPO/ORPO to identify the design choices that drive these gains. To ensure reproducibility, we release the complete pipeline source code, including generation tools, LLM-based fool-rate judging, data curation scripts, and the FIM fine-tuning recipe, so that any researcher can replicate our experiments end-to-end using any permissively licensed corpus.
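The abstract does not define the fool rate, so the following is only one plausible reading: the fraction of LLM judge verdicts that mistake a hallucinated completion for a correct one, used to keep negatives within a chosen difficulty band. The function names, the row layout, and the thresholds are all hypothetical.

```python
def fool_rate(judge_fooled: list[bool]) -> float:
    """Share of judge verdicts that accepted the hallucinated completion as correct."""
    if not judge_fooled:
        raise ValueError("at least one judge verdict is required")
    return sum(judge_fooled) / len(judge_fooled)


def filter_by_difficulty(rows: list[dict], min_rate: float, max_rate: float) -> list[dict]:
    """Keep rows whose negatives fall inside the requested fool-rate band (inclusive)."""
    return [
        row for row in rows
        if min_rate <= fool_rate(row["judge_fooled"]) <= max_rate
    ]


# Hypothetical rows: each records whether each judge call was fooled.
rows = [
    {"id": "easy", "judge_fooled": [False, False, False, False]},
    {"id": "medium", "judge_fooled": [True, False, True, False]},
    {"id": "hard", "judge_fooled": [True, True, True, True]},
]
kept = filter_by_difficulty(rows, min_rate=0.5, max_rate=1.0)
```

Under this reading, a higher fool rate marks a harder negative, and the band lets a curation step drop negatives that are trivially wrong.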
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



