arXiv

SDR: Set-Distance Rewards for Radiology Report Generation

June 2, 2026 · Halil Ibrahim Gulluk, Max Van Puyvelde, Wim Van Criekinge, Olivier Gevaert

Abstract:

Reinforcement learning techniques leveraging verifiable rewards have significantly propelled the reasoning capabilities of vision-language models. Nevertheless, applying these methods to chest X-ray report generation presents a unique challenge: standard reward mechanisms, such as exact-match accuracy or step-level process evaluation, are ill-suited. This incompatibility arises because radiology reports comprise unordered and orthogonal findings, rather than following a linear causal reasoning chain.

To bridge this gap, we introduce a set-based approach. In this framework, each report is decomposed into individual sentences and processed by a frozen sentence transformer to create unordered sets of embeddings. We propose utilizing set-to-set distances between the embeddings of generated and reference reports as continuous, permutation-invariant rewards.
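The reward described above can be sketched in a few lines. The abstract does not say which set-to-set distance is used, so this sketch assumes a symmetric Chamfer distance over cosine distances; the function names are hypothetical, and the toy 2-D vectors stand in for the output of a frozen sentence transformer (embeddings are assumed to be non-zero).

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def chamfer_distance(set_a, set_b):
    """Symmetric Chamfer distance between two non-empty sets of vectors.

    Assumption: the source only says "set-to-set distances"; Chamfer is
    one common choice, not necessarily the one used by the authors.
    """
    a_to_b = sum(min(cosine_distance(a, b) for b in set_b) for a in set_a) / len(set_a)
    b_to_a = sum(min(cosine_distance(a, b) for a in set_a) for b in set_b) / len(set_b)
    return 0.5 * (a_to_b + b_to_a)


def set_distance_reward(generated, reference):
    """Continuous reward: a smaller distance gives a larger reward."""
    return -chamfer_distance(generated, reference)


# Toy sentence embeddings (stand-ins for sentence-transformer outputs).
reference = [[1.0, 0.0], [0.0, 1.0]]
same_reordered = [[0.0, 1.0], [1.0, 0.0]]  # same findings, different order
partial = [[1.0, 0.0]]                     # one finding missing

reward_reordered = set_distance_reward(same_reordered, reference)
reward_partial = set_distance_reward(partial, reference)
```

In this toy case the reordered report receives the maximal reward, illustrating permutation invariance, while the report with a missing finding is penalized through the reference-to-generated term.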

Our experiments across two distinct datasets and three vision-language models (Qwen3-VL-2B, Qwen3-VL-4B, and Gemma3-4B) demonstrate that post-training with these set-to-set distance rewards, implemented via GRPO, consistently surpasses both supervised fine-tuning and exact-match GRPO. Specifically, we observed average relative improvements of 6.80% in BERTScore, 7.82% in RadGraph F1, and 4.45% in CheXbert F1.

Furthermore, these set distances facilitate test-time best-of-$N$ selection. By scoring candidate outputs based on their proximity to training-report embeddings, we achieved superior performance compared to random selection. This advantage extended to our trained models as well as three closed-source large language models (Mistral-Small, Gemini-2.5 Flash-Lite, and GPT-4o-mini), yielding an average relative improvement of 16.4% on BERTScore.
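A minimal sketch of this reference-free selection step follows. It assumes that "proximity to training-report embeddings" means the set distance to the nearest training report, and it reuses a Chamfer distance over cosine distances as a stand-in for whichever set distance the authors use; all names and the toy embeddings are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def chamfer_distance(set_a, set_b):
    """Symmetric Chamfer distance between two non-empty sets of vectors."""
    a_to_b = sum(min(cosine_distance(a, b) for b in set_b) for a in set_a) / len(set_a)
    b_to_a = sum(min(cosine_distance(a, b) for a in set_a) for b in set_b) / len(set_b)
    return 0.5 * (a_to_b + b_to_a)


def score_candidate(candidate, training_sets):
    """Distance from a candidate to its nearest training report (assumed scoring rule)."""
    return min(chamfer_distance(candidate, t) for t in training_sets)


def best_of_n(candidates, training_sets):
    """Return the index of the lowest-distance candidate and all scores."""
    scores = [score_candidate(c, training_sets) for c in candidates]
    best_index = min(range(len(candidates)), key=scores.__getitem__)
    return best_index, scores


# Toy data: each report is a set of sentence embeddings.
training_sets = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0]],
]
candidates = [
    [[-1.0, 0.0]],              # far from every training report
    [[1.0, 0.0], [0.0, 1.0]],   # matches the first training report
    [[1.0, 0.2]],               # somewhere in between
]

best_index, scores = best_of_n(candidates, training_sets)
```

No reference report for the test image is needed here; the candidates are ranked only against embeddings of training reports, which is what makes the score usable at test time.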

When employed as a streaming signal, these rewards enable a more efficient method of test-time scaling. Pruning low-scoring candidates during generation reduces the total number of generated tokens by more than 50%, while maintaining Findings quality equivalent to that of full best-of-$N$ selection. Collectively, these findings position set-distance rewards as a unified signal for enhancing both post-training and test-time scaling in the domain of chest X-ray report generation. Our code is publicly available at https://anonymous.4open.science/r/Set-Distance-Rewards-CXR-BFDA.
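One way such streaming pruning could work is sketched below. The abstract does not describe the pruning schedule, so this sketch makes several assumptions: candidates are scored after every generated sentence, the worse half is dropped at each check, and partial reports are scored with a one-directional distance (candidate to nearest training report) so that unfinished candidates are not penalized for being short. Generation is simulated by revealing precomputed toy embeddings one sentence at a time, and sentences are counted as a stand-in for tokens; all names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def directed_score(partial, training_sets):
    """Mean nearest-sentence distance from a partial report to its closest training report."""
    return min(
        sum(min(cosine_distance(s, t) for t in train) for s in partial) / len(partial)
        for train in training_sets
    )


def streaming_prune(candidate_streams, training_sets, keep_fraction=0.5):
    """Simulate generation sentence by sentence, pruning low-scoring candidates.

    Returns the index of the surviving candidate and the number of
    sentences generated in total. Streams are assumed to be non-empty.
    """
    alive = list(range(len(candidate_streams)))
    partial = {i: [] for i in alive}
    generated = 0
    for step in range(max(len(s) for s in candidate_streams)):
        # "Generate" one more sentence for every surviving candidate.
        for i in alive:
            if step < len(candidate_streams[i]):
                partial[i].append(candidate_streams[i][step])
                generated += 1
        # Keep only the best-scoring fraction of the survivors.
        if len(alive) > 1:
            ranked = sorted(alive, key=lambda i: directed_score(partial[i], training_sets))
            alive = ranked[: max(1, math.ceil(len(ranked) * keep_fraction))]
    best = min(alive, key=lambda i: directed_score(partial[i], training_sets))
    return best, generated


training_sets = [[[1.0, 0.0], [0.0, 1.0]]]
candidate_streams = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],     # close to the training report
    [[-1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]],  # far away
    [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]],     # starts off slightly worse
    [[0.0, -1.0], [-1.0, 0.0], [0.0, -1.0]],  # far away
]

best, generated = streaming_prune(candidate_streams, training_sets)
full_cost = sum(len(s) for s in candidate_streams)
```

In this toy run the pruned search generates fewer sentences than completing every candidate (`generated` versus `full_cost`) while still returning the candidate closest to the training report. The savings here depend entirely on the toy schedule and say nothing about the figures reported in the paper.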
