When RL Wins the Benchmark but Loses the Patient

This is a TL;DR of our paper “Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients” by Armin Berger, Manuela Bergau, Helen Schneider, Saad Ahmad, Tom Anglim Lagones, Gianluca Brugnara, Martha Foltyn-Dumitru, Kai Schlamp, Philipp Vollmuth, and Rafet Sifa, available on arXiv and to be published in the Proceedings of IJCNN 2026.

Reinforcement learning has been the headline story for LLMs lately: DeepSeek-R1 and friends have shown you can squeeze impressive reasoning out of a model with the right reward signal. A natural next question: does this work for medical imaging too, especially on a tight budget? Our team from Fraunhofer IAIS, the University of Bonn, the Lamarr Institute, University Hospital Bonn, and Queensland Health set out to find out, and the answer turns out to be more complicated than the leaderboards suggest.

The Setup

We built ChexReason, a small vision-language model trained to classify chest X-rays and produce radiologist-style reasoning traces. The recipe is the now-familiar R1 two-step:

1. Supervised fine-tuning (SFT) on 2,000 examples, with reasoning traces generated by Gemini 2.5 and validated by radiologists.
2. GRPO reinforcement learning on another 1,000 examples, with two different reward functions tested.
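
For readers new to GRPO, the core of the second step can be sketched in a few lines: sample several completions for the same X-ray, score each one, and normalize the scores within that group to get advantages. This is a minimal illustration, not our training code. The function names are made up for this post, `label_set_reward` is a hypothetical Jaccard-style reward rather than either of the two reward functions from the paper, and using the population standard deviation is an assumption (implementations differ on this detail).

```python
from statistics import mean, pstdev


def label_set_reward(predicted, reference):
    """Hypothetical reward: Jaccard overlap between predicted and reference labels.

    This is an illustrative stand-in, not one of the rewards used in the paper.
    """
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # both agree there are no findings
    return len(predicted & reference) / len(predicted | reference)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Each completion's advantage is its reward minus the group mean, divided by
    the group standard deviation (population std here, which is an assumption).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled predictions for one image whose reference labels are known.
reference = {"Cardiomegaly", "Lung Opacity"}
sampled_predictions = [
    {"Cardiomegaly", "Lung Opacity"},
    {"Cardiomegaly"},
    {"Pneumonia"},
    set(),
]
rewards = [label_set_reward(p, reference) for p in sampled_predictions]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group average get a positive advantage and are reinforced; those below are pushed down. Nothing in this signal looks beyond the labels of the training dataset, which matters for what follows.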

The whole thing runs on a single A100 GPU: roughly 50× less data and 4× less compute than NVIDIA’s comparable NV-Reason-CXR-3B model. We tested on two out-of-distribution benchmarks: CheXpert (from Stanford) and the NIH Chest X-ray dataset.

The Headline Result

GRPO worked. ChexReason posted a 23% macro-F1 improvement on CheXpert over the SFT checkpoint, nearly matching the MedGemma baseline despite training on a different distribution. Categories like Cardiomegaly and Lung Opacity saw dramatic gains.
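
Since every number in this post is a macro-F1 score, here is a minimal sketch of how that metric works for multi-label chest X-ray classification: compute F1 separately for each finding, then take the unweighted mean, so rare findings count as much as common ones. The function name and the convention of scoring a finding as 0.0 when it has no positives and no predictions are assumptions for illustration, not a description of the paper's evaluation code.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1 scores.

    y_true and y_pred are lists of label sets, one set per image.
    A label with no true positives, false positives, or false negatives
    is scored 0.0 here (an assumed convention).
    """
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = ["Cardiomegaly", "Lung Opacity"]
y_true = [{"Cardiomegaly"}, {"Lung Opacity"}, {"Cardiomegaly", "Lung Opacity"}]
y_pred = [{"Cardiomegaly"}, set(), {"Cardiomegaly"}]
print(macro_f1(y_true, y_pred, labels))  # 0.5: perfect on one label, zero on the other
```

Because each label contributes equally, a large gain on a couple of categories can move the macro average noticeably even if other categories stay flat.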

So far, so good. Then we evaluated on NIH.

The Catch

On NIH, ChexReason performance dropped 19%, back to baseline. The same RL training that lifted CheXpert scores actively hurt transferability to a dataset that uses different labeling conventions. And here’s the kicker: NVIDIA’s much larger NV-Reason-CXR-3B shows the same pattern, dropping 61% from CheXpert to NIH. This isn’t a small-model artifact; it’s a recurring alignment-versus-transfer trade-off in this RL setting.

Our interpretation: GRPO isn’t really teaching the model to read chest X-rays better. It’s teaching the model to predict CheXpert-style labels, exploiting label conventions, annotation quirks, and dataset-specific patterns that don’t carry across institutions.

The Paradox

The most interesting finding is what happened before RL: the SFT-only checkpoint was the only model that improved on NIH (0.282 → 0.299 macro-F1), even though its CheXpert numbers were the weakest. Teacher-guided reasoning traces seem to encode something more institution-agnostic than reward-optimized outputs do, possibly because Gemini’s traces emphasize diagnostic principles rather than the shortcut features RL learns to exploit.
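
To put the NIH numbers in perspective, the small helper below separates the absolute change in macro-F1 from the relative change, using the two scores quoted above. The function name is made up for this post.

```python
def score_change(before, after):
    """Return (absolute change, relative change) between two scores."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


# SFT-only checkpoint on NIH, macro-F1 before and after fine-tuning.
abs_gain, rel_gain = score_change(0.282, 0.299)
print(f"absolute: {abs_gain:+.3f}, relative: {rel_gain:+.1%}")
# prints "absolute: +0.017, relative: +6.0%"
```

The gain is modest, but it is the only one that points in the right direction on NIH.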

A Bonus Finding on Prompting

We also tested how instruction format interacts with medical pre-training, and the result is a clean reversal:

Qwen2.5-VL-3B (general-purpose): Best with structured, 12-step clinical reasoning scaffolds. Without medical pre-training, it needs the explicit guidance.
MedGemma-4B (medically pre-trained): Best with direct label prediction or free-form reasoning. The structured scaffold is redundant because the model has already internalized those patterns.

Practical takeaway: structured reasoning prompts compensate for missing domain knowledge, but can hurt models that don’t need the crutch.

Why It Matters

For anyone deploying small medical VLMs in the real world, this is a useful warning. Benchmark-driven RL fine-tuning may produce numbers that look great in a paper but degrade exactly the property you most need clinically: robustness across hospitals, scanners, and labeling conventions. Under tight resource constraints, carefully curated SFT may actually outperform aggressive RL for clinical deployment.

That doesn’t mean RL is doomed for medical imaging, but it does mean “improved benchmark score” and “better diagnostic model” aren’t the same thing, and we probably need reward formulations, multi-dataset curricula, or architectural changes that explicitly target generalization rather than single-benchmark wins.

The one-line summary

TL;DR: RL fine-tuning made the model better at the benchmark and worse at the job, and the same thing happens with roughly 50× the data and 4× the compute, so the problem is the recipe, not the resources.