We measured it - paired against real-data-only baselines, over 7 to 15 seeds, on a frozen test set, across eight industries and three imaging types, from color photos to X-ray.
The result
On the single hardest, most starved class in each dataset, adding Synthgen synthetic data to the same real data lifted accuracy by 9 to 35 percentage points.
Per-class accuracy on the hardest class in each dataset. Paired vs real-only, 7-15 seeds, frozen test split, placebo-controlled. z = standard deviations above the paired baseline.
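For readers who want to run the same kind of check on their own results, here is a minimal sketch of a paired comparison over seeds. It assumes z means the mean per-seed lift divided by its standard error, which is one plausible reading of the caption rather than a confirmed definition. The function name paired_lift and the input numbers are hypothetical and are not benchmark results.

```python
from math import sqrt
from statistics import mean, stdev


def paired_lift(real_only, with_synth):
    """Return (mean lift, z) for per-seed accuracies paired by seed."""
    if len(real_only) != len(with_synth) or len(real_only) < 2:
        raise ValueError("need two equal-length lists with at least 2 seeds")
    # Pair run i of the real-only model with run i of the augmented model.
    diffs = [s - r for r, s in zip(real_only, with_synth)]
    lift = mean(diffs)
    spread = stdev(diffs)  # sample standard deviation of the paired lifts
    # Assumption: z is the mean lift divided by its standard error.
    z = lift / (spread / sqrt(len(diffs))) if spread > 0 else float("inf")
    return lift, z


# Hypothetical per-seed hardest-class accuracies in percent (7 seeds).
# Illustrative inputs only, NOT figures from the benchmark.
real_only = [40.0, 42.0, 41.0, 39.0, 43.0, 40.0, 42.0]
with_synth = [52.0, 55.0, 51.0, 50.0, 56.0, 53.0, 54.0]

lift, z = paired_lift(real_only, with_synth)
print(f"mean lift: {lift:.1f} points, z: {z:.1f}")
```

Pairing by seed removes run-to-run noise that both models share, so the spread that matters is the spread of the differences, not of the raw accuracies.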
Data efficiency
With Synthgen, two real labelled examples per class reach roughly the same accuracy as five real ones alone - 60% fewer labels. The gap is widest exactly when your real data is scarcest.
MVTec Pill: accuracy vs. real labelled images per class, paired. With Synthgen, 2 labels per class (44.6%) come within a point of 5 real labels alone (45.4%) - and the lift is largest exactly when real data is scarcest.
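The 60% figure follows directly from the label counts in the caption. A short sketch of the arithmetic, using only the numbers quoted above (the function name label_saving is a hypothetical helper for illustration):

```python
def label_saving(labels_with_synth, labels_real_only):
    """Fraction of real labels saved at roughly matched accuracy."""
    return (labels_real_only - labels_with_synth) / labels_real_only


# Figures quoted in the MVTec Pill caption.
saving = label_saving(2, 5)   # 2 labels with Synthgen vs 5 real labels alone
gap = 45.4 - 44.6             # accuracy gap in percentage points

print(f"labels saved: {saving:.0%}")
print(f"remaining accuracy gap: {gap:.1f} points")
```

The match is approximate rather than exact: the two-label run with Synthgen sits a little under one percentage point below the five-label real-only run.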
“Every hardest-class gain held up paired over 7 to 15 random seeds, on a frozen test set we never touched during training.”
Synthgen internal benchmark
Where it fits
Synthetic data pays off exactly where real data is scarce and the class is hard - the rare, costly classes you can't collect enough of.
The appearance-based classes starved of real examples - exactly where models fail and where it costs you most.
The less real data you have, the more synthetic data is worth - a working model before you have collected hundreds of examples.
Pinpointing where the hard class appears in the image, on exactly the starved cases the real-only model localizes worst.
Run it on your data
Bring your most-confused, least-labelled class. We will show you the gain on a pilot.