Comparison of average AUC versus fairness (AUC) gap across different baselines for radiology with respect to race, on in-distribution (n = 23,261 samples) and OOD (n = 17,723 samples) datasets. Race labels are not available for the OOD dataset. For a, we report results in-distribution (left) and OOD (right) on the CheXpert and ChestX-ray14 datasets, respectively. The baseline ‘Pretrained on JFT’ is marked in black. ‘Label conditioning’ corresponds to the model that used synthetic images from a diffusion model conditioned only on the diagnostic labels. We further compared against other strong contenders: a BiT-ResNet model pretrained on ImageNet-21K (Pretrained on IN-21K), a model pretrained on JFT using RandAugment heuristic augmentations (RandAugment), a model trained with RandAugment on top of standard ImageNet augmentations (RandAugment + IN Augms) and a model trained with focal loss (Focal loss). To ensure a fair comparison, all methods were trained and fine-tuned for the same number of steps and with the same batch size. For the fairness gap, smaller values are preferable. Data are presented as the mean ± s.d. across five technical replicates.