OOD results on n = 85,054 samples. Prediction accuracy (x axis) on the validation and test hospitals is shown when the generative model is trained on all in-distribution labeled examples. Note that the validation set was used for model selection because its distribution was more similar to the training distribution. We compared the following methods: Baseline, a model with no augmentations; Color augm., a model that uses color augmentations; Label conditioning and Label and property conditioning, our proposed approach with a generative model conditioned on the diagnostic label alone and on both the diagnostic label and the hospital ID, respectively; and L cond. + color augm. and L and P cond. + color augm., which apply color augmentations to the images generated with the diffusion models. Combining color augmentation with synthetic data performed best across all settings. Data are presented as mean ± s.d. across five technical replicates.