We varied the ID model-selection criterion and compared the selected model against an oracle that chooses the model that is most fair OOD. We plotted the increase in the OOD fairness gap of the selected model over the oracle, averaged across 42 combinations of OOD dataset, task and attribute. We used non-parametric bootstrap sampling (n = 1,000) to define the bootstrap distribution of this metric. We found that selection criteria that choose models with minimal attribute encoding achieve better OOD fairness than naively selecting on ID fairness or other aggregate performance metrics ('Minimum Attribute Prediction Accuracy' versus 'Minimum Fairness Gap': P = 9.60 × 10^−94, one-tailed Wilcoxon rank-sum test; 'Minimum Attribute Prediction AUROC' versus 'Minimum Fairness Gap': P = 1.95 × 10^−12, one-tailed Wilcoxon rank-sum test).
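To make the evaluation procedure concrete, the following is a minimal sketch, in Python, of how the bootstrap distribution of the oracle regret could be formed and how two selection criteria could be compared with a one-tailed Wilcoxon rank-sum test. All variable names and the synthetic fairness-gap values are hypothetical illustrations; the actual computation of fairness gaps and the model-selection procedure are not reproduced here.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)

def bootstrap_regret(gap_selected, gap_oracle, n_boot=1_000):
    """Bootstrap distribution of the mean increase in OOD fairness gap
    (selected model minus oracle) across dataset/task/attribute combinations."""
    regret = np.asarray(gap_selected) - np.asarray(gap_oracle)
    # Resample the 42 combinations with replacement, n_boot times.
    idx = rng.integers(0, len(regret), size=(n_boot, len(regret)))
    return regret[idx].mean(axis=1)  # one mean regret per bootstrap resample

# Hypothetical per-combination OOD fairness gaps (42 combinations of
# OOD dataset, task and attribute) for the oracle and two criteria.
gap_oracle = rng.uniform(0.00, 0.05, size=42)
gap_min_attr_acc = gap_oracle + rng.uniform(0.00, 0.04, size=42)  # 'Minimum Attribute Prediction Accuracy'
gap_min_id_fair = gap_oracle + rng.uniform(0.02, 0.08, size=42)   # 'Minimum Fairness Gap' (ID)

boot_attr = bootstrap_regret(gap_min_attr_acc, gap_oracle)
boot_fair = bootstrap_regret(gap_min_id_fair, gap_oracle)

# One-tailed Wilcoxon rank-sum test: is the regret under the
# attribute-encoding criterion stochastically smaller than under
# ID-fairness-based selection?
stat, p = ranksums(boot_attr, boot_fair, alternative="less")
print(f"P = {p:.2e}")
```

The test operates on the two bootstrap distributions of the averaged regret, which is why such extreme P values are attainable from only 42 underlying combinations; the synthetic gaps above exist only so that the script runs end to end.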