Schematic representation of the acNMF approach for recovering gene expression programs (GEPs) in a single-cell RNA-seq dataset, denoted Y. The dataset is (i) randomly split 50/50, then (ii) each split is factorized independently using cNMF into a basis matrix of GEPs, W, and a GEP activity matrix, H, over a range of values of k (e.g., 5–200). At each k, the redundant GEPs from the two splits are (iii) reaggregated by a community detection algorithm. The number of communities is then (iv) determined over a range of values of k and a range of Jaccard length values; these parameters are chosen to maximize the number of GEP communities replicated in both splits. Critically, the number of ground truth GEPs is typically smaller than the value of k (GEPs < k; hypothetical example highlighted in blue).
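The workflow in the schematic can be illustrated with a minimal sketch under several stated assumptions. It is not the acNMF implementation: a plain multiplicative-update NMF stands in for cNMF, a simple best-match count stands in for the community detection step, and "Jaccard length" is assumed to mean the number of top-ranked genes per GEP used in the Jaccard comparison. The orientation Y ≈ HW (cells in rows, genes in columns), the function names, the similarity threshold, and the random toy data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def split_cells(Y, seed=0):
    """Step (i): randomly split the cells (rows of Y) into two halves."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(Y.shape[0])
    half = Y.shape[0] // 2
    return Y[idx[:half]], Y[idx[half:]]

def nmf(Y, k, n_iter=200, seed=0, eps=1e-9):
    """Step (ii), simplified: multiplicative-update NMF, Y ~ H @ W.

    Stand-in for cNMF (assumption). W is (k x genes), H is (cells x k).
    """
    rng = np.random.default_rng(seed)
    n_cells, n_genes = Y.shape
    H = rng.random((n_cells, k)) + eps
    W = rng.random((k, n_genes)) + eps
    for _ in range(n_iter):
        H *= (Y @ W.T) / (H @ W @ W.T + eps)
        W *= (H.T @ Y) / (H.T @ H @ W + eps)
    return W, H

def top_genes(W, jaccard_length):
    """Set of the top-ranked gene indices for each GEP (row of W)."""
    order = np.argsort(-W, axis=1)[:, :jaccard_length]
    return [set(row.tolist()) for row in order]

def count_replicated(W_a, W_b, jaccard_length, threshold):
    """Steps (iii)-(iv), simplified: count GEPs from split A whose best
    Jaccard similarity to any GEP from split B reaches the threshold.
    This replaces the community detection step (assumption)."""
    sets_a = top_genes(W_a, jaccard_length)
    sets_b = top_genes(W_b, jaccard_length)
    n = 0
    for a in sets_a:
        best = max(len(a & b) / len(a | b) for b in sets_b)
        if best >= threshold:
            n += 1
    return n

# Toy data only: 40 "cells" x 30 "genes" of random non-negative values.
Y = np.random.default_rng(1).random((40, 30))
Y_a, Y_b = split_cells(Y)
W_a, H_a = nmf(Y_a, k=4)
W_b, H_b = nmf(Y_b, k=4, seed=1)
n_rep = count_replicated(W_a, W_b, jaccard_length=10, threshold=0.5)
print("GEPs from split A with a match in split B:", n_rep)
```

In a full analysis, this count would be computed over a grid of k and Jaccard length values, and the combination giving the most replicated GEP communities would be selected.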