Correlation of vocabulary characteristics and genome biology annotations of the vocabulary with the W2V embedding derived from the GROVER vocabulary. Shown are the variance explained by each of the first 20 principal components (PCs) of a PCA, together with the Spearman correlation of the PCs with token characteristics and with the percentage of tokens of a specific token sequence that belong to genome annotation categories. Gene element annotations (gene promoters, defined as transcriptional start site ±1 kb; 5′ and 3′ untranslated regions; exons; introns; gene downstream regions, 10 kb; and distal intergenic regions), as well as gene strand, coding sequence strand, chromatin colours, replication timing and replication strand, were corrected for GC content using linear regression, as marked with an asterisk.
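
The analysis summarised in this caption can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline. The caption does not say how the regression was applied, so the sketch assumes that "corrected for GC content" means regressing each annotation on per-token GC content and keeping the residuals. The names `pca`, `regress_out`, `pc_feature_correlations`, `gc_content` and `promoter_pct`, and all array sizes, are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr


def pca(embedding, n_components=20):
    """Return PC scores and explained-variance ratios, computed via SVD."""
    centered = embedding - embedding.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]


def regress_out(feature, covariate):
    """Residuals of `feature` after an ordinary least-squares fit on `covariate`.

    Assumption: this is one common reading of "corrected for GC content
    using linear regression"; the caption does not specify the procedure.
    """
    design = np.column_stack([np.ones_like(covariate), covariate])
    coef, *_ = np.linalg.lstsq(design, feature, rcond=None)
    return feature - design @ coef


def pc_feature_correlations(scores, feature):
    """Spearman correlation between each PC and a per-token feature."""
    rhos = []
    for k in range(scores.shape[1]):
        rho, _ = spearmanr(scores[:, k], feature)
        rhos.append(rho)
    return np.array(rhos)


# Synthetic stand-ins (hypothetical): one embedding row per vocabulary token.
rng = np.random.default_rng(0)
n_tokens, dim = 500, 64
embedding = rng.normal(size=(n_tokens, dim))
gc_content = rng.uniform(0, 1, size=n_tokens)
# A made-up annotation percentage that partly tracks GC content.
promoter_pct = 30 * gc_content + rng.normal(scale=5, size=n_tokens)

scores, explained = pca(embedding, n_components=20)
promoter_corrected = regress_out(promoter_pct, gc_content)
rho_raw = pc_feature_correlations(scores, promoter_pct)
rho_corrected = pc_feature_correlations(scores, promoter_corrected)
```

Here `explained` corresponds to the variance-explained track, and `rho_raw` and `rho_corrected` correspond to one row of the correlation heatmap without and with the GC correction, respectively.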