Overview of multimodal data registration. (i) Each modality, A and B, is passed through its respective landmark detector, and a latent representation of modality A is saved. (ii) Landmarks are extracted from the heatmaps by taking the pixel with the highest intensity in each individual heatmap (the spatial argmax operation). The samples from the two modalities are then registered. (iii) A latent representation is obtained from modality B. (iv) The similarity between the latent representations of modalities A and B is maximized. This optimization drives two training objectives: first, ensuring that the landmark detectors identify corresponding landmarks across modalities, which leads to accurate registration; and second, embedding both modalities into a joint latent space. Because the distance between the two modalities' latent representations is included in the loss function and minimized, both objectives are refined progressively during training.
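As an illustration of steps (ii) and (iv), the following is a minimal NumPy sketch, not the implementation used here. The function names `spatial_argmax` and `latent_distance`, the heatmap layout (K, H, W), and the choice of squared Euclidean distance as the latent distance are all assumptions made for this example; the caption does not specify the distance measure.

```python
import numpy as np


def spatial_argmax(heatmaps):
    """Return the (row, col) of the highest-intensity pixel in each heatmap.

    heatmaps: array of shape (K, H, W), one heatmap per landmark
    (this layout is an assumption for the sketch).
    """
    k, h, w = heatmaps.shape
    # Flatten each heatmap and find the index of its maximum value.
    flat_idx = heatmaps.reshape(k, -1).argmax(axis=1)
    # Convert flat indices back into 2D pixel coordinates.
    rows, cols = np.unravel_index(flat_idx, (h, w))
    return np.stack([rows, cols], axis=1)


def latent_distance(z_a, z_b):
    """Squared Euclidean distance between two latent vectors.

    Hypothetical stand-in for the distance term in the loss; a smaller
    value corresponds to a higher similarity between modalities A and B.
    """
    return float(np.sum((z_a - z_b) ** 2))


# Usage: two 4x5 heatmaps, each with a single peak.
heatmaps = np.zeros((2, 4, 5))
heatmaps[0, 1, 3] = 1.0
heatmaps[1, 2, 0] = 2.0
landmarks = spatial_argmax(heatmaps)  # one (row, col) pair per heatmap
```

In this sketch, `landmarks` holds one coordinate pair per heatmap, which would then feed the registration step, while `latent_distance` plays the role of the term that is minimized during training.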