A sentence of two exemplary words (tokens) is re-represented as a sequence of two semantic embeddings (X, a 4-dimensional semantic embedding), one row for each input word occurrence. Then, based on the two word embeddings from the encoder neural network (cf. Figure 2), self-attention derives three new vectors for each word: a query (Q), a key (K), and a value (V). Each of these three vectors emerges from a matrix multiplication between X and a to-be-estimated weight matrix (W^Q, W^K, and W^V, respectively), whose parameter entries are trained in conjunction with the overall neural network. The resulting query, key, and value vectors typically have smaller dimensionality than the embedding of each word in the input sequence. Regarding representational content, query vectors instantiate a focused subset of the input sequence (the question we are asking), key vectors instantiate the entire input sequence (responses to everything we could ask), and value vectors instantiate the corresponding quantities (the content of all possible answers). Using the queries, keys, and values, the attention mechanism computes weighted attention scores derived from X (how much attention to pay to each response based on its relevance to the question). In the transformer model, this attention mechanism is applied multiple times within each layer and across multiple layers to progressively refine the representations of the input sequence (Figure 2) (source: https://jalammar.github.io/illustrated-transformer/).
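The steps above can be sketched numerically. The snippet below is a minimal illustration, not the full transformer: it assumes two tokens with 4-dimensional embeddings and hypothetical 3-dimensional query/key/value projections, and uses random stand-ins for the weight matrices that would normally be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two tokens, each a 4-dimensional semantic embedding (one row per word).
X = rng.normal(size=(2, 4))

# To-be-estimated projection matrices; random stand-ins for trained weights.
d_k = 3
W_Q = rng.normal(size=(4, d_k))
W_K = rng.normal(size=(4, d_k))
W_V = rng.normal(size=(4, d_k))

Q = X @ W_Q  # queries: the question we are asking
K = X @ W_K  # keys: responses to everything we could ask
V = X @ W_V  # values: the content of all possible answers

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot-product attention: each row holds the attention weights
# (relevance of every key to one query); rows sum to 1.
scores = softmax(Q @ K.T / np.sqrt(d_k))  # shape (2, 2)

# Weighted sum of the values: the refined representation of each token.
output = scores @ V  # shape (2, 3)
```

Stacking such attention operations within and across layers, as the paragraph notes, is what progressively refines the token representations.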