Watching self-attention build a table of scores
Type a short sentence, inspect the token sequence, and edit the sequence embeddings directly. The page computes pairwise dot products between token positions, applies a causal mask that blocks future positions, shows the row-wise softmax that turns the masked scores into attention weights, and finally applies those weights to a value matrix.
Sequence Plot
Use a short input such as "the cat sees the dog".
Sequence Embeddings
Each token position gets a 2D vector. Editing these values recomputes attention and moves the plot.
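A minimal sketch of the editable state, assuming a Python/NumPy back-of-the-envelope (the page's actual code isn't shown, and every number below is made up):

```python
import numpy as np

# Hypothetical 2D embedding per token position for "the cat sees the dog".
# Editing any entry would re-run the rest of the pipeline.
X = np.array([
    [0.2, -0.1],  # the
    [0.9,  0.4],  # cat
    [0.1,  0.8],  # sees
    [0.2, -0.1],  # the
    [0.7, -0.5],  # dog
])
```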
Value Matrix
These are the per-token value vectors V that attention will mix together.
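In the same sketch, V is one row per token (again with invented values):

```python
import numpy as np

# Hypothetical value vectors: attention mixes row V[j] into output i
# with weight A[i, j].
V = np.array([
    [ 1.0,  0.0],  # the
    [ 0.0,  1.0],  # cat
    [ 0.5,  0.5],  # sees
    [-1.0,  0.0],  # the
    [ 0.0, -1.0],  # dog
])
```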
Dot Products
Raw similarity scores under the causal mask, so future tokens are blocked. Entry (i, j) is visible only when token i may attend to token j, i.e. when j ≤ i.
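A sketch of this step in NumPy, assuming the embedding matrix X from above; -inf marks blocked entries so that the later softmax sends them to zero:

```python
import numpy as np

X = np.array([[0.2, -0.1], [0.9, 0.4], [0.1, 0.8]])  # toy 3-token embeddings

# S[i, j] = X[i] . X[j]: the raw pairwise similarity scores.
S = X @ X.T

# Causal mask: token i may attend only to positions j <= i, so set the
# strict upper triangle (future positions) to -inf.
future = np.triu(np.ones_like(S, dtype=bool), k=1)
S_masked = np.where(future, -np.inf, S)
```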
Row-Wise Normalization
This is the causal self-attention table produced from the masked dot products: each row is softmax-normalized only over its unmasked entries, so every row still sums to 1.
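One way to implement that normalization (a sketch, not the page's code). Subtracting the row maximum is the usual stability trick, and exp(-inf) = 0 makes the masked entries drop out before each row renormalizes:

```python
import numpy as np

def masked_softmax_rows(S_masked):
    # The row max is finite because the diagonal is never masked.
    # exp(-inf) = 0, so blocked entries contribute nothing and each
    # row sums to 1 over its unmasked (current and past) positions.
    e = np.exp(S_masked - S_masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```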
A · V
The normalized attention matrix multiplies the value matrix to produce one output vector per token.
Each output row is a weighted average of the value vectors at the positions the causal mask allows: the current token and everything before it.
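A two-token sketch with made-up weights shows the mixing: the matrix product yields one output row per token, and each row is a convex combination of the permitted value vectors:

```python
import numpy as np

A = np.array([[1.0, 0.0],    # token 0 may attend only to itself
              [0.3, 0.7]])   # token 1 mixes tokens 0 and 1
V = np.array([[1.0, 0.0],
              [0.0, 1.0]])

out = A @ V  # out[1] = 0.3 * V[0] + 0.7 * V[1] = [0.3, 0.7]
```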