Watching self-attention build a table of scores

Type a short sentence, inspect the token sequence, and edit the sequence embeddings directly. The page computes causal pairwise dot products, masks out future positions, and then shows the row-wise softmax normalization that turns scores into attention weights before applying those weights to a value matrix.
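As a rough end-to-end sketch of what the page computes, here is the whole pipeline in NumPy. All numbers (embeddings E, values V) are made-up placeholders standing in for the page's editable state, not its actual defaults.

```python
import numpy as np

# Placeholder editable state: a 2D embedding and a value vector per token.
E = np.array([[1.0, 0.0],   # token 0
              [0.5, 1.0],   # token 1
              [0.0, 1.5]])  # token 2
V = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
T = E.shape[0]

# 1. Pairwise dot products: scores[i, j] = E[i] . E[j].
scores = E @ E.T

# 2. Causal mask: position i may only attend to positions j <= i.
causal = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(causal, scores, -np.inf)

# 3. Row-wise softmax: each row becomes weights that sum to 1.
scores -= scores.max(axis=1, keepdims=True)  # numerical stability
A = np.exp(scores)                           # exp(-inf) == 0 kills masked entries
A /= A.sum(axis=1, keepdims=True)

# 4. Apply the weights to the values: one output vector per token.
out = A @ V
```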

Input: sentence to token sequence
Embeddings: editable T × 2 matrix
Values: editable V matrix
Geometry: draggable points in 2D
Attention: causal mask plus row-wise softmax

Sequence Plot

Use a short input such as "the cat sees the dog".

Token Sequence
Token IDs
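A minimal sketch of the sentence-to-IDs step, assuming a whitespace tokenizer with a vocabulary built on the fly (both hypothetical; the page's actual tokenizer may differ):

```python
def tokenize(sentence: str) -> tuple[list[str], list[int]]:
    """Split on whitespace and give each distinct token the next free ID."""
    tokens = sentence.lower().split()
    vocab: dict[str, int] = {}
    ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
    return tokens, ids

tokens, ids = tokenize("the cat sees the dog")
print(tokens)  # ['the', 'cat', 'sees', 'the', 'dog']
print(ids)     # [0, 1, 2, 0, 3] -- both "the" tokens share ID 0
```

Repeated tokens share an ID, but each position still gets its own row in the embedding matrix below.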

Sequence Embeddings

Each token position gets a 2D vector. Editing these values recomputes attention and moves the plot.
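One way to picture this step: an embedding table indexed by the token IDs yields the T × 2 matrix, and the page lets you overwrite individual entries. The table values here are placeholders:

```python
import numpy as np

ids = [0, 1, 2, 0, 3]                # from the tokenization step
emb_table = np.array([[1.0, 0.0],    # placeholder row per vocabulary entry
                      [0.5, 1.0],
                      [0.0, 1.5],
                      [-1.0, 0.5]])

E = emb_table[ids]                   # shape (T, 2): one 2D vector per position
E[1, 0] = 0.9                        # "editing" an entry, as the page allows
print(E.shape)                       # (5, 2)
```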

Value Matrix

These are the per-token value vectors V that attention will mix together.
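Since every attention output is a convex combination of these rows, a tiny sketch (placeholder numbers) makes the mixing concrete:

```python
import numpy as np

# Per-token value vectors, editable just like the embeddings.
V = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])

# One row of attention weights mixes the rows of V into a single vector.
w = np.array([0.5, 0.25, 0.25])   # non-negative, sums to 1
print(w @ V)                      # [1.25 0.75]
```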

Raw dot products
Row-wise softmax
Active token row

Dot Products

Entry (i, j) is visible only when token i may attend to token j.

Raw similarity scores with a causal mask, so future tokens are blocked.
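A sketch of this table with placeholder embeddings; entry (i, j) survives the mask only when j ≤ i:

```python
import numpy as np

E = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, 1.5]])      # placeholder (T, 2) embeddings

scores = E @ E.T                # scores[i, j] = E[i] . E[j]

# Causal mask: block entry (i, j) whenever j > i (no attending to the future).
T = len(E)
causal = np.tril(np.ones((T, T), dtype=bool))
masked = np.where(causal, scores, -np.inf)
print(masked)
# [[ 1.    -inf  -inf]
#  [ 0.5   1.25  -inf]
#  [ 0.    1.5   2.25]]
```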

Row-Wise Normalization

Each row is softmax-normalized only over the unmasked entries, so every row still sums to 1.

This is the causal self-attention table produced from the masked dot products.
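Continuing the sketch: exponentiating sends the -inf entries to exactly 0, so normalizing each row spreads the weight only over the allowed positions.

```python
import numpy as np

masked = np.array([[1.0, -np.inf, -np.inf],
                   [0.5, 1.25, -np.inf],
                   [0.0, 1.5, 2.25]])     # from the masked dot-product step

shifted = masked - masked.max(axis=1, keepdims=True)  # stability; -inf stays -inf
weights = np.exp(shifted)                             # masked entries become 0.0
A = weights / weights.sum(axis=1, keepdims=True)
print(A.sum(axis=1))                                  # [1. 1. 1.]
```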

A · V

The normalized attention matrix multiplies the value matrix to produce one output vector per token.

Each row is a weighted average of the value vectors at the current and earlier positions allowed by the causal mask.
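Putting the last two steps together with the same placeholder numbers (attention weights rounded): row 0 can only attend to itself, so its output is exactly V[0].

```python
import numpy as np

A = np.array([[1.00, 0.00, 0.00],   # softmax output from the previous step,
              [0.32, 0.68, 0.00],   # rounded; each row sums to ~1
              [0.07, 0.30, 0.63]])
V = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])

out = A @ V                         # shape (T, 2): one output vector per token
print(out[0])                       # [2. 0.] == V[0]
```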