Transformer Model Implementation Details

Today's article summarizes a detailed implementation of the Transformer deep learning model. It covers the model architecture and the key components used in practice, including self-attention, multi-head attention, positional feed-forward networks, residual connections, positional encoding, masking, and regularization.

Transformer overall architecture
Overview
Tensors and embeddings
Self-attention
Multi-head attention
Position-wise feed-forward networks
Residual connections and layer normalization
Positional encoding
Decoder
Masks: padding mask and sequence mask
Final linear and softmax layers
Embedding sharing
Regularization

1. Transformer model architecture

In 2017, the paper "Attention Is All You Need" proposed the Transformer model, which replaces RNNs commonly used in NLP tasks with self-attention structures. Compared with RNNs, a primary advantage is the ability to perform parallel computation. The overall Transformer architecture is shown in the figure below.

Transformer architecture diagram

2. Overview

Treat the Transformer as a black box: for machine translation, it maps an input sentence in one language to an output sentence in another language.

Transformer black-box overview

2.1 Encoder-Decoder

The Transformer is an encoder-decoder architecture. The middle portion can be divided into two components: the encoder stack and the decoder stack.

Encoder-decoder architecture

The encoder component consists of multiple encoder layers (the original paper used 6 layers), and the decoder component consists of the same number of decoder layers (also 6 in the original paper). Each encoder layer contains two sublayers:

Self-attention layer
Position-wise feed-forward network (FFN)

The encoder first passes inputs through the self-attention layer, then through the feed-forward network. The decoder contains the same two sublayers plus an additional encoder-decoder attention layer that lets the decoder attend to relevant parts of the encoder output.

Encoder and decoder components

3. Tensors and embeddings

Next we examine the vectors and tensors and how they flow through the model from input to output.

3.1 Word embeddings

As in typical NLP tasks, words are first converted to word vectors using embeddings. In the original Transformer, embedding dimension is 512.

Word embeddings to 512-d vectors

Word embeddings occur only at the bottom layer of the encoder stack. Each encoder receives a sequence of 512-dimensional vectors. The sequence length is a hyperparameter, typically set to the maximum sentence length in the training data.

3.2 Encoding after embedding

After embedding, each token vector passes through the encoder sublayers as shown below.

Embeddings flowing through encoder

4. Self-Attention

4.1 Self-attention overview

Consider this example sentence: "The animal didn’t cross the street because it was too tired." What does "it" refer to: animal or street? For humans this is trivial, but for an algorithm it is harder. Self-attention lets the model associate "it" with "animal" or other relevant tokens. When processing each position, self-attention allows the model to attend to other positions in the sequence to better encode the current token.

4.2 Self-attention mechanism

Self-attention uses Query (Q), Key (K), and Value (V) matrices derived from the same input. The scaled dot-product attention computes the dot product of Q and K, scales by sqrt(d_k), applies softmax to obtain weights, and then multiplies by V to produce the weighted sum. The basic computation is:

Scaled dot-product attention formula

4.3 Self-attention step-by-step

Step 1: For each input vector (token embedding), create three vectors: Query, Key, and Value. These are obtained by multiplying the token embedding by three learned weight matrices. In the original model the new vectors have dimension 64 while the embeddings are 512. The smaller dimension is chosen to make multi-head attention tractable.

Query, Key, Value per token

Step 2: Compute attention scores. For a token like "Thinking", compute a score for every token in the sentence by taking the dot product between the Query for "Thinking" and the Key of each token.

Attention score calculation

Step 3: Divide each score by sqrt(d_k) where d_k is the Key vector dimension. This stabilizes gradients during training.

Step 4: Apply softmax to obtain normalized attention weights.

Step 5: Multiply each softmax weight by the corresponding Value vector. High weights emphasize relevant positions; low weights downweight irrelevant positions.

Step 6: Sum the weighted Value vectors to produce the output for the current position. The resulting vectors are fed to the feed-forward network. In practice these operations are implemented efficiently using matrix multiplications.

Self-attention full process

4.4 Matrix implementation

Step 1: Stack all token embeddings into a matrix X. Compute Q, K, V matrices by multiplying X with three learned weight matrices.

Compute Q, K, V matrices from X

Each row of X is a token embedding. Rows of Q, K, V correspond to the Query, Key, Value vectors for each token. Step 2 compresses steps 2 through 6 into a single matrix expression using scaled dot-product and softmax.

Matrix form of self-attention

5. Multi-head attention

5.1 Architecture

Multi-head attention enhances self-attention by performing multiple attention operations in parallel with different learned linear projections. For each head, Q, K, V are projected via distinct weight matrices, attention is computed, and the results are concatenated and projected again by a final linear transformation.

Multi-head attention structure

In the original paper h=8 heads are used. Each head computes attention in a lower-dimensional subspace (e.g., d_k = 64), then the head outputs are concatenated and transformed to the model dimension.

Multi-head attention formula

Compute attention separately with different weight matrices to obtain multiple attention outputs, then concatenate and linearly project to produce a single output matrix suitable for the feed-forward network.

Separate QKV per head

Concatenate heads and project

5.2 Summary

Multi-head attention maps the same Q, K, V into different subspaces and performs attention in each. This enables the model to capture various types of relationships across positions while keeping parameter counts tractable and helping generalization.

Multi-head attention overview

Different heads can focus on different relevant tokens; for example, when encoding "it", one head may focus on "animal" and another on "tired".

Attention heads focusing on different tokens

6. Position-wise feed-forward networks

The position-wise feed-forward network is a fully connected feed-forward network applied independently and identically to each position. It consists of two linear transformations with a ReLU activation in between:

The input and output dimensions are the model dimension (e.g., 512). The inner-layer dimensionality is typically larger (e.g., 2048 in the original paper).

7. Residual connections and layer normalization

Each sublayer in both encoder and decoder (self-attention, encoder-decoder attention, and FFN) has a residual connection followed by layer normalization. Each sublayer output is: LayerNorm(x + Sublayer(x)).

This pattern applies across encoder and decoder layers. To facilitate residual connections, the output dimensionality of embeddings and all sublayers must match the model dimension.

8. Positional encoding

The model needs a way to represent token order. Transformer adds a positional encoding vector to each input embedding. These positional vectors follow a deterministic pattern that allows the model to infer relative positions.

Positional encoding added to embeddings

If the embedding dimension is 4, the positional encodings look like this:

The positional encoding uses sine and cosine functions of different frequencies:

PE(pos,2i) = sin(pos / 10000^(2i/d_model)) PE(pos,2i+1) = cos(pos / 10000^(2i/d_model))

These functions let the model easily learn to attend to relative positions. Note that some reference implementations interleave sine and cosine differently, but the key property is that the encodings generalize to longer sequences than seen during training.

Positional encoding visualization from reference implementation

9. Decoder

The encoder outputs Key and Value vectors that are used by each decoder layer's encoder-decoder attention to help the decoder attend to appropriate input positions. Decoding proceeds one time step at a time, producing one output token per step. Each output token is fed as input to the next decoding step (with positional encoding applied to decoder inputs). The encoder-decoder attention builds Query from the decoder and Key/Value from the encoder outputs.

Decoder attending to encoder outputs

10. Masks

Masks hide certain positions so they do not affect attention scores. Two masks are used: padding mask and sequence mask.

10.1 Padding mask

Input sequences in a batch may have different lengths, so shorter sequences are padded with zeros. These padding positions are meaningless and should not receive attention. In practice, padding positions are assigned a large negative value (negative infinity) before the softmax so their attention weights become approximately zero.

10.2 Sequence mask

Sequence mask prevents the decoder from attending to future positions. At decoding step t, the model should only depend on outputs up to t-1. Implement this by applying an upper-triangular mask (with zeros in the allowed lower triangle and negative infinity elsewhere) to block attention to future positions.

For decoder self-attention, both padding mask and sequence mask are applied (addition of the two masks). For other attention layers, only padding mask is typically needed.

11. Final linear and softmax layers

The decoder output at each position is a float vector. To convert it to a token, pass it through a linear layer to produce logits, then apply softmax to obtain a probability distribution over the vocabulary.

11.1 Linear layer

A simple fully connected layer maps the decoder output vector to the vocabulary-size logits vector.

11.2 Softmax

Softmax converts logits into probabilities. The token with the highest probability is selected as the output for that time step.

12. Embedding sharing

The original paper mentions sharing the weight matrix between the input embedding, output embedding in the decoder, and the final linear layer. If the weight matrix is shared, the embedding matrix is multiplied by sqrt(d_model) when used as input embeddings.

13. Regularization

Common regularization techniques used when training Transformers include:

Dropout: applied to the outputs of each sublayer before residual connection and layer norm, and applied after summing embeddings and positional encodings.
Label smoothing: used for classification/sequence prediction targets during training.

References

Original paper: "Attention Is All You Need" (Vaswani et al., 2017)
Illustrated reference: Jalammar's visual guide to the Transformer (English)