Transformer from Scratch
Updated: Feb 20

In this blog post I will code the Transformer model from scratch and train it to translate English to Italian. The code can be found HERE.
Transformers, in the context of machine learning and natural language processing (NLP), are a type of deep learning model architecture that has had a profound impact on a wide range of NLP tasks. They were introduced in the paper titled "Attention Is All You Need" by Vaswani et al. in 2017 and have since become a fundamental building block in the field of deep learning.
GPT (OpenAI), PaLM (Google), and Llama (Meta) are all built on transformers.
Here are the key concepts and components of transformers:
Attention Mechanism: The core innovation of transformers is the attention mechanism, which allows the model to weigh the importance of different parts of an input sequence when processing it. This mechanism allows transformers to capture long-range dependencies in data efficiently, making them well-suited for sequential data like text.
Self-Attention: Transformers use self-attention to process input sequences. In self-attention, each element in the input sequence attends to all other elements, producing a weighted sum of their representations. This allows the model to consider context from all positions in the sequence simultaneously.
Multi-Head Attention: Transformers often employ multiple attention heads, each learning a different attention pattern. This multi-head attention mechanism allows the model to focus on different aspects of the input data.
Positional Encoding: Unlike recurrent neural networks (RNNs) and convolutional neural networks (CNNs), transformers don't have an inherent sense of the order of elements in a sequence. To address this, positional encodings are added to the input embeddings to provide information about the positions of words in the sequence.
Encoder-Decoder Architecture: Transformers can be used for both sequence-to-sequence tasks, such as machine translation, and single-sequence tasks, such as language modeling. For sequence-to-sequence tasks, transformers typically use an encoder-decoder architecture, where the encoder processes the input sequence, and the decoder generates the output sequence.
Pretraining and Fine-Tuning: Transformers are often pretrained on massive amounts of text data in an unsupervised manner. This pretraining allows the model to learn a general understanding of language. After pretraining, the model can be fine-tuned on specific downstream tasks, such as text classification, translation, or question answering.
Attention Visualization: Transformers make it possible to visualize which parts of the input sequence the model is paying attention to during inference. This makes them more interpretable than some other deep learning architectures.
Scalability: Transformers are highly parallelizable, making them well-suited for GPU and distributed computing. This scalability has contributed to their popularity.
Prominent examples of transformer-based models include BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and T5 (Text-to-Text Transfer Transformer), among others. These models have achieved state-of-the-art results on various NLP tasks and have had a significant impact on the field of artificial intelligence.
Transformer Architecture - overview
The transformer is made up of two main units, the encoder and the decoder. Both units are built from a stack of N blocks.

At the core of both units lies the attention mechanism:

The encoder uses a single attention mechanism, known as self-attention, in which the queries (Q), keys (K), and values (V) are all derived from the same input: the word embeddings combined with the positional encoding. Unlike in the decoder, the encoder's attention is not masked, allowing each token to attend to all others in the sequence. Note that pad tokens are still masked in the self-attention to prevent them from influencing the attention scores; this ensures that padding does not contribute to the learned representations, preserving the integrity of the sequence's contextual understanding.
In the decoder, Qs, Ks, and Vs are also derived from embeddings, but they follow a different structure due to the two types of attention mechanisms: masked self-attention and encoder-decoder attention.
Masked Self-Attention:
Similar to the encoder, the decoder first computes Q, K, and V from the embedded input (summed with positional encoding).
However, a crucial difference is that the attention mechanism is masked, meaning each word can only attend to previous words (or itself), preventing the model from seeing future tokens during training.
Encoder-Decoder Attention:
This mechanism allows the decoder to incorporate information from the encoder’s output.
Here, Q comes from the decoder, while K and V come from the encoder's output.
This enables the decoder to focus on relevant parts of the encoded sequence when generating the output.
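All three attention variants (encoder self-attention, masked self-attention, and encoder-decoder attention) share the same underlying operation, the scaled dot-product attention from the original paper:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\]

where d_k is the dimensionality of the keys (the per-head dimension in the implementation below).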
Coding the Transformer from scratch
We will code each component of the transformer and later train it to translate English to Italian.
1 - Attention layer
The attention layer is the most crucial component of the Transformer architecture, as it enables the model to capture dependencies between tokens regardless of their distance in the sequence. Unlike traditional sequential models such as RNNs and LSTMs, which process tokens one by one, the attention mechanism allows each token to attend to all other tokens simultaneously, making it highly efficient at capturing long-range dependencies.
a. Initialization. The Attention layer is initialized with:
embed_size: the total dimensionality of the input word embeddings.
heads: the number of attention heads (each head learns different attention patterns).
head_dim: the dimensionality of each individual attention head, calculated as embed_size // heads.
An assert statement ensures that embed_size is divisible by heads for even splitting.
The following Torch layers are initialized:
self.values, self.keys, self.queries: these layers are instantiated to learn separate Q, K, and V transformations. They project the input embeddings into different spaces for computing attention scores.
self.fc_out: recombines the multi-head attention outputs into a single vector representation for each token.
self.dropout: to reduce overfitting
b. Forward pass:

Values, keys, and queries represent the word embeddings.
The input embeddings are passed through three separate linear layers: this ensures that the query, key, and value transformations are learned independently, even though they originate from the same input (A).
Each input is reshaped into self.heads attention heads, each with self.head_dim dimensions: this allows the model to process different patterns of relationships between words in parallel (multi-head attention) (B).
Computing Scaled Dot-Product Attention:
The raw attention scores (queries and keys multiplied together) are calculated with a batched matrix multiplication using torch.einsum (C).
Masking (optional): If a mask is provided (e.g., for preventing attention to padding tokens or future words in decoding), it replaces masked positions with a very low value (-1e10) (D).
The scores are scaled by dividing by √(embed_size) and then normalized with a softmax; the scaling stabilizes the gradients (E).
Dropout is applied to prevent overfitting.
The computed attention scores are multiplied by the value vectors (V) to generate the final attention-weighted representations (F).
The heads are then concatenated back into a single tensor of shape (N, query_len, embed_size) (G).
The final representation is projected back to the original embedding size using fc_out: This step ensures that the attention mechanism outputs a tensor with the same dimensionality as the input, making it compatible with the next Transformer layer (H).
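Putting the steps above together, here is a minimal sketch of the attention layer, written to match the description in this section (the exact code in the repo may differ slightly):

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    # Minimal sketch of the attention layer described above.
    def __init__(self, embed_size, heads, dropout):
        super().__init__()
        assert embed_size % heads == 0, "embed_size must be divisible by heads"
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        # Separate learned projections for values, keys and queries
        self.values = nn.Linear(embed_size, embed_size)
        self.keys = nn.Linear(embed_size, embed_size)
        self.queries = nn.Linear(embed_size, embed_size)
        self.fc_out = nn.Linear(embed_size, embed_size)  # recombines the heads
        self.dropout = nn.Dropout(dropout)

    def forward(self, values, keys, queries, mask=None):
        N, query_len = queries.shape[0], queries.shape[1]
        value_len, key_len = values.shape[1], keys.shape[1]

        # (A) independent linear projections
        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)

        # (B) split into heads: (N, len, heads, head_dim)
        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = queries.reshape(N, query_len, self.heads, self.head_dim)

        # (C) raw attention scores: (N, heads, query_len, key_len)
        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])

        # (D) optional masking of pad tokens / future positions
        if mask is not None:
            energy = energy.masked_fill(mask == 0, -1e10)

        # (E) scale, normalize, and apply dropout
        attention = torch.softmax(energy / (self.embed_size ** 0.5), dim=3)
        attention = self.dropout(attention)

        # (F) weight the values, then (G) concatenate the heads
        out = torch.einsum("nhql,nlhd->nqhd", [attention, values])
        out = out.reshape(N, query_len, self.heads * self.head_dim)

        # (H) project back to the original embedding size
        return self.fc_out(out)
```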
2 - TransformerBlock layer
The TransformerBlock class implements a single block of a Transformer model, which consists of two key components:
Multi-Head Self-Attention (attention layer)
Feed-Forward Network
Each component is followed by layer normalization and residual connections to ensure stability and efficient learning.
a. Initialization. The Transformer block is initialized with:

embed_size: the dimensionality of input embeddings.
heads: number of attention heads for SelfAttention.
dropout: dropout rate to prevent overfitting.
expansion: expansion factor for the feed-forward network.
The following Torch layers are initialized:
self.attention: self-attention layer
self.norm1, self.norm2: layer normalization
self.feed_forward: the feed-forward network (FFN), which expands the embedding and projects it back to embed_size.
self.dropout: to reduce overfitting
b. Forward pass:

The self-attention mechanism computes the attention-weighted representation of the input sequence (A).
Residual Connection + Layer Normalization (B). A skip connection (residual connection) is added:
Instead of using just the attention output, we add the original query (residual learning).
Layer normalization stabilizes the training process.
Dropout helps reduce overfitting.
Feed-Forward Network (FFN) (C). The transformed representation is passed through a two-layer feed-forward network, which introduces non-linearity and expands the embedding representation.
Residual Connection + Layer Normalization (Again) (D):
Another skip connection is added, ensuring gradient flow.
Layer normalization is applied again after the FFN.
Dropout is applied to further regularize the network.
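A minimal sketch of the block, reusing the SelfAttention class sketched above (the ReLU activation and the exact layer ordering are assumptions):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    # Single encoder-style block: self-attention + feed-forward,
    # each followed by a residual connection, layer norm, and dropout.
    def __init__(self, embed_size, heads, dropout, expansion):
        super().__init__()
        self.attention = SelfAttention(embed_size, heads, dropout)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, expansion * embed_size),
            nn.ReLU(),
            nn.Linear(expansion * embed_size, embed_size),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query, mask=None):
        # (A) attention-weighted representation
        attention = self.attention(value, key, query, mask)
        # (B) residual connection + layer norm + dropout
        x = self.dropout(self.norm1(attention + query))
        # (C) two-layer feed-forward network
        forward = self.feed_forward(x)
        # (D) second residual connection + layer norm + dropout
        return self.dropout(self.norm2(forward + x))
```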
3 - Encoder
This Encoder class implements the encoder component of a Transformer model. It processes an input sequence by embedding the words, adding positional encodings, and passing the result through multiple Transformer blocks.
a. Initialization. The Encoder is initialized with:

source_vocab_size: size of the vocabulary (number of unique tokens).
embed_size: dimensionality of word embeddings.
num_layers: number of Transformer blocks in the encoder.
heads: number of attention heads per block.
device: specifies if the model runs on GPU or CPU.
expansion: expansion factor for the feed-forward network in Transformer blocks.
dropout: dropout probability to prevent overfitting.
max_sentence_length: maximum allowed sentence length for positional encoding.
The following Torch layers are initialized:
self.word_embedding: this word embedding layer maps input tokens (integers) to dense vectors of size embed_size.
self.pos_embedding: this positional embedding layer adds position information to each token (since self-attention has no inherent sense of order).
self.layers: this is a stack of TransformerBlocks where each block applies self-attention and a feed-forward network.
self.dropout: it helps prevent overfitting by randomly zeroing some values.
b. Forward pass:

Positional embeddings and word embeddings are computed and summed together. Next, dropout is applied for regularization (A).
Pass through transformer blocks (B):
Each Transformer block refines the representation of the input sequence.
Self-attention is applied (same query, key, and value since it's an encoder).
The final output is a contextual representation of each word.
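A minimal sketch of the encoder, reusing the TransformerBlock above; using a learned nn.Embedding for the positional information is an assumption about this implementation:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Embeds the source tokens, adds positional information, and passes the
    # result through a stack of TransformerBlocks.
    def __init__(self, source_vocab_size, embed_size, num_layers, heads,
                 device, expansion, dropout, max_sentence_length):
        super().__init__()
        self.device = device
        self.word_embedding = nn.Embedding(source_vocab_size, embed_size)
        self.pos_embedding = nn.Embedding(max_sentence_length, embed_size)
        self.layers = nn.ModuleList(
            [TransformerBlock(embed_size, heads, dropout, expansion)
             for _ in range(num_layers)]
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length, device=self.device).expand(N, seq_length)
        # (A) word + positional embeddings, followed by dropout
        out = self.dropout(self.word_embedding(x) + self.pos_embedding(positions))
        # (B) each block uses the same tensor as query, key and value (self-attention)
        for layer in self.layers:
            out = layer(out, out, out, mask)
        return out
```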
4 - DecoderBlock
This DecoderBlock class implements a single block of the Transformer decoder. It differs from the encoder by incorporating masked self-attention and an additional encoder-decoder attention mechanism that allows the decoder to focus on relevant parts of the encoder's output.
a. Initialization. The DecoderBlock is initialized with:

The following Torch layers are initialized:
self.attention: this Masked Self-Attention layer ensures the model cannot attend to future tokens (used in autoregressive generation).
self.norm: the Layer Normalization stabilizes training and prevents vanishing gradients.
self.transformer_block: this layer performs encoder-decoder attention + feed-forward. It reuses a standard TransformerBlock to attend to the encoder outputs.
self.dropout: it reduces overfitting by randomly setting activations to zero.
b. Forward pass:

Masked Self-Attention:
It computes self-attention on x (decoder input). Masking ensures no future tokens are accessed (A).
Applies skip connection (residual learning) + layer normalization (B).
Encoder-Decoder Attention (C):
Values and keys come from the encoder output, while the query comes from the decoder's masked self-attention output.
The decoder's query is refined via cross-attention with the encoder.
The Transformer block applies encoder-decoder attention + feed-forward processing.
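A minimal sketch of the decoder block, reusing the SelfAttention and TransformerBlock classes from above:

```python
import torch.nn as nn

class DecoderBlock(nn.Module):
    # Masked self-attention followed by a TransformerBlock that performs
    # encoder-decoder attention + feed-forward processing.
    def __init__(self, embed_size, heads, dropout, expansion):
        super().__init__()
        self.attention = SelfAttention(embed_size, heads, dropout)
        self.norm = nn.LayerNorm(embed_size)
        self.transformer_block = TransformerBlock(embed_size, heads, dropout, expansion)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, value, key, source_mask, target_mask):
        # (A) masked self-attention on the decoder input
        attention = self.attention(x, x, x, target_mask)
        # (B) residual connection + layer normalization
        query = self.dropout(self.norm(attention + x))
        # (C) encoder-decoder attention: keys/values from the encoder, query from the decoder
        return self.transformer_block(value, key, query, source_mask)
```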
5 - Decoder
The Decoder class implements the full Transformer decoder, which generates text (or sequences) based on encoder outputs.
a. Initialization. The Decoder is initialized with:

embed_size: dimensionality of word embeddings.
device: specifies if the model runs on GPU or CPU.
The following Torch layers are initialized:
self.word_embedding: converts input token indices into dense vectors.
self.pos_embedding: adds positional encodings to retain sequence order.
self.layers: this is a stack of DecoderBlocks. Each block contains:
Masked self-attention (to prevent peeking at future tokens).
Encoder-decoder attention (to attend to encoder outputs).
Feed-forward network (to process information further).
self.fc_out: the final linear layer, which maps the decoder output embeddings to the vocabulary size for the final predictions.
self.dropout: it reduces overfitting by randomly setting activations to zero.
b. Forward pass:

Compute Token + Positional Embeddings (A):
Embeds input tokens (x).
Adds positional encodings (pos_embedding) to retain order information.
Pass Through Decoder Blocks. Each DecoderBlock (B):
Masked self-attention on x (decoder input).
Encoder-decoder attention (cross-attention with enc_out).
Feed-forward transformation.
Skip connections + normalization to stabilize learning.
Map to Vocabulary (C):
Projects the decoder output to vocabulary size.
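A minimal sketch of the full decoder, mirroring the Encoder above (learned positional embeddings are again an assumption):

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    # Embeds the target tokens, runs them through a stack of DecoderBlocks,
    # and projects the result to the vocabulary size.
    def __init__(self, target_vocab_size, embed_size, num_layers, heads,
                 device, expansion, dropout, max_sentence_length):
        super().__init__()
        self.device = device
        self.word_embedding = nn.Embedding(target_vocab_size, embed_size)
        self.pos_embedding = nn.Embedding(max_sentence_length, embed_size)
        self.layers = nn.ModuleList(
            [DecoderBlock(embed_size, heads, dropout, expansion)
             for _ in range(num_layers)]
        )
        self.fc_out = nn.Linear(embed_size, target_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, source_mask, target_mask):
        N, seq_length = x.shape
        positions = torch.arange(0, seq_length, device=self.device).expand(N, seq_length)
        # (A) token + positional embeddings
        out = self.dropout(self.word_embedding(x) + self.pos_embedding(positions))
        # (B) each block: masked self-attention, cross-attention, feed-forward
        for layer in self.layers:
            out = layer(out, enc_out, enc_out, source_mask, target_mask)
        # (C) project to the vocabulary size
        return self.fc_out(out)
```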
6 - Transformer Model
This Transformer class is the full sequence-to-sequence Transformer model, consisting of:
An encoder which processes the source sequence.
A decoder which generates the target sequence.
Masking functions to handle padding and prevent information leakage.
a. Initialization. The Transformer Model is initialized with:

embed_size: dimensionality of word embeddings.
device: specifies if the model runs on GPU or CPU.
self.source_pad_index & self.target_pad_index: used to create masks.
self.debug: prints intermediate steps for debugging.
The Transformer Model is initialized with the following Torch layers:
self.encoder: the Encoder layer processes the input sequence into meaningful embeddings.
self.decoder: the Decoder layer uses encoder outputs to generate the target sequence.
self.linear_out: the final Linear Layer projects decoder outputs to the vocabulary size.
b. Masking Functions:

make_source_mask: it creates source masks (A):
Ensures the model ignores padding tokens in the encoder.
Output shape: (N, 1, 1, source_length), where N is batch size.
make_target_mask: it creates target masks (B):
It combines two masks:
Padding mask: Ensures the decoder ignores padding tokens.
Look-ahead mask: Prevents the decoder from "cheating" (seeing future tokens).
Output shape: (N, 1, seq_len, seq_len).
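A minimal sketch of the two masking functions; they are shown here as standalone helpers for clarity, whereas in the full model they are methods of the Transformer class:

```python
import torch

def make_source_mask(source, source_pad_index):
    # (A) Padding mask: True where the token is not padding.
    # Shape (N, 1, 1, source_length) so it broadcasts over heads and query positions.
    return (source != source_pad_index).unsqueeze(1).unsqueeze(2)

def make_target_mask(target, target_pad_index):
    # (B) Padding mask (N, 1, 1, target_length) combined with a lower-triangular
    # look-ahead mask (target_length, target_length) -> (N, 1, target_length, target_length).
    N, target_length = target.shape
    pad_mask = (target != target_pad_index).unsqueeze(1).unsqueeze(2)
    look_ahead = torch.tril(torch.ones(target_length, target_length,
                                       dtype=torch.bool, device=target.device))
    return pad_mask & look_ahead
```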
c. Forward pass:

Masks are generated (A):
Ensures that the encoder and decoder correctly ignore padding.
Prevents the decoder from looking at future words.
Encoder Processes Source Input (B):
Converts source tokens into contextual representations.
Decoder Generates Output (C):
The decoder attends to the encoded representation (enc_src).
The target sequence is generated step by step, following masking constraints.
Debugging (Optional) (D):
Displays intermediate results for debugging
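A minimal sketch of the full model, wiring together the Encoder, Decoder and masking helpers sketched above. In this sketch the projection to the vocabulary is handled by the Decoder's fc_out, so the separate linear_out layer and the debug printing mentioned above are omitted:

```python
import torch.nn as nn

class Transformer(nn.Module):
    # Full sequence-to-sequence model: builds the masks, encodes the source,
    # and decodes the target.
    def __init__(self, source_vocab_size, target_vocab_size,
                 source_pad_index, target_pad_index,
                 embed_size, num_layers, heads, expansion,
                 dropout, max_sentence_length, device):
        super().__init__()
        self.source_pad_index = source_pad_index
        self.target_pad_index = target_pad_index
        self.encoder = Encoder(source_vocab_size, embed_size, num_layers, heads,
                               device, expansion, dropout, max_sentence_length)
        self.decoder = Decoder(target_vocab_size, embed_size, num_layers, heads,
                               device, expansion, dropout, max_sentence_length)

    def forward(self, source, target):
        # (A) build the padding and look-ahead masks
        source_mask = make_source_mask(source, self.source_pad_index)
        target_mask = make_target_mask(target, self.target_pad_index)
        # (B) encode the source sequence into contextual representations
        enc_source = self.encoder(source, source_mask)
        # (C) decode, attending to enc_source under the masking constraints
        return self.decoder(target, enc_source, source_mask, target_mask)
```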
Training the Transformer - helper functions and classes
In the git repo readers can find the full code to collect the data, filter it, tokenize words, create the dataset and train the model. Below I explain at a high level what each function and class does:
1 - Function filter_data: this function filters a list of translation dictionaries based on length constraints (either by character count or word count) for two specified languages, ensuring the translations fall within a given minimum and maximum range.
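A rough sketch of what such a filter can look like, assuming each item is a dictionary keyed by language code; the actual signature in the repo may differ:

```python
def filter_data(data, lang1, lang2, min_len=5, max_len=500, unit="char"):
    # Keep only translation pairs whose length (in characters or words)
    # falls within [min_len, max_len] for both languages.
    def length(text):
        return len(text) if unit == "char" else len(text.split())
    return [pair for pair in data
            if min_len <= length(pair[lang1]) <= max_len
            and min_len <= length(pair[lang2]) <= max_len]
```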

2 - Function create_tokenizer_and_corpus_train_test: this function retrieves, tokenizes, and splits bilingual text data into training and validation sets while returning trained tokenizers for both source and target languages. Inside it calls two self-explanatory functions, build_tokenizer and retrieve_data.

3 - Class TranslationDataset: this class defines a PyTorch dataset for machine translation, tokenizing and processing source and target language sentences while ensuring they adhere to specified sequence lengths and include special tokens like [SOS], [EOS], and [PAD].

4 - Function train: this function trains a sequence-to-sequence model using an encoder-decoder architecture, tracks training and validation losses, implements early stopping, saves the best-performing model, and plots loss curves.
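For illustration, a condensed sketch of such a training loop with teacher forcing and early stopping (the real function also plots the loss curves; parameter names here are assumptions):

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, target_pad_index,
          epochs=50, lr=3e-4, patience=3, device="cpu", checkpoint="best_model.pt"):
    criterion = nn.CrossEntropyLoss(ignore_index=target_pad_index)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, epochs_without_improvement = float("inf"), 0

    for epoch in range(epochs):
        model.train()
        for source, target in train_loader:
            source, target = source.to(device), target.to(device)
            # Feed the target shifted right; predict the next token at each position.
            output = model(source, target[:, :-1])
            loss = criterion(output.reshape(-1, output.shape[-1]),
                             target[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Validation loss for this epoch
        model.eval()
        val_loss, n_batches = 0.0, 0
        with torch.no_grad():
            for source, target in val_loader:
                source, target = source.to(device), target.to(device)
                output = model(source, target[:, :-1])
                val_loss += criterion(output.reshape(-1, output.shape[-1]),
                                      target[:, 1:].reshape(-1)).item()
                n_batches += 1
        val_loss /= max(n_batches, 1)

        # Save the best model and stop early if validation stops improving
        if val_loss < best_val:
            best_val, epochs_without_improvement = val_loss, 0
            torch.save(model.state_dict(), checkpoint)
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
```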

5 - Function translate: this function translates a given source text using a trained sequence-to-sequence model by tokenizing the input, passing it through the encoder-decoder architecture, iteratively predicting tokens until the end-of-sequence token is reached, and then decoding the output into the target language.
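A sketch of greedy decoding along these lines, assuming HuggingFace tokenizers-style objects exposing .encode(text).ids, .token_to_id() and .decode() (the actual implementation may differ):

```python
import torch

def translate(model, source_text, source_tokenizer, target_tokenizer,
              max_len=100, device="cpu"):
    # Greedy decoding: start from [SOS] and repeatedly append the most likely
    # next token until [EOS] is produced or max_len is reached.
    model.eval()
    sos = target_tokenizer.token_to_id("[SOS]")
    eos = target_tokenizer.token_to_id("[EOS]")

    source = torch.tensor([source_tokenizer.encode(source_text).ids], device=device)
    output_ids = [sos]
    with torch.no_grad():
        for _ in range(max_len):
            target = torch.tensor([output_ids], device=device)
            logits = model(source, target)            # (1, len, vocab_size)
            next_id = logits[0, -1].argmax().item()   # most likely next token
            output_ids.append(next_id)
            if next_id == eos:
                break
    return target_tokenizer.decode(output_ids[1:])
```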

Training the Transformer - Results
I instantiated a 256M-parameter Transformer model for English-to-Italian translation using the OpusBooks dataset. Sentences were filtered to include only those with character counts between 5 and 500, and the dataset was trimmed to 31,303 translation pairs (90% for training, 10% for validation). The model was trained for 13 epochs before early stopping was triggered.
The training and validation loss throughout the training process is shown below.

The best model was from epoch 8 and was used to translate random sentences from the validation set. While some translations closely match the ground truth, others are entirely inaccurate. It's important to note that the dataset I used is relatively small compared to the original Transformer paper, where the English-German dataset contained 4.5 million sentence pairs and the English-French dataset had 36 million pairs.
Two examples of acceptable translations:


Two examples of inaccurate translations:


Conclusion
In this blog post, I walked you through the step-by-step process of implementing a Transformer model from scratch. Despite using a relatively small dataset, the model successfully generated translations with varying degrees of accuracy, showcasing both the strengths and limitations of training on limited data. This experiment highlights the effectiveness of Transformers in language tasks and lays the groundwork for further improvements, such as expanding the dataset, fine-tuning hyperparameters, or exploring more advanced architectures.