Originally developed as the backbone of the seminal ‘Attention is All You Need’ paper by Vaswani et al. in 2017, the Transformer has evolved into a foundational technology that has redefined the landscape of various machine learning tasks. In this post, we begin our exploration of the Transformer architecture by understanding the multi-head attention mechanism and how it powers the Transformer. Then, we use this understanding to break down the fundamental components of the Transformer architecture.
The need for an attention mechanism stems from the inherent limitations of RNNs. These limitations are discussed below, along with an explanation of how the Transformer solves them by using an attention mechanism:
- When modelling long-term dependencies, RNNs are prone to the vanishing or exploding gradient problem.
Transformer networks predominantly rely on attention mechanisms. These mechanisms create direct connections between elements anywhere in a sequence, effectively mitigating the challenges associated with long-range dependencies. With Transformers, long-range and short-range dependencies are treated uniformly, greatly reducing the likelihood of vanishing or exploding gradients. The entire sequence is processed at once through a fixed number of layers, minimising the occurrence of gradient-related issues.
- RNNs lack the ability for parallel computation. While GPUs enable parallel processing, RNNs, operating as sequential models, conduct computations in a step-by-step fashion and cannot be parallelised.
In the self-attention mechanism, each element in the sequence can attend to all other elements with weighted connections. This means that computations involving different elements are independent of each other, allowing for parallel processing. Moreover, Transformers use multi-head attention, where the self-attention mechanism is applied in parallel across multiple ‘heads’. Each head can focus on different aspects of the input sequence, enabling multiple computations to occur concurrently.
Transformer attention comprises a few critical components. q and k are vectors representing a query and a key respectively, each of dimension dₖ. Similarly, the vector v of dimension dᵥ represents a value. Corresponding to these vectors, we have the Q, K and V matrices, which pack together sets of queries, keys, and values respectively. The queries, keys, and values used as inputs to the attention mechanism are different projections of the same input sentence (in the context of machine translation). The output is generated by calculating a weighted sum of the values, where the weights are determined by a compatibility function that measures the relationship between each query and its corresponding key.
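As a rough illustration of how these projections are formed, the following NumPy sketch builds Q, K and V from the same input matrix. The dimensions are made up for illustration and random matrices stand in for the learned projection weights:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k, d_v = 5, 16, 8, 8   # illustrative sizes, not the paper's values
x = rng.normal(size=(seq_len, d_model))    # one input sentence as a matrix of embeddings

# random placeholders standing in for the learned projection weights
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))

# Q, K and V are different linear projections of the same input
Q, K, V = x @ W_q, x @ W_k, x @ W_v
print(Q.shape, K.shape, V.shape)           # (5, 8) (5, 8) (5, 8)
```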
The attention mechanism essentially maps a query and a set of key-value pairs to an output. In ‘Attention is All You Need’, Vaswani et al. propose a scaled dot-product attention and then build on it to propose multi-head attention.
The scaled dot-product attention mechanism is implemented by computing the dot product of each query q with all of the keys k. Each result is then divided by √dₖ, after which a softmax function is applied. The purpose of the scaling factor is to keep the dot products from growing too large, which would push the softmax into regions where its gradients become vanishingly small.
The softmax output gives the weights that are then used to form a weighted sum of the values v. In practice, this process is carried out using the matrices Q, K and V as inputs. Essentially, the process can be represented as follows:
attention(Q, K, V) = softmax(QKᵀ / √dₖ)V
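A minimal NumPy sketch of this computation might look as follows (the softmax helper and the function name are my own, not from the paper):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # compatibility of every query with every key
    weights = softmax(scores, axis=-1)        # one weight distribution per query
    return weights @ V                        # weighted sum of the values
```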
This single-head attention mechanism is the foundation of the multi-head attention mechanism proposed by Vaswani et al. The motivation behind using multiple heads is to allow the attention mechanism to extract information from different representation subspaces. Essentially, this involves linearly projecting the input matrices h times using learned projections. Scaled dot-product attention is then applied to each of these h heads in parallel, producing h outputs. These outputs are concatenated and projected once more to produce the final result.
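Building on the single-head sketch above, a simplified multi-head version could look like this (the function signature is my own; real implementations typically fuse the per-head projections into single larger matrices, which this sketch does not do):

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o):
    """W_q, W_k and W_v are lists of per-head projection matrices (one entry per head)
    and W_o is the final output projection; reuses scaled_dot_product_attention above."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        Q, K, V = x @ Wq_i, x @ Wk_i, x @ Wv_i        # per-head linear projections of the input
        heads.append(scaled_dot_product_attention(Q, K, V))
    return np.concatenate(heads, axis=-1) @ W_o       # concatenate the h outputs and project
```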
Now that we have familiarised ourselves with the attention mechanism of the Transformer, we can explore how it fits into the Transformer architecture. The Transformer is essentially an encoder-decoder structure. The encoder maps the input sequence to a sequence of continuous representations; the decoder receives this encoder output, together with its own output from the previous step, and produces the output sequence.
Since the Transformer architecture does not make use of recurrence, it has no inherent notion of the relative position of words in a sequence. To solve this, positional encodings are added to the input embeddings. These encodings are sinusoidal in nature and have the same dimensions as the embeddings, so that they can simply be added to the input embeddings to provide information about the relative position of the elements.
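A sketch of these sinusoidal encodings is given below (assuming an even embedding dimension); because the encoding matrix has the same shape as the embeddings, the two can simply be added:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Even dimensions use sine, odd dimensions use cosine,
    with wavelengths increasing geometrically with the dimension index."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# embeddings = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```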
The encoder is a stack of 6 identical layers. Each layer is composed of two sublayers performing distinct functions. The first sublayer is a multi-head attention mechanism that receives distinct linearly projected versions of the queries, keys and values, produces h outputs in parallel, and combines them to generate its output. The second sublayer is a fully connected feed-forward network, which performs a linear transformation, followed by a Rectified Linear Unit (ReLU) activation and then another linear transformation. This is represented as:
FFN(x) = ReLU(xW₁ + b₁)W₂ + b₂
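In code, this position-wise feed-forward network is just two matrix multiplications with a ReLU in between, applied identically at every position (a minimal sketch, with the weights passed in explicitly):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: linear, ReLU, then linear again."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU(xW1 + b1)
    return hidden @ W2 + b2                 # second linear transformation
```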
Each layer employs a different set of weights and biases. Additionally, each sublayer has a residual connection around it, and a normalisation layer follows every sublayer. The function of this layer is to normalise the sum of the sublayer input and the sublayer output:
layernorm(x + sublayer(x))
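A bare-bones sketch of this 'add and norm' step, omitting the learned gain and bias that layer normalisation usually carries, is:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalise each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    # residual connection around the sublayer, followed by layer normalisation
    return layer_norm(x + sublayer_output)
```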
The decoder side is similar to the encoder in the sense that it also consists of 6 identical layers. However, each layer is made up of 3 sublayers, each of which has a residual connection around it and is followed by a normalisation layer. As on the encoder side, positional encodings are added to the decoder's input embeddings. The first sublayer receives the output of the previous decoder layer (for the bottom layer, the positionally encoded target embeddings) and performs multi-head self-attention on it.
One major difference between the encoder and the decoder is that the encoder attends to all words in the input sequence irrespective of their position, while the decoder only attends to the preceding words. This means that for any given position, the decoder's prediction can only depend on the words that come before it. To make this possible, a mask is introduced over the values produced by the scaled dot product of Q and K. This mask suppresses the matrix values corresponding to illegal connections, i.e. connections to future positions, allowing the decoder to remain unidirectional and retain its auto-regressive property. The masked self-attention sublayer is followed by a second sublayer that again implements multi-head attention; here, the queries come from the previous decoder sublayer while the keys and values come from the output of the encoder. This sublayer is followed by a fully connected feed-forward network.
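The masking itself can be sketched as adding −∞ to the scores of future positions before the softmax, so that their attention weights become zero (reusing the softmax helper from the earlier sketch; the function names are my own):

```python
import numpy as np

def causal_mask(seq_len):
    """Strict upper-triangular mask: position i may only attend to positions 0..i."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])  # future positions -> -inf
    weights = softmax(scores, axis=-1)                         # -inf scores get zero weight
    return weights @ V                                         # weighted sum of the values
```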