Attention in Deep Learning

Attention is one of the most important ideas in the Deep Learning community. Although this mechanism is now used in a variety of problems, such as image captioning, it was originally designed in the context of neural machine translation using Seq2Seq models.

Seq2Seq model

The seq2seq model is normally built as an encoder-decoder architecture, in which the encoder processes the input sequence and encodes / compresses the information into a context vector (or “thought vector”) of fixed length. This representation should be a good summary of the complete input sequence. The decoder is then initialized with this context vector and starts producing the transformed or translated output from it.
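To make the fixed-length bottleneck concrete, here is a toy sketch, not a real seq2seq model. The function name encode, the weights W_IN and W_REC, and the input numbers are all made up for illustration; the only point is that the resulting vector has the same size regardless of how long the input is.

```python
import math

# Hypothetical fixed weights, for illustration only; a real encoder learns these.
W_IN = [0.5, -0.3, 0.8]   # one input weight per hidden unit
W_REC = 0.9               # shared recurrent weight

def encode(inputs):
    """Fold a sequence of numbers into one fixed-length hidden state."""
    hidden = [0.0] * len(W_IN)
    for x in inputs:
        # Each step mixes the previous hidden state with the current input.
        hidden = [math.tanh(W_REC * h + w * x) for h, w in zip(hidden, W_IN)]
    return hidden  # plays the role of the context vector

short_context = encode([1.0, 2.0])
long_context = encode([1.0, 2.0, 0.5, -1.0, 3.0, 0.2, 1.5])
print(len(short_context), len(long_context))  # 3 3
```

Both sequences end up squeezed into three numbers, which is why a long input has to lose information somewhere.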

Disadvantage of the Seq2Seq model

A critical disadvantage of this fixed-length context vector design is the system's inability to retain information from longer sequences. It often forgets the earlier elements of the input sequence by the time it has processed the complete sequence. The attention mechanism was created to solve this problem of long-range dependencies.

Graph showing the trend of the BLEU score as the length of a sentence varies.

BLEU (Bilingual Evaluation Understudy) is a score for comparing a candidate translation of a text with one or more reference translations. The graph above shows that the encoder-decoder model works well for shorter sentences (high BLEU score) but cannot memorize an entire long sentence.

The idea behind attention

Attention was presented by Dzmitry Bahdanau et al. in the 2014 paper "Neural Machine Translation by Jointly Learning to Align and Translate", which reads as a natural extension of their previous work on the encoder-decoder model. This same paper laid the foundation for the famous "Attention Is All You Need" paper by Vaswani et al. on transformers, which revolutionized the field of deep learning with the concept of processing words in parallel instead of in sequence.

So, going back, the central idea is that whenever the model predicts an output word, it uses only the parts of the input where the most relevant information is concentrated, instead of the whole sequence. In simpler words, it pays attention to just a few input words.

Attention is an interface that connects the encoder and the decoder and provides the decoder with information from every hidden state of the encoder. With this framework, the model is able to selectively focus on valuable parts of the input sequence and thus learn the association between them. This helps the model cope efficiently with long input sentences.

Intuition

The idea is to keep the decoder as it is and simply replace the sequential (unidirectional) RNN / LSTM in the encoder with a bidirectional RNN / LSTM.
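As a rough sketch of what "bidirectional" means here, the toy code below runs the same one-unit recurrent cell over the input left to right and right to left, then pairs the two states for each position. The names run_rnn and bidirectional_annotations and the weights are hypothetical, chosen only for illustration; a real encoder uses learned, vector-valued states.

```python
import math

def run_rnn(inputs, w_in=0.5, w_rec=0.9):
    """Toy one-unit RNN: returns the hidden state after every input."""
    hidden, states = 0.0, []
    for x in inputs:
        hidden = math.tanh(w_rec * hidden + w_in * x)
        states.append(hidden)
    return states

def bidirectional_annotations(inputs):
    forward = run_rnn(inputs)
    # Read the sequence right to left, then flip back so positions line up.
    backward = run_rnn(inputs[::-1])[::-1]
    # The annotation for position j pairs the forward and backward states.
    return list(zip(forward, backward))

annotations = bidirectional_annotations([1.0, -2.0, 0.5, 3.0])
print(len(annotations))  # 4
```

Each annotation now summarizes both the words before and the words after its position, which is what the attention weights are later applied to.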

Here, we pay attention to a few words within a window of size Tx (say, four words x1, x2, x3 and x4). Using these four words, we create a context vector c1, which is supplied as an input to the decoder. Similarly, we create a context vector c2 using these four words. The words are weighted by α1, α2, α3 and so on, and the weights within a window sum to 1.

Similarly, we create context vectors from different groups of words with different α values.

The attention model calculates a set of attention weights denoted by α(i, 1), α(i, 2), ..., α(i, Tx), because not all inputs are equally useful in generating the corresponding output. The context vector ci for the output word yi is generated as the weighted sum of the annotations hj: ci = α(i, 1) h1 + α(i, 2) h2 + ... + α(i, Tx) hTx.

Attention weights are calculated by normalizing the output scores of a feed-forward neural network, described by a function that captures the alignment between the input at position j and the output at position i.

Mathematical formula of weights in the attention model.
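For reference, in the notation of the Bahdanau et al. paper cited earlier, the weights can be written as a softmax over alignment scores. The symbols e_{ij} (alignment score), a (the feed-forward alignment model) and s_{i-1} (the previous decoder hidden state) come from that paper rather than from the text above, and T_x is taken here to be the number of input annotations.

```latex
e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}
```

The denominator is what makes the weights for a given output position i sum to 1.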

Implementation

Let's take an example where a translator reads a sentence in English (the input language) while writing down keywords from start to finish, and then starts translating into Portuguese (the output language). While translating each English word, the translator makes use of the keywords written down.

Attention places a different amount of focus on different words by assigning each word a score. Then, using the softmaxed scores, we aggregate the hidden states of the encoder with a weighted sum to get the context vector.

The implementation of an attention layer can be broken down into the following steps.

Step 0: Prepare the hidden states.

First, prepare all available encoder hidden states and the first decoder hidden state. In our example, we have 4 encoder hidden states and the current decoder hidden state. (Note: The last consolidated encoder hidden state is provided as input to the first time step of the decoder. The output of this first time step of the decoder is called the first hidden state of the decoder.)

Step 1: Get a score for each hidden encoder state.

A (scalar) score is obtained from a scoring function (also known as an alignment score function or alignment model). In this example, the score function is a dot product between the decoder hidden state and each encoder hidden state.
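A minimal sketch of this scoring step, assuming 4 encoder hidden states of size 3. The numbers and the helper name dot are made up purely for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Illustrative values only: 4 encoder hidden states and 1 decoder hidden state.
encoder_states = [[0, 1, 1], [5, 0, 1], [1, 1, 0], [0, 5, 1]]
decoder_state = [10, 5, 10]

# One scalar score per encoder hidden state.
scores = [dot(decoder_state, h) for h in encoder_states]
print(scores)  # [15, 60, 15, 35]
```

The second encoder state gets the highest score, so it is the one the decoder will attend to most.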

Step 2: Run all scores through a softmax layer.

We pass the scores through a softmax layer so that the softmaxed (scalar) scores add up to 1. These softmaxed scores represent the distribution of attention.
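The softmax step can be sketched as follows. The input scores are illustrative values, and subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the description above.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([15, 60, 15, 35])  # illustrative scores
print(weights)
```

With scores this far apart, almost all of the attention lands on the second entry; scores that are closer together give a softer distribution.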

Step 3: Multiply each hidden state of the encoder by its softmax score.

By multiplying each hidden state of the encoder by its softmaxed (scalar) score, we get the alignment vector or annotation vector. This is exactly the mechanism by which alignment takes place.
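A small sketch of this scaling step. The weights below are hypothetical softmax outputs picked so that the arithmetic is easy to follow; they are not computed from the encoder states.

```python
# Illustrative values only.
encoder_states = [[0, 1, 1], [5, 0, 1], [1, 1, 0], [0, 5, 1]]
weights = [0.125, 0.5, 0.125, 0.25]  # hypothetical softmax scores, sum to 1

# Scale every component of each hidden state by that state's weight.
alignment_vectors = [[w * x for x in h] for w, h in zip(weights, encoder_states)]
print(alignment_vectors[1])  # [2.5, 0.0, 0.5]
```

States with a small weight are shrunk towards zero, so they contribute little to what the decoder sees.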

Step 4: Sum the alignment vectors.

The alignment vectors are added together to produce the context vector. The context vector thus aggregates the information in the alignment vectors from the previous step.
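Putting steps 1 to 4 together, the whole computation of a context vector can be sketched in a few lines. The function name attention_context and the input values are assumptions made for this example, and dot-product scoring is used because that is the score function chosen in step 1.

```python
import math

def attention_context(decoder_state, encoder_states):
    """Return (attention weights, context vector) for one decoder step."""
    # Step 1: dot-product score for each encoder hidden state.
    scores = [sum(d * h for d, h in zip(decoder_state, state))
              for state in encoder_states]
    # Step 2: softmax, shifted by the max score for numerical stability.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # Step 3: scale each encoder hidden state by its weight.
    alignment_vectors = [[w * x for x in state]
                         for w, state in zip(weights, encoder_states)]
    # Step 4: element-wise sum of the alignment vectors.
    context = [sum(column) for column in zip(*alignment_vectors)]
    return weights, context

# Illustrative values only.
weights, context = attention_context(
    [10, 5, 10], [[0, 1, 1], [5, 0, 1], [1, 1, 0], [0, 5, 1]])
print(context)
```

Because the second encoder state dominates the scores in this example, the context vector comes out very close to that state, [5, 0, 1].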

Step 5: Insert the context vector into the decoder.

Types of attention

Depending on how many source states contribute during the derivation of the attention vector (α), there can be three types of attention mechanisms:

Global attention: when the focus is on all the source states. In global attention, we need as many weights as there are words in the source sentence.

Local attention: when the focus is on only some of the source states.

Hard attention: when the focus is on one source state only.
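The difference between the three types can be sketched in terms of which source states receive a non-zero weight. This is a simplified illustration under several assumptions: the function names, the window parameters center and half_width, and the scores are hypothetical, and picking the highest-scoring state in hard_weights is a simplification chosen for this sketch.

```python
import math

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def global_weights(scores):
    """Every source state gets a weight."""
    return softmax(scores)

def local_weights(scores, center, half_width):
    """Only source states inside a window around `center` get a weight."""
    lo = max(0, center - half_width)
    hi = min(len(scores), center + half_width + 1)
    window = softmax(scores[lo:hi])
    return [0.0] * lo + window + [0.0] * (len(scores) - hi)

def hard_weights(scores):
    """A single source state gets all the weight (here: the top-scoring one)."""
    best = max(range(len(scores)), key=lambda j: scores[j])
    return [1.0 if j == best else 0.0 for j in range(len(scores))]

scores = [1.0, 3.0, 0.5, 2.0]  # illustrative alignment scores
print(global_weights(scores))
print(local_weights(scores, center=1, half_width=1))
print(hard_weights(scores))  # [0.0, 1.0, 0.0, 0.0]
```

In all three cases the weights sum to 1; what changes is how many source states are allowed to share that total.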

For the more curious, there is a very interesting notebook that reproduces the concept of attention using TensorFlow.