The Self-attention mechanism as shown in the paper is what will be covered in this post. This paper titled ‘A Structured Self-attentive Sentence Embedding’ is one of the best papers, IMHO, to illustrate the workings of the self-attention mechanism for Natural Language Processing. The structure of Self-attention is shown in the image below, courtesy of the paper:

Suppose one has an LSTM of dim ‘u’ and takes as input batches of sentences of size ‘n’ words. Each sentence can be represented as

$$
S = (w_1, w_2,....w_n)
$$
where each word is the result of and has a size corresponding to the output size of the embedding layer before it. The output of the hidden layer for a bidirectional LSTM is (n x 2u). $$h^{\rightarrow}_t , h^{\leftarrow}_t $$ both have a size of 'u' that is concatenated to get the bidirectional hidden state $$h_t$$ of size '2u'. For each 'n' word there is a corresponding hidden state, therefore we can can write a hidden state matrix 'H' of dimension (n x 2u) as
$$
H^T = (h_1, h_2....h_n)
$$
Now if we want to compute a vector of attention weights given by 'a', according to the paper we can compute this vector using the following formula
$$
\bold{a} = softmax( \bold{w}_{s2} \cdot tanh( \bold{W}_{s1} \cdot \bold{H}^T))
$$
where $$\bold{W}_{s1}$$ is a weight matrix of shape (d x 2u) and $$\bold{w_{s2}}$$ is a weight vector of shape (1 x d). Here 'd' is a hyperparameter that can be set at runtime.

__The two weight matrices are parameters whose values are updated during the training process__. Now if we work through the equations, we can see the dimensionality of each component that is displayed in square brackets next to each term below. $$ \bold{W_{s1}} [d \times 2u] \cdot \bold{H}^T [2u \times n] = \bold{X} [d \times n] $$ $$ \bold{Y} = tanh(\bold{X}) \longrightarrow \bold{W}_{s2} [1 \times d] \cdot \bold{Y} [d \times n] = \bar{\bold{Y}} [1 \times n] $$ $$ \bold{a} = softmax(\bar{\bold{Y}}) [1 \times n] $$ This ensures that the attention vector of weights 'a' is normalized. We can compute the scaled attention vector 'm' as shown below $$ \bold{m} [1 \times 2u] = \bold{a} [1 \times n] \cdot \bold{H} [n \times 2u] $$ The calculations are computed for each focus area in a sentence, this area of focus could be a phrase or it could be a word. In our work, we almost always assume the focus area is a word which implies that 'n' attention vectors of weights can be computed which transforms the scaled vector 'm' to a matrix 'M'. For the sake of generality, let us assume that we have 'r' focus areas in a sentence such that 'r' = 'n' for the specific case mentioned above. In literature, each pass over the focus areas is called a hop. So how does this affect our terms that we discussed above.Instead of a vector of weights for ‘a’ we get a matrix ‘A’ of size (r x n). This results from the following change to the weight vector

$$
\bold{w_{s2}} [1 \times d] \longrightarrow \bold{W_{s2}} [r \times d]
$$
and the resulting scaled attention vector 'm' becomes the scaled attention matrix 'M'
$$
\bold{m} [1 \times 2u] \longrightarrow \bold{M} [r \times 2u]
$$
since
$$
\bold{A} [r \times n] \cdot \bold{H} [n \times 2u] = \bold{M} [r \times 2u]
$$