4. Attention Layers

Attention is a concept in machine learning and AI that goes back many years, especially in computer vision[BP97]. Like the term “neural network”, attention was inspired by how human brains deal with the massive amount of visual and audio input[TG80]. Attention layers are deep learning layers that evoke the idea of attention. You can read more about attention in deep learning in Luong et al. [LPM15]. Attention layers have been empirically shown to be so effective in modeling sequences, like language, that they have become indispensable[VSP+17]. The most common place you’ll see attention layers is in transformer neural networks that model sequences. We’ll also see attention in graph neural networks, and attention layers are common in computer vision.

Attention layers are fundamentally a weighted mean reduction. Attention removes one axis from an input tensor by computing a mean. Attention is unusual among layers because it takes three inputs, whereas most layers in deep learning take just one or perhaps two. These inputs are called the query, the values, and the keys. The reduction occurs over the values; so if the values are rank 3, the output will be rank 2. The query should be one rank less than the keys. The keys should be the same rank as the values. The keys and query determine how to weight the values.

The table below summarizes these three input arguments and the output. Note that the query is often batched; if so, its rank will be 2 and the output’s rank will also be 2.







| Input/Output | Shape | Purpose | Example |
|---|---|---|---|
| Query | (# of attn features) | input for checking against keys | One word represented as feature vector |
| Keys | (sequence length, # of attn features) | used to compute attention against query | All words in sentence represented as matrix of feature vectors |
| Values | (sequence length, # of value features) | used to compute value of output | A vector of numbers for each word in a sentence |
| Output | (# of value features) | attention-weighted mean over values | single vector |
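To make the shapes concrete, here is a minimal NumPy sketch. The sizes and the uniform placeholder weights are assumptions chosen for illustration; in a real attention layer the weights come from the query and keys. The only point is that a weighted mean over the sequence axis of the values removes that axis.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, n_attn, n_val = 6, 3, 2  # illustrative sizes (assumed)

query = rng.normal(size=(n_attn,))           # rank 1
keys = rng.normal(size=(seq_len, n_attn))    # rank 2
values = rng.normal(size=(seq_len, n_val))   # rank 2

# placeholder weights: any non-negative vector that sums to 1 works here;
# the attention mechanism would compute these from the query and keys
weights = np.full(seq_len, 1.0 / seq_len)

# weighted mean over axis 0 (the sequence axis) of the values
output = np.average(values, axis=0, weights=weights)
print(query.shape, keys.shape, values.shape, output.shape)
```

With uniform weights this is just the ordinary mean over the sequence; attention replaces the uniform weights with ones that depend on the query.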

4.1. Example

Attention is best conceptualized as operating on a sequence. Let’s use a sentence like “The sleepy child reads a book”. The words in the sentence correspond to the keys. If we represent our words as embeddings, our keys will be rank 2. For example, the word “sleepy” might be represented by an embedding vector of length 2: \([-0.2, 0.4]\), where these embedding values are trained or taken from a standard language embedding. By convention, the zeroth axis of keys will be the position in the sequence and the first axis contains these vectors. The query is often an element from the keys, like the word “book”. The point of attention is to see what parts of the sentence the query should be influenced by. “Book” should have strong attention on “child” and “reads”, but probably not to “sleepy”. You’ll see soon that we will actually compute this as a vector, called the attention vector \(\vec{b}\). The output from the attention layer will be a reduction over the values where each element of values is weighted by the attention between the query and the key. Thus there should be one key and one value for each element in our sentence. The values could be identical to the keys, which is common.

Let’s see how this looks mathematically. The attention layer consists of two steps: (1) computing the attention vector \(\vec{b}\) using the attention mechanism and (2) the reduction over the values using the attention vector \(\vec{b}\). Attention mechanism is a fancy word for the attention equation. Consider our example above. We’ll use a 3 dimensional embedding for our words:

| Word | Embedding |
|---|---|
| The | 0, 0, 0 |
| sleepy | 2, 0, 1 |
| child | 1, -1, -2 |
| reads | 2, 3, 1 |
| a | -2, 0, 0 |
| book | 0, 2, 1 |

The keys will be a rank 2 tensor (matrix) putting all these together. Note that these are only integers to make this example clearer, typically words are represented with floating point numbers when embedded.

(4.3)\[\begin{equation} \mathbf{K} = \left[ \begin{array}{lccccr} 0 & 2 & 1 & 2 & -2 & 0\\ 0 & 0 & -1 & 3 & 0 & 2\\ 0 & 1 & -2 & 1 & 0 & 1\\ \end{array}\right] \end{equation}\]

The keys are shape \((6, 3)\) because our sentence has 6 words and each word is represented with a 3 dimensional embedding vector. Note that (4.3) is written with one word per column, so the matrix as displayed is the transpose of this \((6, 3)\) array. Let’s make our values simple: we’ll have one for each word. These values are what determine our output. Perhaps they could be the sentiment of the word: is it a positive word (happy) or a negative word (angry).

(4.4)\[\begin{equation} \mathbf{V} = \left[ 0, -0.2, 0.3, 0.4, 0, 0.1\right] \end{equation}\]

Note that the values \(\mathbf{V}\) should be the same rank as the keys, so its shape is interpreted as \((6, 1)\). Finally, we need the query, which should be one rank less than the keys. Our query is the word “book”:

(4.5)\[\begin{equation} \vec{q} = \left[0, 2, 1\right] \end{equation}\]

4.2. Attention Mechanism Equation

The first equation is the attention mechanism. This uses query and keys arguments only. It outputs a tensor one rank less than the keys, giving a scalar for each key corresponding to the attention the query should have for the key. The attention vector should be normalized! Usually this is achieved by doing a softmax. The specific equation used is a hyperparameter, like activation functions. In practice, most use a dot product followed by a softmax:

(4.6)\[\begin{equation} \vec{b} = \mathrm{softmax}\left(\vec{q}\cdot \mathbf{K}\right) = \mathrm{softmax}\left(\sum_j q_j k_{ij}\right) \end{equation}\]

where there could be more indices depending on the rank of the inputs. What we’re doing is taking the dot product of each key with the query. In our example, this would be:

(4.7)\[\begin{equation} \vec{b} = \mathrm{softmax}\left(\left[0, 2, 1\right] \times \left[ \begin{array}{lccccr} 0 & 2 & 1 & 2 & -2 & 0\\ 0 & 0 & -1 & 3 & 0 & 2\\ 0 & 1 & -2 & 1 & 0 & 1\\ \end{array}\right]\right) = \mathrm{softmax}\left( \left[0, 1, -4, 7, 0, 5\right]\right) \end{equation}\]
(4.8)\[\begin{equation} \vec{b} = \left[0, 0, 0, 0.88, 0, 0.12\right] \end{equation}\]

I’ve rounded the numbers here, but essentially the attention vector only gives weight to the verb “reads” and to the word itself (“book”). I made this up, remember, but it gives you an idea of how attention gives you a way to connect words. It may even remind you of our graph neural network’s idea of neighbors.
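The arithmetic above can be checked with a few lines of NumPy. This is a minimal sketch that assumes the columns of \(\mathbf{K}\) in (4.3) follow the word order of the sentence; the keys are stored here as a \((6, 3)\) array with one row per word.

```python
import numpy as np

# embeddings from the worked example, one row per word: shape (6, 3)
keys = np.array([
    [0, 0, 0],    # The
    [2, 0, 1],    # sleepy
    [1, -1, -2],  # child
    [2, 3, 1],    # reads
    [-2, 0, 0],   # a
    [0, 2, 1],    # book
], dtype=float)
query = np.array([0, 2, 1], dtype=float)  # "book"

scores = keys @ query                       # one dot product per key
b = np.exp(scores) / np.exp(scores).sum()   # softmax over the sequence
print(scores)
print(b.round(2))
```

The scores reproduce \([0, 1, -4, 7, 0, 5]\), and after rounding to two decimals the attention vector matches (4.8).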

4.3. Attention Reduction

After computing the attention vector \(\vec{b}\), we use it to compute a weighted mean over the values. The output of the attention layer is then

(4.9)\[\begin{equation} \mathbf{V}\vec{b} = \left[0, 0, 0, 0.88, 0, 0.12\right]^ T \left[ 0, -0.2, 0.3, 0.4, 0, 0.1\right] = 0.36 \end{equation}\]

Conceptually, our example computed the attention-weighted sentiment of the query word “book” in our sentence. You can see that attention layers do two things: compute an attention vector with the attention mechanism and then use it to take the attention-weighted average over the values.
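As a quick check of the reduction step, the sketch below takes the rounded attention vector from (4.8) and the values from (4.4) and computes their dot product. Using the rounded vector is an assumption made to keep the example short; the unrounded softmax gives nearly the same result.

```python
import numpy as np

# rounded attention vector from (4.8) and values from (4.4)
b = np.array([0, 0, 0, 0.88, 0, 0.12])
values = np.array([0, -0.2, 0.3, 0.4, 0, 0.1])

# attention-weighted mean over the values
output = b @ values
print(round(float(output), 2))
```

Only “reads” and “book” contribute, so the output is \(0.88 \times 0.4 + 0.12 \times 0.1 = 0.364\), which rounds to the 0.36 shown above.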

4.4. Soft vs Hard Attention

Just like activation functions, there is an enormous range of choices for the attention mechanism. One distinguishing attribute is whether a softmax or a “hardmax” is used. Recall a softmax converts real numbers into probabilities by viewing the real numbers as log-odds (logs of probability ratios). A “hardmax” means just taking the max, but the output is shaped like a probability distribution: all of the weight goes to the largest element. Using hard attention means that you only return the value which had the maximum output from the attention mechanism, instead of taking a weighted average. You can view this as just another attention mechanism type, but you’ll see some literature discussing hard vs soft attention.
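The difference can be sketched with the scores and values from the sentence example. The one-hot construction below is one simple way to write a “hardmax”; it is an illustrative assumption, not a reference implementation.

```python
import numpy as np

scores = np.array([0, 1, -4, 7, 0, 5], dtype=float)  # q . K from the example
values = np.array([0, -0.2, 0.3, 0.4, 0, 0.1])

# soft attention: softmax weights, then a weighted average of the values
soft_b = np.exp(scores) / np.exp(scores).sum()
soft_out = soft_b @ values

# hard attention: a one-hot vector at the largest score ("hardmax")
hard_b = np.zeros_like(scores)
hard_b[np.argmax(scores)] = 1.0
hard_out = hard_b @ values  # picks out the value at the arg max

print(soft_out, hard_out)
```

Hard attention returns exactly the value for “reads” (0.4), while soft attention blends in a small contribution from the other words.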

4.5. Tensor-Dot

The most common attention mechanism is a dot product (called tensor-dot to be more general) followed by a softmax [LPM15]. The dot product is divided by the square root of the dimension of the keys (last axis dimension). Remember the keys are not normalized. If they are random numbers, the magnitude of the output from the dot product scales with the square root of the dimension of the keys due to the central limit theorem. This can make the softmax behave poorly, since you’re taking \(e^{\vec{q} \cdot \mathbf{K}}\) and large magnitudes push nearly all of the weight onto a single key. Putting this all together, the equation is:

(4.10)\[\begin{equation} \vec{b} = \mathrm{softmax}\left(\frac{1}{\sqrt{d}}\vec{q}\cdot \mathbf{K}\right) \end{equation}\]

where \(d\) is the dimension of the query vector.
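The scaling argument can be checked numerically. The sketch below assumes the query and key entries are independent standard normal numbers (an idealization) and compares the spread of raw and scaled dot products for a few feature dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 2000
raw_std, scaled_std = {}, {}

for d in (4, 64, 256):
    q = rng.normal(size=(n_samples, d))
    k = rng.normal(size=(n_samples, d))
    dots = np.sum(q * k, axis=1)  # one dot product per sample
    raw_std[d] = dots.std()
    scaled_std[d] = (dots / np.sqrt(d)).std()
    print(d, raw_std[d], scaled_std[d])
```

Under these assumptions the raw standard deviation grows roughly like \(\sqrt{d}\), while the scaled one stays near 1 regardless of \(d\), which keeps the softmax inputs in a reasonable range.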

4.6. Self-Attention

Remember how everything is batched in deep learning? The batched input to an attention layer is usually the query. So although in the above discussion it was a tensor of one rank less than the keys (typically a query vector), once it has been batched it will be the same rank as the keys. Almost always, the query is in fact equal to the keys. Like in our example, our query was the embedding vector for the word “book”, which is one of the keys. If you consider the query to be batched so that you consider every word in the sentence, the query becomes equal to the keys. A further special case is when the query, values and keys are equal. This is called self-attention. This just means our attention mechanism uses the values directly and there is no extra set of “keys” input to the layer.

4.7. Trainable Attention

There are no trainable parameters in our definitions above yet. How can you do learning with attention? Typically, you don’t. There are exceptions like using a trainable attention mechanism (like a dense layer), but in most definitions of attention you do not have any trainable parameters.

4.8. Multi-head Attention Block

Inspired by the idea of convolutions with multiple filters, there is a block (group of layers) that splits into multiple parallel attention layers. This is called “multi-head attention”. If your values are shape \((L, V)\), a single query vector will give you back a \((H, V)\) tensor, where \(H\) is the number of parallel attention layers (heads). If there are no trainable parameters in attention layers, what’s the point of this though? Well, you must introduce weights. These are square weight matrices because we need all shapes to remain constant among all the attention heads.

Consider an attention layer to be defined by \(A(\vec{q}, \mathbf{K}, \mathbf{V})\). The multi-head attention is

(4.11)\[\begin{equation} \left[A(\mathbf{W}_q^0\vec{q}, \mathbf{W}_k^0\mathbf{K}, \mathbf{W}_v^0\mathbf{V}), A(\mathbf{W}_q^1\vec{q}, \mathbf{W}_k^1\mathbf{K}, \mathbf{W}_v^1\mathbf{V}), \ldots, A(\mathbf{W}_q^{H-1}\vec{q}, \mathbf{W}_k^{H-1}\mathbf{K}, \mathbf{W}_v^{H-1}\mathbf{V})\right] \end{equation}\]

where each element of the output vector \([\ldots]\) is itself an output from an attention layer. With a batched query of length \(L\), as in self-attention, each of the \(H\) outputs is an \((L, V)\) shaped tensor, so the whole output is an \((H, L, V)\) tensor. The most famous example of the multi-head attention block is in transformers[VSP+17], where they use self-attention multi-head attention blocks.

Typically we apply multiple sequential blocks of attention, so we need the values input to the next block to be of rank 2 again instead of the \((H, L, V)\) tensor. Thus the output from the multi-head attention is often reduced by matrix multiplication with an \((H, V, V)\) weight tensor or a \((H)\) tensor of weights. If this seems confusing, see the example below.


4.10. Code Examples

Let’s see how attention can be implemented in code. I will use random variables here for the different quantities but I will indicate which variables should be trained with w_ and which should be inputs with i_.

4.10.1. Tensor-Dot Mechanism

We’ll begin with implementing the tensor-dot attention mechanism first. As an example, we’ll use a sequence length of 11 and a keys feature length of 4 and a values feature dimension of 2. Remember the keys and query must share feature dimension size.

import numpy as np

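def softmax(x, axis=None):
    # keepdims=True so the normalizer broadcasts along the reduced axis;
    # subtracting the max avoids overflow and does not change the result
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)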

def tensor_dot(q, k):
    b = softmax( (k @ q) / np.sqrt(q.shape[0]) )
    return b

i_query = np.random.normal(size=(4,))
i_keys = np.random.normal(size=(11,4))

b = tensor_dot(i_query, i_keys)
print('b = ', b)
b =  [0.0561316  0.03437597 0.0576251  0.00796326 0.1215294  0.09160425
 0.12627593 0.04554399 0.12371525 0.00857096 0.32666429]

As expected, we get out a vector \(\vec{b}\) whose sum is 1.

4.10.2. General Attention

Now let’s put this attention mechanism into an attention layer.

def attention_layer(q, k, v):
    b = tensor_dot(q, k)
    return b @ v

i_values = np.random.normal(size=(11, 2))
attention_layer(i_query, i_keys, i_values)
array([-0.55615714, -0.19903899])

We get two values, one for each feature dimension.

4.10.3. Self-attention

The change in self-attention is that we make queries, keys, and values equal. We need to make a small change in that the queries are batched in this setting, so we should get a rank 2 output.

def batched_tensor_dot(q, k):
    # a will be batch x seq, which is N x N here
    # batched dot product in einstein notation,
    # scaled by the square root of the feature dimension (last axis)
    a = np.einsum('ij,kj->ik', q, k) / np.sqrt(q.shape[-1])
    # now we softmax over sequence
    b = softmax(a, axis=1)
    return b

def self_attention(x):
    b = batched_tensor_dot(x, x)
    return b @ x

i_batched_query = np.random.normal(size=(11, 4))
self_attention(i_batched_query)
array([[ 0.30878115, -0.04327977,  0.57134653, -0.03684736],
       [-0.56274619,  0.38784149,  0.20365391, -0.16449202],
       [-0.29167447, -0.63071927,  0.94978182,  0.30294958],
       [-0.13582236,  0.40387523,  0.08585396, -0.16907524],
       [ 0.57223327,  0.76268387, -0.12340202, -0.50117289],
       [ 0.31921624, -0.47405222,  1.16163757,  0.36470874],
       [-0.60644727,  0.50156489,  0.15035922, -0.10066329],
       [ 0.42567712,  0.30208839,  0.26828409,  0.02105865],
       [ 0.40025769, -0.17862679,  0.6578048 , -0.17713321],
       [ 0.15288864,  0.32904545,  0.21296375,  0.01957569],
       [-0.34273484,  0.15013349,  0.36809094,  0.0763516 ]])

We are given as output an \(11\times4\) matrix, which is correct.
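A useful sanity check for batched attention is that each query’s attention vector sums to 1. The sketch below is self-contained and uses its own helper definitions (assumptions for illustration, written to mirror the functions above); note the `keepdims=True` in the softmax, which makes the normalizer broadcast along the sequence axis.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability; keepdims keeps shapes aligned
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(11, 4))

a = x @ x.T / np.sqrt(x.shape[-1])  # (11, 11) scaled scores
b = softmax(a, axis=1)              # row i is the attention vector for query i
out = b @ x                         # (11, 4) self-attention output

print(b.sum(axis=1))
print(out.shape)
```

Every row of `b` should sum to 1 up to floating point error, and the output keeps the \(11\times4\) shape of the input.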

4.10.4. Adding Trainable Parameters

You can add trainable parameters to these steps by adding a weight matrix. Let’s do this for the self-attention. Although keys, values, and query are equal in self-attention, I can multiply them by different weights. Just to demonstrate, I’ll have the values change to feature dimension 2.

# weights should be input feature_dim -> desired output feature_dim
w_q = np.random.normal(size=(4, 4))
w_k = np.random.normal(size=(4, 4))
w_v = np.random.normal(size=(4, 2))

def trainable_self_attention(x, w_q, w_k, w_v):
    q = x @ w_q
    k = x @ w_k
    v = x @ w_v
    b = batched_tensor_dot(q, k)
    return b @ v
trainable_self_attention(i_batched_query, w_q, w_k, w_v)
array([[ 0.02642002,  0.45607013],
       [ 1.61948363,  0.44626552],
       [ 4.90417484, -2.72052858],
       [-0.22613452,  1.94504019],
       [-1.73803735,  3.09409044],
       [-0.32432721,  1.7936397 ],
       [ 0.45300564,  1.68677163],
       [-0.27793752,  0.87156272],
       [ 0.33565028, -0.08632834],
       [-0.17481142,  0.94487507],
       [ 0.528007  ,  0.36559953]])

Since we had our values change to feature dimension 2 with the weights, we get out an \(11\times 2\) output.

4.10.5. Multi-head

The only change for multi-head attention is that we have one set of weights for each head and we agree on how to combine after running through the heads. I’ll just use a length \(H\) vector of trainable weights. Other strategies are to concatenate them or use a reduction (e.g., mean, max).

w_q_h1 = np.random.normal(size=(4, 4))
w_k_h1 = np.random.normal(size=(4, 4))
w_v_h1 = np.random.normal(size=(4, 2))
w_q_h2 = np.random.normal(size=(4, 4))
w_k_h2 = np.random.normal(size=(4, 4))
w_v_h2 = np.random.normal(size=(4, 2))
w_h = np.random.normal(size=2)

def multihead_attention(x, w_q_h1, w_k_h1, w_v_h1, w_q_h2, w_k_h2, w_v_h2, w_h):
    h1_out = trainable_self_attention(x, w_q_h1, w_k_h1, w_v_h1)
    h2_out = trainable_self_attention(x, w_q_h2, w_k_h2, w_v_h2)
    # stack heads along a new last axis so we can use dot: shape (11, 2, H)
    all_h = np.stack((h1_out, h2_out), -1)
    # weighted sum over heads with the trainable length-H vector w_h
    return all_h @ w_h

multihead_attention(i_batched_query, w_q_h1, w_k_h1, w_v_h1, w_q_h2, w_k_h2, w_v_h2, w_h)
array([[-1.57582545e-01, -2.72474707e-01],
       [-2.01812233e-01, -6.86304867e-01],
       [ 1.94968935e+01,  2.12923329e+01],
       [-2.21560784e+00, -4.47338822e+00],
       [-8.03519930e+01, -1.44300594e+02],
       [ 9.60258895e+00,  1.06407379e+01],
       [ 1.06216313e+00,  9.12221488e-01],
       [-2.68612449e-01, -5.65886250e-01],
       [-1.34366442e+00, -2.27838290e+00],
       [ 5.59463428e-02, -1.11195364e-01],
       [ 8.12378057e+00,  8.71745505e+00]])

As expected, we do get an \(11\times2\) rank 2 output.
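The other reduction mentioned earlier, multiplying by an \((H, V, V)\) weight tensor, can be sketched with a single `einsum`. The random `head_outputs` array below is a stand-in for the stacked per-head outputs and `w_o` is a hypothetical name for the weight tensor; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
H, L, V = 2, 11, 2  # heads, sequence length, value features

# stand-in for the stacked per-head outputs, shape (H, L, V)
head_outputs = rng.normal(size=(H, L, V))
# one (V, V) matrix per head; these would be trained in a real model
w_o = rng.normal(size=(H, V, V))

# contract over heads h and per-head features v, leaving (L, V)
combined = np.einsum('hlv,hvw->lw', head_outputs, w_o)
print(combined.shape)
```

This is equivalent to applying each head’s matrix to that head’s output and summing over heads, so the result is again rank 2 and can feed the next block.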

4.11. Attention in Graph Neural Networks

Recall that the key attribute of a graph neural network is permutation invariance. We used reductions like sum or mean over neighbors as the way to make the graph neural network layers be permutation invariant. Attention layers are also permutation invariant! This has made attention a popular choice for how to aggregate neighbor information. Attention layers are good at finding important neighbors and so are important with high-degree graphs (lots of neighbors). This is rare in molecules, but you can just define all atoms to be connected and then put distances as the edge attributes. Recall that graph convolution layers (GCN layer), and most GNN layers, only allow information to propagate one-bond per layer. Thus joining all atoms and using attention can give you long-range communication without so many layers. The disadvantage is that your network must now learn this, so perhaps you can reduce model bias but at the cost of requiring more training data/having more model variance.
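The permutation invariance claim is easy to check numerically. The sketch below defines a small single-query attention function (a stand-in written for this check, using the scaled tensor-dot mechanism) and shuffles the keys and values together, as you would when reordering neighbors.

```python
import numpy as np

def attention(q, k, v):
    # scaled dot product, softmax over the sequence, weighted mean of values
    s = k @ q / np.sqrt(q.shape[-1])
    b = np.exp(s - s.max())
    b = b / b.sum()
    return b @ v

rng = np.random.default_rng(0)
q = rng.normal(size=4)
k = rng.normal(size=(11, 4))
v = rng.normal(size=(11, 2))

perm = rng.permutation(11)  # reorder the "neighbors"
out = attention(q, k, v)
out_perm = attention(q, k[perm], v[perm])
print(np.allclose(out, out_perm))
```

The output is unchanged because each value keeps its own weight; only the order of the terms in the sum changes.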

Let’s see how attention fits into the Battaglia equations[BHB+18]. Recall that the Battaglia equations are general standard equations for defining a GNN. Attention can appear in multiple places, but as discussed above it appears when considering neighbors. Specifically, the query will be the \(i\)th node, and the keys/values will be some combination of neighboring node and edge features. There is no step in the Battaglia equations where this fits neatly, but we can split up the attention layer as follows. Most of the attention layer will fit into the edge update equation:

(4.12)\[\begin{equation} \vec{e}^{'}_k = \phi^e\left( \vec{e}_k, \vec{v}_{rk}, \vec{v}_{sk}, \vec{u}\right) \end{equation}\]

Recall that this is a general equation and our choice of \(\phi^e()\) defines the GNN. \(\vec{e}_k\) is the feature vector of edge \(k\), \(\vec{v}_{rk}\) is the receiving node feature vector for edge \(k\), \(\vec{v}_{sk}\) is the sending node feature vector for edge \(k\), and \(\vec{u}\) is the global graph feature vector. We will use this step for the attention mechanism, where the query is the receiving node \(\vec{v}_{rk}\) and the keys/values are composed of the sender and edge vectors. To be specific, we’ll use the approach from Zhang et al. [ZSX+18] with a tensor-dot mechanism. They only considered node features and set the keys and values to be identical to the node features. However, they put trainable parameters at each layer that translated the node features into the keys/query.

(4.13)\[\begin{equation} \vec{q} = \mathbf{W}_q\vec{v}_{rk} \end{equation}\]
(4.14)\[\begin{equation} \mathbf{K} = \mathbf{W}_k\vec{v}_{sk} \end{equation}\]
(4.15)\[\begin{equation} \mathbf{V} = \mathbf{W}_v\vec{v}_{sk} \end{equation}\]
(4.16)\[\begin{equation} \vec{b}_k = \mathrm{softmax}\left(\frac{1}{\sqrt{d}} \vec{q}\cdot \mathbf{K}\right) \end{equation}\]
(4.17)\[\begin{equation} \vec{e}^{'}_k = \vec{b}_k \mathbf{V} \end{equation}\]

Putting it compactly into one equation:

(4.18)\[\begin{equation} \vec{e}^{'}_k = \mathrm{softmax}\left(\frac{1}{\sqrt{d}} \mathbf{W}_q\vec{v}_{rk}\cdot \mathbf{W}_k\vec{v}_{sk}\right)\mathbf{W}_v\vec{v}_{sk} \end{equation}\]

Now we have weighted edge feature vectors from the attention. Finally, we sum over these edge features in the edge aggregation step.

(4.19)\[\begin{equation} \bar{e}^{'}_i = \rho^{e\rightarrow v}\left( E_i^{'}\right) = \sum E_i^{'} \end{equation}\]
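The edge update and aggregation for a single receiving node can be sketched as follows. The sizes, the random features, and the choice to store weights so that features are right-multiplied (the transpose of the \(\mathbf{W}\vec{v}\) convention in the equations) are all assumptions for illustration; this is not the implementation from Zhang et al.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neighbors, n_feat, d = 5, 4, 3  # illustrative sizes (assumed)

v_r = rng.normal(size=(n_feat,))              # receiving node i
v_s = rng.normal(size=(n_neighbors, n_feat))  # sending nodes, one row per edge k

# weights (trained in a real model), shaped (in_features, out_features)
w_q = rng.normal(size=(n_feat, d))
w_k = rng.normal(size=(n_feat, d))
w_v = rng.normal(size=(n_feat, n_feat))

q = v_r @ w_q  # query, shape (d,)
K = v_s @ w_k  # keys, shape (n_neighbors, d)
V = v_s @ w_v  # values, shape (n_neighbors, n_feat)

# attention over the edges of node i
s = K @ q / np.sqrt(d)
b = np.exp(s - s.max())
b = b / b.sum()

e_new = b[:, None] * V     # weighted edge feature vector for each edge k
e_bar = e_new.sum(axis=0)  # edge aggregation by summation
print(e_new.shape, e_bar.shape)
```

Summing the weighted edge features gives the same result as the attention-weighted mean `b @ V`, which is why the two steps together act as an attention layer over the neighbors.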

In Zhang et al. [ZSX+18], they used multi-headed attention as well. How would multi-headed attention work? Your edge feature matrix \( E_i^{'}\) now becomes an edge feature tensor, where axis 0 is edge (\(k\)), axis 1 is feature, and axis 2 is the head. Recall that the “head” just means which set of \(\mathbf{W}^h_q, \mathbf{W}^h_k, \mathbf{W}^h_v\) we used. To reduce the tensor back to the expected matrix, we simply use another weight matrix that maps from the last two axes (feature, head) down to features only. I will write out the indices explicitly to be clearer:

(4.20)\[\begin{equation} \bar{e}^{'}_{il} = \rho^{e\rightarrow v}\left( E_i^{'}\right) = \sum_k e_{ikjh}^{'}w_{jhl} \end{equation}\]

where \(j\) is the edge feature input index, \(l\) is the output edge feature index, and \(k,h,i\) are defined as before. The repeated indices \(j\) and \(h\) are summed over as well, since they do not appear on the left-hand side. Transformer is another name for a network built on multi-headed attention, so you’ll also see transformer graph neural networks [MDM+20].
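For a single node \(i\), the multi-head aggregation in (4.20) is one tensor contraction. The sketch below uses random arrays and illustrative sizes (assumptions), with the axis order described above: edge, feature, head.

```python
import numpy as np

rng = np.random.default_rng(0)
n_edges, n_in, n_heads, n_out = 5, 4, 3, 4  # illustrative sizes (assumed)

E = rng.normal(size=(n_edges, n_in, n_heads))  # e'_{kjh} for one node i
w = rng.normal(size=(n_in, n_heads, n_out))    # w_{jhl}, trained in a real model

# sum over edges k, input features j, and heads h, leaving output feature l
e_bar = np.einsum('kjh,jhl->l', E, w)
print(e_bar.shape)
```

The result has one entry per output feature, so the aggregated edge features are back to the shape expected by the rest of the GNN.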

4.12. Chapter Summary

  • Attention layers are inspired by human ideas of attention, but are fundamentally a weighted mean reduction.

  • The attention layer takes in three inputs: the query, the values, and the keys. These inputs are often identical, where the query is one key and the keys and the values are equal.

  • They are good at modeling sequences, such as language.

  • The attention vector should be normalized, which can be achieved using a softmax activation function, but the attention mechanism equation is a hyperparameter.

  • Attention layers compute an attention vector with the attention mechanism, and then reduce the values by computing the attention-weighted average.

  • Using hard attention (hardmax function) returns only the value with the maximum output from the attention mechanism.

  • The tensor-dot followed by a softmax is the most common attention mechanism.

  • Self-attention is achieved when the query, values, and the keys are equal.

  • Attention layers by themselves are not trainable.

  • A multi-head attention block is a group of layers that splits into multiple parallel attention layers.

4.13. Cited References


[BHB+18] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, and others. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

[ZSX+18] Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. GaAN: gated attention networks for learning on large and spatiotemporal graphs. arXiv preprint arXiv:1803.07294, 2018.

[BP97] Shumeet Baluja and Dean A. Pomerleau. Expectation-based selective attention for visual monitoring and control of a robot vehicle. Robotics and Autonomous Systems, 22(3):329–344, 1997. Robot Learning: The New Wave. URL: http://www.sciencedirect.com/science/article/pii/S0921889097000468, doi:10.1016/S0921-8890(97)00046-8.

[TG80] Anne M Treisman and Garry Gelade. A feature-integration theory of attention. Cognitive Psychology, 12(1):97–136, 1980.

[LPM15] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.

[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. 2017.

[MDM+20] Łukasz Maziarka, Tomasz Danel, Sławomir Mucha, Krzysztof Rataj, Jacek Tabor, and Stanisław Jastrzębski. Molecule attention transformer. arXiv preprint arXiv:2002.08264, 2020.