2. Standard Layers

We will now see an overview of the enourmous diversity in deep learning layers. This survey is necessarily limited to standard layers and we begin without considering the key layers that enable deep learning of molecules and materials. Almost all the layers listed below came out of a model for a specific task and were not thought-up independently. That means that some of the layers are suited to specific tasks and often the nomenclature around that layer is targeted towards a specific kinds of data. The most common type is image data and we first begin with an overview of how image features are represented. An image is a rank 3 tensor with shape \((H, W, C)\) where \(H\) is the height of the image, \(W\) is the width, and \(C\) is the number of channels (typically 3 – red, green, blue). Since all training is in batches, the input features shape will be \((B, H, W, C)\). Often layers will discuss input as having a batch axis, some number of shape axes, and then finally a channel axis. The layers will then operate on perhaps only the channels or only the shape dimensions. The layers are all quite flexible, so this is not a limitation in practice, but it’s important to know when reading about layer types. Often the documentation or literature will mention batch number or channels and this is typically the first and last axes of a tensor, respectively.


Everything and nothing is batched in deep learning. Practically, data is always batched. Even if your data is not batched, the first axis input to a neural network is of unspecified dimension and called the batch axis. Many frameworks make this implicit, meaning if you say the output from one layer is shape \((4,5)\), it will be \((B, 4, 5)\) when you actually inspect data. So, all data is batched but often the math, frameworks, and documentation make it seem as if there is no batch axis.

2.1. Hyperparameters

We saw from the FC/Dense layer that we have to choose if we use bias, the activation, and the output shape. As we learn about more complex layers, there will be more choices. These choices begin to accumulate and in a neural network you may have billions of possible combinations of them. These chocies about shape, activation, initialization, and other layer arguments are called hyperparameters. They are parameters in the sense that they can be tuned, but they are not trained on our data so we call them hyperparameters to distinguish them from the “regular” parameters like value of weights and biases in the layers. The name is inherited from Bayesian statistics.

Choosing these hyperparameters is difficult and we typically rely on the body of existing literature to understand ranges of reasonable parameters. In deep learning, we usually are in a regime of hyperparameters which yield many trainable parameters (deep networks) and thus our models can represent any function. Our models are expressive. However, optimizing hyperparameters makes training faster and/or require less data. For example, papers have shown that carefully choosing the initial value of weights can be more effective than complex architecture [GB10]. Another example found that convolutions, which are thought to be the most important layer for image recognition, are not necessary if hyperparameters are chosen correctly for dense neural networks[CMGS10]. This is now changing, with options for tuning hyperparameters, but the current state-of-the art is to take hyperparameters from previous work as a starting guess and change a little if you believe it is needed.

2.1.1. Validation

The number of hyperparameters is high-enough that overfitting can actually occur by choosing hyperparameters that minimize error on the test set. This is surprising because we don’t explicitly train hyperparameters. Nevertheless, you will find in your own work that if you use the test data extensively in hyperparameter tuning and for assessing overfitting of the regular training parameters, your performance will be overfit to the testing data. To combat this, we split our data three ways in deep learning:

  1. Training data: used for trainable parameters.

  2. Validation data: used to choose hyperparameters or measure overfitting of training data

  3. Test data: data not used for anything except final reported error

To clean-up our nomenclature here, we use the word genaralization error to refer to performance on a hypothetical infinite stream of unseen data. So regardless of if you split three-ways or use other approaches, generalization error means error on unseen data.

2.1.2. Tuning

So how do you tune hyperparameters? The main answer is by-hand, but this is an active area of research. Hyperparameters are continuous (e.g., regularization strength), categorical (e.g., which activation), and discrete variables (e.g., number of layers). One category of ways to tune hyperparameters is a topic called meta-learning[FAL17], which aims to learn hyperparameters by looking at multiple related datasets. Another area is auto-machine learning (auto-ML)[ZL17], where optimization strategies that do not require derivatives can tune hyperparameters. An important category of optimization related to hyperparameter tuning is multi-armed bandit optimization where we explicitly treat the fact that we have a finite amount of computational resources for tuning hyperparameters[LJD+18]. Specific implementations of these methods can be found in many ML frameworks now, for example Keras-Tuner for Keras or Ray Tune for PyTorch.

2.2. Common Layers

Now that we have some understanding of hyperparameters and their role, let’s now survey the common types of layers.

2.2.1. Convolutions

You can find a more thorough overview of convolutions here and here with more visuals. Here is a nice video on this. Convolutions are the most commonly used input layer when dealing with images or other data defined on a regular grid. In chemistry, you’ll see convolutions on protein or DNA sequences, on 2D imaging data, and occasionally on 3D spatial data like average density from a molecular simulation. What makes a convolution different from a dense layer is that the number of trainable weights is more flexible than input grid shape \(\times\) output shape, which is what you would get with a dense layer. Since the trainable parameters doesn’t depend on the input grid shape, you don’t learn to depend on location in the image. This is important if you’re hoping to learn something independent of location on the input grid – like if a specific object is present in the image independent of where it is located.

In a convolution, you specify a kernel shape that defines the size of trainable parameters. The kernel shape defines a window over your input data in which a dense neural network is applied. The rank of the kernal shape is the rank of your grid + 1, where the extra axis accounts for channels. For example, for images you might define a kernel shape of \(5\times5\). The kernel shape will become \(5\times5\times{}C\), where \(C\) is the number of channels. When referring to a convolution as 1D, 2D, or 3D, we’re referring to the grid of the input data and thus the kernel shape. A 2D convolution actually has an input of rank 4 tensors, the extra 2 axes accounting for batch and channels. The kernel shape of \(5\times5\) means that the output of a specific value in the grid will depend on its 24 nearest neighboring pixels (2 in each direction). Note that the kernel is used like a normal dense layer – it can have bias (dimension \(C\)), output activation, and regularization.

Practically, convolutions are always grouped in parallel. You have a set of \(F\) kernels, where \(F\) is called the number of filters. Each of these filters is completely independent and if you examine what they learn, some filters will learn to identify squares and some might learn to identify color or others will learn textures. Filters is a term left-over from image processing, which is the field where convolutions were first explored. Combining all of these together, a 2D convolution will have an input shape of \((B, H, W, C)\) and an output of \((B, \approx H, \approx W, F)\), where \(F\) is the number of filters chosen, and the \(\approx\) accounts for the fact that when you slide your kernel window over the input data, you’ll lose some values on the edge. This can either be treated by padding, so your input height and width match output height and width, or your dimensionality is reduced by a small amount (e.g., going from \(128\times128\) to \(125\times125\)). A 1D convolution will have input shape \((B, L, C)\) and output shape \((B, \approx L, F)\). As a practical example, consider a convolution on DNA. \(L\) is length of the sequence. \(C\), your channels, will be one-hot indicators for the base (T, C, A, G).

One of the important properties we’ll begin to discuss is invariances and equivariances. An invariance means the output from a neural network (or a general function) is insensitive to changes in input. For example, a translational invariance means that the output does not change if the input is translated. Convolutions and pooling should be chosen when you want to have translation invariance. For example, if you are identifying if a cat exists in an image, you want your network to give the same answer even if the cat is translated in the image to different regions. However, just because you use a convolution layer does not make a neural network automatically translationally invariant. You must include other layers to achieve this. Convolutions are actually translationally equivariant – if you translate all pixels in your input, the output will also be translated. People usually do not distinguish between equivariance and invariance. If you are trying to identify where a cat is located in an image you would still use convolutions but you want your neural network to be translationally equivariant, meaning your guess about where the cat is located is sensitive to where the cat is located in the input pixels. The reason convolutions have this property is that the trainable parameters, the kernel, are location independent. You use the same kernel on every region of the input.

2.2.2. Pooling

Convolutions are commonly paired with pooling layers. If your goal is to produce a single number (regression) or class (classification) from an input image or sequence, you need to reduce the rank to 0, a scalar. After a convolution, you could use a reduction like average or maximum. It has been shown empirically that reducing the number of elements of your features more gradually is better. One way is through pooling. Pooling is similar to convolutions, in that you define a kernel shape (called window shape), but pooling has no trainable parameters. Instead, you run a window across your input grid and compute a reduction. Commonly an average or maximum is computed. If your pool window is a \(2\times2\) on an input of \((B, H, W, F)\), then your output will be \((B, H / 2, W / 2, F)\). In convolutional neural networks, often multiple blocks of convolutions and poolings are combined. For example, you might use three rounds of convolutions and pooling to take an image from \(32 \times 32\) down to a \(4 \times 4\). Pooling is translationally equivariant like convolutions. Read more about pooling here

2.2.3. Embedding

Another important type of input layers are embeddings. Embeddings convert integers into vectors. They are typically used to convert characters or words into numerical vectors. The characters or words are first converted into tokens separately as a pre-processing step and then the input to the embedding layer is the indices of the token. The indices are integer values that index into a dictionary of all possible tokens. It sounds more complex than it is. For example, we might tokenize characters in the alphabet. There are 26 tokens (letters) in the alphabet (dictionary of tokens) and we could convert the word “hello” into the indices \([7, 4, 11, 11, 14]\), where 7 means the 7th letter of the alphabet.

After converting into indices, an embedding layer converts these indices into dense vectors of a chosen dimension. The rationale behind embeddings is to go from a large discrete space (e.g., all words in the English language) into a much smaller space of real numbers (e.g., vectors of size 5). You might use embeddings for converting monomers in a polymer into dense vectors or atom identities in a molecule or DNA bases.

2.3. Running This Notebook

Click the    above to launch this page as an interactive Google Colab. See details below on installing packages, either on your own environment or on Google Colab

2.4. Example

At this point, we have enough common layers to try to build a neural network. We will combine these three layers to predict if a protein is soluble. Our dataset comes from [CSTR14] and consits of proteins known to be soluble or insoluble. As usual, the code below sets-up our imports.

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
import warnings
import urllib
sns.set_style('dark',  {'xtick.bottom':True, 'ytick.left':True, 'xtick.color': '#666666', 'ytick.color': '#666666',
                        'axes.edgecolor': '#666666', 'axes.linewidth':     0.8 , 'figure.dpi': 300})
color_cycle = ['#1BBC9B', '#F06060', '#5C4B51', '#F3B562', '#6e5687']
mpl.rcParams['axes.prop_cycle'] = mpl.cycler(color=color_cycle) 

Our task is binary classification. The data is split into two: positive and negative examples. We’ll need to rearrange a little into a normal dataset with labels and training/testing split. We also really really need to shuffle our data, so it doesn’t see all positives and then all negatives.

    'https://github.com/whitead/dmol-book/raw/master/data/solubility.npz', 'solubility.npz')
with np.load('solubility.npz') as r:
    pos_data, neg_data = r['positives'], r['negatives']

# create labels and stich it all into one 
# tensor
labels = np.concatenate((np.ones((pos_data.shape[0], 1), dtype=pos_data.dtype), np.zeros((neg_data.shape[0], 1) , dtype=pos_data.dtype)), axis=0)
features = np.concatenate((pos_data, neg_data), axis=0)
# we now need to shuffle before creating TF dataset
# so that our train/test/val splits are random
i = np.arange(len(labels))
labels = labels[i]
features = features[i]
full_data = tf.data.Dataset.from_tensor_slices((features, labels))

# now split into val, test, train
N = pos_data.shape[0] + neg_data.shape[0]
print(N, 'examples')
split = int(0.1 * N)
test_data = full_data.take(split).batch(16)
nontest = full_data.skip(split)
val_data, train_data = nontest.take(split).batch(16), nontest.skip(split).shuffle(1000).batch(16)
18453 examples

Before getting to modeling, let’s examine our data. The protein sequences have already been tokenized. There are 20 possible values at each position because there are 20 amino acids possible in proteins. Let’s see a soluble protein

array([13, 17, 15, 16,  1,  1,  1, 17,  8,  9,  7,  1,  1,  4,  7,  6,  2,
       11,  2,  7, 11,  2,  8, 11, 17,  2,  6, 11, 15, 17,  8, 20,  1, 20,
       20, 17,  1,  6,  4,  8,  7, 20,  1,  9,  8,  1, 17, 20, 16, 17, 20,
       16, 20, 11, 16,  6,  6, 15, 11,  2, 10,  8, 20, 16, 11,  2,  2,  8,
       16, 19, 11, 17,  8, 11, 10,  2,  6,  2,  2, 20, 14,  1, 11,  3, 20,
       11, 16, 16,  2,  6, 16,  1, 20,  1,  4, 18, 14,  1,  3, 15,  7,  2,
       15,  2,  8, 18,  2,  6, 14,  4, 19, 20,  2, 18, 17,  1,  9, 15, 12,
        1,  8, 13, 15, 20, 11,  7,  4,  1, 11,  1,  6, 11,  9,  5,  2, 11,
       17,  4, 11, 10, 15, 11,  8,  1, 16,  4,  4, 11, 11, 20,  1,  7, 20,
       11,  4,  8,  2,  8,  2,  3,  8,  2, 15, 11, 20,  3, 14,  3,  8,  2,
       11,  9,  4, 20,  7, 14,  2,  8, 20, 20,  2, 20, 16,  2,  4,  6, 15,
       16,  1, 20, 17, 16, 11,  7,  0,  0,  0,  0,  0,  0])

Notice that integers/indices are used because our data is tokenized already. To make our data all be the same input shape, a special token (0) is inserted at the end indicating no amino acid is present. This needs to be treated carefully, because it should be zeroed throughout the network. Luckily this is built into Keras, so we do not need to worry about it.

This data is perfect for an embedding because we need to convert token indices to real vectors. Then we will use 1D convolutions to look for sequence patterns with pooling. We need to then make sure our final layer is a sigmoid. This architecture is inspired by the original work on pooling with convolutions [LecunBottouBengioHaffner98]. The number of layers and kernel sizes below are hyperparameters. You are encouraged to experiment with these or find improvements!

We begin with an embedding. We’ll use a 2 dimensional embedding, which gives us two channels for our sequence. Our kernel filter size for the 1D convolution will be 5 and we’ll use 16 filters. Beyond that, the rest of the network is about distilling gradually into a final class.

model = tf.keras.Sequential()

# make embedding and indicate that 0 should be treated specially
model.add(tf.keras.layers.Embedding(input_dim=21, output_dim=16, mask_zero=True, input_length=pos_data.shape[-1]))

# now we move to convolutions and pooling
model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=5, activation='relu'))

model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=3, activation='relu'))

model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=3,activation='relu'))

# now we flatten to move to hidden dense layers.
# Flattening just removes all axes except 1 (and implicit batch is still in there as always!)


model.add(tf.keras.layers.Dense(256, activation='relu'))
model.add(tf.keras.layers.Dense(64, activation='tanh'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

Model: "sequential"
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 200, 16)           336       
conv1d (Conv1D)              (None, 196, 16)           1296      
max_pooling1d (MaxPooling1D) (None, 49, 16)            0         
conv1d_1 (Conv1D)            (None, 47, 16)            784       
max_pooling1d_1 (MaxPooling1 (None, 23, 16)            0         
conv1d_2 (Conv1D)            (None, 21, 16)            784       
max_pooling1d_2 (MaxPooling1 (None, 10, 16)            0         
flatten (Flatten)            (None, 160)               0         
dense (Dense)                (None, 256)               41216     
dense_1 (Dense)              (None, 64)                16448     
dense_2 (Dense)              (None, 1)                 65        
Total params: 60,929
Trainable params: 60,929
Non-trainable params: 0

Take a moment to look at the model summary (shapes) and model diagram. This is a fairly complex and modern neural network. If you can understand this, you can understand most of the current research being done in deep learning. Now we’ll begin training. Since we are doing classification, we’ll also examine accuracy on validation data as we train.

model.compile('adam', loss='binary_crossentropy', metrics=['accuracy'])
result = model.fit(train_data, validation_data=val_data, epochs=10, verbose=0)
plt.plot(result.history['loss'], label='training')
plt.plot(result.history['val_loss'], label='validation')

You can see this is a classic case of overfitting, with the validation data rising quickly as we improve our loss on the training data. Indeed, our model is quite expressive in its capability to fit the training data but it is incidentally fititng the noise. We have 61,000 trainable parameters and about 15,000 training examples, so this is not a surprise. However, we still able to learn a little bit – our accuracy is above 50%. This is actually a challenging dataset and the state-of-the art result is 77% accuracy [KRK+18]. We need to expand our tools to include layers that can address overfitting.

2.5. Regularization

As we saw in the ML chapters, regularization is a strategy that changes your training procedure (often by adding loss terms) to prevent overfitting. There is a nice argument for it in the bias-variance trade-off regarding model complexity, however this doesn’t seem to hold in practice\cite{neal2018modern}. Thus we view regularization as an empirical process. Regularization, like other hyperparameter tuning, is dependent on the layers, how complex your model is, your data, and especially if your model is underfit or overfit. Underfitting means you could train longer to improve validation loss. Adding regularization if your model is underfit will usually reduce performance. Consider training longer or adjusting learning rates if you observe this.

2.5.1. Early Stopping

The most commonly used and simplest form of regularization is early stopping. Early stopping means monitoring the loss on your validation data and stopping training once it begins to rise. Normally, training is done until convergened – meaning the loss stops decreasing. Early stopping tries to prevent overfitting by looking at the loss on unseen data (validation data) and stopping once that begins to rise. This is an example of regularization because the weights are limited to move a fix distance from their initial value. Just like in L2 regularization, we’re squeezing our trainable weights.

2.5.2. Weight

Weight regularization is the addition of terms to the loss that depend on the trainable weights in the model like we saw before. These can be L2 \(\sqrt{\sum w_i^2}\) or L1 \(\sum \left|w_i\right|\). You must choose the strength, which is expressed as a parameter (often denoted \(\lambda\)) that should be much less than \(1\). Typically values of \(0.1\) to \(1\times10^{-4}\) are chosen. This may be broken into kernel regularization, which affects the multiplicative weights in a dense or convolution neural network, and bias regularization. Bias regularization is rarely seen in practice.

2.5.3. Activity

Activity regularization is the addition of terms to the loss that depend on the output from a layer. Activity regularization ultimately leads to minimizing weight magnitudes, but it makes the strength of that effect depend on the output from the layers. Weight regularization has the strongest effect on weights that have little effect on layer output, because they have no gradient of they have little effect on the output. In constrast, activity regularization has the strongest effect on weights that greatly affect layer output. Conceptually, weight regularization reduces weights that are unimportant but could harm generalization error if there is a shift in the type of features seen in testing. Activity regularization reduces weights that affect layer output and is more akin to early stopping by reducing how far those weights can move in training.

2.5.4. Batch Normalization

It is arguable if batch normalization is a regularization technique or itself a layer type. Batch normalization is a layer that is added to neural network with trainable weights, but its trainable weights are not updated via gradient descent of the loss. Batch normalization has a layer equation of:

(2.16)\[\begin{equation} f(X) = \frac{X - \bar{X}}{S} \end{equation}\]

where \(\bar{X}\) and \(S\) are the sample mean and variance taken across the batch axis (zeroth axis of \(X\)). This has the effect of “smoothing” out the magnitudes of values seen between batches. You may think this will have the same effect as randomizing your training data, so there is no batch to batch variation. Surprisingly, batch normalization is effective layer and common even if there is no batch to batch variation. At inference time, \(\bar{X}\) and \(S\) are set to the average across all batches seen in training data.

2.5.5. Dropout

The last regularization type is dropout. Like batch normalization, dropout is typically viewed as a layer and has no trainable parameters. In dropout, we randomly zero-out specific elements if the input and then rescale the output so its average magnitude is unchanged. You can think of it like masking. There is a mask tensor \(M\) which contains 1s and 0s and is multiplied by the input. Then the output is multiplied by \(L / \sum M\) where \(L\) is the number of elements in \(M\). Dropout forces your neural network to learn to use different features or “pathways” by zeroing out elements. Weight regularization squeezes unused trainable weights through minimization. Dropout tries to force all trainable weights to be used by randomly negating weights. Dropout is more common than weight or activity regularization but has arguable theoretical merit. Some have proposed it is a kind of sampling mechanism for exploring model variations\cite{gal2016dropout}. Despite it appearing ad-hoc, it is effective. Note that dropout is only used during training, not for inference. You need to choose the dropout rate when using it, another hyperparameter. Usually, you will want to choose a rate of 0.05-0.35. 0.2 is common.

2.6. Residues

One last “layer” note to mention is residues. One of the classic problems in neural network training is vanishing gradients. If your neural network is deep and many features contribute to the label, you can have very small gradients during training that make it difficult to train. This is visible as underfitting. One way this can be addressed is through careful choice of optimization and learning rates. Another way is to add “residue” connections in the neural network. Residue connections are a fancy way of saying “adding” or “concatonating” later layers with early layers. The most common way to do this is:

(2.17)\[\begin{equation} X^{i + 1} = \sigma(W^iX^i + b^i) + X^i \end{equation}\]

This is the usual equation for a dense neural network but we’ve added the previous layer output (\(X^i\)) to our output. Now when you take a gradient of earlier weights from layer \(i - 1\), they will appear through both the \(\sigma(W^iX^i + b^i)\) term via the chain rule and \(X^i\) term. This goes around the activation \(\sigma\) and the effect of \(W^i\). Note that this continues at all layers and then a gradient can propogate back to earlier layers via either term. You can add the “residue” connection to the previous layer as shown here or go back even earlier. You can also be more complex and use a trainable function for how the residue term (\(X^i\)) can be treated. For example:

(2.18)\[\begin{equation} X^{i + 1} = \sigma(W^iX^i + b^i) + W'^i X^i \end{equation}\]

where \(W'^i\) is a set of new trainable parameters. We have seen that there are many hyperaprametes for tuning and adjusting residue connections is one of the least effective thing to adjust. So don’t expect much of an improvement. However, if you’re seeing underfitting and inefficient training, perhaps it’s worth investigating.

2.7. Blocks

You can imagine that we might join a dense layer with dropout, batch normalization, and maybe a residue. When you group multiple layers together, this can be called a block for simplicity. For example, you might use the word “convolution block” to describe a sequential layers of convolution, pooling, and dropout.

2.8. Dropout Regularization Example

Now let’s try to add a few dropout layers to see if we can do better on our example above.

model = tf.keras.Sequential()

# make embedding and indicate that 0 should be treated specially
model.add(tf.keras.layers.Embedding(input_dim=21, output_dim=16, mask_zero=True, input_length=pos_data.shape[-1]))

# now we move to convolutions and pooling
model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=5, activation='relu'))

model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=3, activation='relu'))

model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=3,activation='relu'))

# now we flatten to move to hidden dense layers.
# Flattening just removes all axes except 1 (and implicit batch is still in there as always!)


model.add(tf.keras.layers.Dense(256, activation='relu'))
model.add(tf.keras.layers.Dense(64, activation='tanh'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

Model: "sequential_1"
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 200, 16)           336       
conv1d_3 (Conv1D)            (None, 196, 16)           1296      
max_pooling1d_3 (MaxPooling1 (None, 49, 16)            0         
conv1d_4 (Conv1D)            (None, 47, 16)            784       
max_pooling1d_4 (MaxPooling1 (None, 23, 16)            0         
conv1d_5 (Conv1D)            (None, 21, 16)            784       
max_pooling1d_5 (MaxPooling1 (None, 10, 16)            0         
flatten_1 (Flatten)          (None, 160)               0         
dropout (Dropout)            (None, 160)               0         
dense_3 (Dense)              (None, 256)               41216     
dropout_1 (Dropout)          (None, 256)               0         
dense_4 (Dense)              (None, 64)                16448     
dropout_2 (Dropout)          (None, 64)                0         
dense_5 (Dense)              (None, 1)                 65        
Total params: 60,929
Trainable params: 60,929
Non-trainable params: 0
model.compile('adam', loss='binary_crossentropy', metrics=['accuracy'])
result = model.fit(train_data, validation_data=val_data, epochs=10, verbose=0)
plt.plot(result.history['loss'], label='training')
plt.plot(result.history['val_loss'], label='validation')

We added a few dropout layers and now we can see the validation loss is a little better but additional training will indeed result it in rising. Feel free to try the other ideas above to see if you can get the validation loss to decrease like the training loss.

2.9. L2 Weight Regularization Example

Now we’ll demonstrate adding weight regularization.

model = tf.keras.Sequential()

# make embedding and indicate that 0 should be treated specially
model.add(tf.keras.layers.Embedding(input_dim=21, output_dim=16, mask_zero=True, input_length=pos_data.shape[-1]))

# now we move to convolutions and pooling
model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=5, activation='relu'))

model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=3, activation='relu'))

model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=3, activation='relu'))

# now we flatten to move to hidden dense layers.
# Flattening just removes all axes except 1 (and implicit batch is still in there as always!)


model.add(tf.keras.layers.Dense(256, activation='relu', kernel_regularizer='l2'))
model.add(tf.keras.layers.Dense(64, activation='tanh', kernel_regularizer='l2'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.compile('adam', loss='binary_crossentropy', metrics=['accuracy'])
result = model.fit(train_data, validation_data=val_data, epochs=10, verbose=0)

plt.plot(result.history['loss'], label='training')
plt.plot(result.history['val_loss'], label='validation')

L2 regularization is too strong it appears, preventing learning. You could go back and reduce the strength, but this is just a demonstration.

2.10. Chapter Summary

  • Layers are created for specific tasks, and given the variety of layers, there are a vast number of permutations of layers in a deep neural network.

  • Convolution layers are used for data defined on a regular grid (such as images). In a convolution, one defines the size of the trainable parameters through the kernel shape.

  • An invariance is when the output from a neural network is insensitive to changes in the input.

  • An equivariance is when the output changes linearly with respect to the input.

  • Convolution layers are often paired with pooling layers. A pooling layer behaves similarly to a convolution layer, except a reduction is computed and the output is a smaller shape (same rank) than the input.

  • Embedding layers convert indices into vectors, and are typically used as pre-processing steps.

  • Hyperparameters are choices regarding the shape of the layers, the activation function, initialization parameters, and other layer arguments. They can be tuned but are not trained on the data.

  • Hyperparameters must be tuned by hand, as they can be continuous, categorical, or discrete variables, but there are algorithms being researched that tune hyperparameters.

  • Tuning the hyperparameters can make training faster or require less training data.

  • Using a validation data set can measure the overfitting of training data, and is used to help choose the hyperparameters.

  • Regularization is an empirical technique used to change training procedures to prevent overfitting. There are five common types of regularization: early stopping, weight regularization, activity regularization, batch normalization, and dropout.

  • Vanishing gradient problems can be addressed by adding “residue” connections, essentially adding later layers with early layers in the neural network.

2.11. Cited References


Shumeet Baluja and Dean A. Pomerleau. Expectation-based selective attention for visual monitoring and control of a robot vehicle. Robotics and Autonomous Systems, 22(3):329 – 344, 1997. Robot Learning: The New Wave. URL: http://www.sciencedirect.com/science/article/pii/S0921889097000468, doi:https://doi.org/10.1016/S0921-8890(97)00046-8.


Albert P. Bartók, Risi Kondor, and Gábor Csányi. On representing chemical environments. Phys. Rev. B, 87:184115, May 2013. URL: https://link.aps.org/doi/10.1103/PhysRevB.87.184115, doi:10.1103/PhysRevB.87.184115.


Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, and others. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.


Simon Batzner, Tess E. Smidt, Lixin Sun, Jonathan P. Mailoa, Mordechai Kornbluth, Nicola Molinari, and Boris Kozinsky. Se(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. arXiv preprint arXiv:2101.03164, 2021.


Jörg Behler. Atom-centered symmetry functions for constructing high-dimensional neural network potentials. The Journal of chemical physics, 134(7):074106, 2011.


Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.


Catherine Ching Han Chang, Jiangning Song, Beng Ti Tey, and Ramakrishnan Nagasundara Ramanan. Bioinformatics approaches for improved recombinant protein production in escherichia coli: protein solubility prediction. Briefings in bioinformatics, 15(6):953–962, 2014.


Alex K Chew, Shengli Jiang, Weiqi Zhang, Victor M Zavala, and Reid C Van Lehn. Fast predictions of liquid-phase acid-catalyzed reaction rates using molecular dynamics simulations and convolutional neural networks. Chemical Science, 11(46):12464–12476, 2020.


Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.


Dan Claudiu Cireşan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010. doi:10.1162/NECO\_a\_00052.


Taco Cohen and Max Welling. Group equivariant convolutional networks. In International conference on machine learning, 2990–2999. 2016.


Taco S Cohen, Mario Geiger, and Maurice Weiler. A general theory of equivariant cnns on homogeneous spaces. Advances in neural information processing systems, 32:9145–9156, 2019.


Hari Prasanna Das, Pieter Abbeel, and Costas J Spanos. Dimensionality reduction flows. arXiv preprint arXiv:1908.01686, 2019.


Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.


Vijay Prakash Dwivedi, Chaitanya K Joshi, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982, 2020.


Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. A fair comparison of graph neural networks for graph classification. In International Conference on Learning Representations. 2019.


Carlos Esteves. Theoretical aspects of group equivariant neural networks. arXiv preprint arXiv:2004.05154, 2020.


Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.


Marc Finzi, Samuel Stanton, Pavel Izmailov, and Andrew Gordon Wilson. Generalizing convolutional neural networks for equivariance to lie groups on arbitrary continuous data. arXiv preprint arXiv:2002.12880, 2020.


Jefferson Foote and Anandi Raman. A relation between the principal axes of inertia and ligand binding. Proceedings of the National Academy of Sciences, 97(3):978–983, 2000.


Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.


Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, 249–256. 2010.


Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in neural information processing systems, 1024–1034. 2017.


Bowen Jing, Stephan Eismann, Patricia Suriana, Raphael JL Townshend, and Ron Dror. Learning from protein structure with geometric vector perceptrons. arXiv preprint arXiv:2009.01411, 2020.


Sameer Khurana, Reda Rawi, Khalid Kunji, Gwo-Yu Chuang, Halima Bensmail, and Raghvendra Mall. DeepSol: a deep learning framework for sequence-based protein solubility prediction. Bioinformatics, 34(15):2605–2613, 03 2018. URL: https://doi.org/10.1093/bioinformatics/bty166, arXiv:https://academic.oup.com/bioinformatics/article-pdf/34/15/2605/25230875/bty166.pdf, doi:10.1093/bioinformatics/bty166.


Sungwon Kim, Sang-gil Lee, Jongyoon Song, Jaehyeon Kim, and Sungroh Yoon. Flowavenet: a generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018.


Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, 4743–4751. 2016.


Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.


Johannes Klicpera, Janek Groß, and Stephan Günnemann. Directional message passing for molecular graphs. In International Conference on Learning Representations. 2019.


Johannes Klicpera, Janek Groß, and Stephan Günnemann. Directional message passing for molecular graphs. arXiv preprint arXiv:2003.03123, 2020.


Ivan Kobyzev, Simon Prince, and Marcus Brubaker. Normalizing flows: an introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.


Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. In International Conference on Machine Learning, 2747–2755. 2018.


Mario Krenn, Florian Häse, AkshatKumar Nigam, Pascal Friederich, and Alan Aspuru-Guzik. Self-referencing embedded strings (selfies): a 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4):045024, 2020.


Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: a novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52, 2018. URL: http://jmlr.org/papers/v18/16-558.html.


Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.


Zhiheng Li, Geemi P Wellawatte, Maghesree Chakraborty, Heta A Gandhi, Chenliang Xu, and Andrew D White. Graph neural network based coarse-grained mapping prediction. Chemical Science, 11(35):9524–9531, 2020.


Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.


Enxhell Luzhnica, Ben Day, and Pietro Liò. On graph classification networks, datasets and baselines. arXiv preprint arXiv:1905.04682, 2019.


Emile Mathieu, Tom Rainforth, N Siddharth, and Yee Whye Teh. Disentangling disentanglement in variational autoencoders. In International Conference on Machine Learning, 4402–4412. PMLR, 2019.


Łukasz Maziarka, Tomasz Danel, Sławomir Mucha, Krzysztof Rataj, Jacek Tabor, and Stanisław Jastrzębski. Molecule attention transformer. arXiv preprint arXiv:2002.08264, 2020.


Diego Mesquita, Amauri Souza, and Samuel Kaski. Rethinking pooling in graph neural networks. Advances in Neural Information Processing Systems, 2020.


Benjamin Kurt Miller, Mario Geiger, Tess E Smidt, and Frank Noé. Relevance of rotationally equivariant convolutions for predicting molecular properties. arXiv preprint arXiv:2008.08461, 2020.


Felix Musil, Andrea Grisafi, Albert P. Bartók, Christoph Ortner, Gábor Csányi, and Michele Ceriotti. Physics-inspired structural representations for molecules and materials. arXiv preprint arXiv:2101.04673, 2021.


George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.


George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, 2338–2347. 2017.


George Papamakarios, David Sterratt, and Iain Murray. Sequential neural likelihood: fast likelihood-free inference with autoregressive flows. In The 22nd International Conference on Artificial Intelligence and Statistics, 837–848. PMLR, 2019.


Siamak Ravanbakhsh, Jeff Schneider, and Barnabás Póczos. Equivariance through parameter-sharing. In International Conference on Machine Learning, 2892–2901. 2017.


David W Romero, Erik J Bekkers, Jakub M Tomczak, and Mark Hoogendoorn. Attentive group equivariant convolutional networks. arXiv, pages arXiv–2002, 2020.


Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O Anatole Von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Physical review letters, 108(5):058301, 2012.


Jean-Pierre Serre. Linear representations of finite groups. Volume 42. Springer, 1977.


Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.


Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: rotation-and translation-equivariant neural networks for 3d point clouds. arXiv preprint arXiv:1802.08219, 2018.


Anne M Treisman and Garry Gelade. A feature-integration theory of attention. Cognitive psychology, 12(1):97–136, 1980.


Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, 5998–6008. 2017.


Renhao Wang, Marjan Albooyeh, and Siamak Ravanbakhsh. Equivariant maps for hierarchical structures. arXiv preprint arXiv:2006.03627, 2020.


Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, and Taco S Cohen. 3d steerable cnns: learning rotationally equivariant features in volumetric data. In Advances in Neural Information Processing Systems, 10381–10392. 2018.


Robin Winter, Frank Noé, and Djork-Arné Clevert. Auto-encoding molecular conformations. arXiv preprint arXiv:2101.01618, 2021.


Hao Wu, Jonas Köhler, and Frank Noé. Stochastic normalizing flows. arXiv preprint arXiv:2002.06707, 2020.


Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 2020.


Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations. 2018.


Ziyue Yang, Maghesree Chakraborty, and Andrew D White. Predicting chemical shifts with graph neural networks. bioRxiv, 2020.


Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in neural information processing systems, 3391–3401. 2017.


Anthony Zee. Group theory in a nutshell for physicists. Princeton University Press, 2016.


Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. Gaan: gated attention networks for learning on large and spatiotemporal graphs. arXiv preprint arXiv:1803.07294, 2018.


Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2017. URL: https://arxiv.org/abs/1611.01578.


Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.