2. Standard Layers

We will now see an overview of the enormous diversity in deep learning layers. This survey is necessarily limited to standard layers, and we begin without considering the key layers that enable deep learning of molecules and materials. Almost all the layers listed below came out of a model for a specific task and were not thought up independently. That means that some of the layers are suited to specific tasks, and often the nomenclature around a layer is targeted towards a specific kind of data.

The most common type is image data, so we begin with an overview of how image features are represented. An image is a rank 3 tensor with shape \((H, W, C)\) where \(H\) is the height of the image, \(W\) is the width, and \(C\) is the number of channels (typically 3 – red, green, blue). Since all training is in batches, the input features shape will be \((B, H, W, C)\). Layers are often described as taking an input with a batch axis, some number of shape axes, and then finally a channel axis. The layers will then operate on perhaps only the channels or only the shape dimensions. The layers are all quite flexible, so this is not a limitation in practice, but it’s important to know when reading about layer types. Often the documentation or literature will mention batch number or channels; these are typically the first and last axes of a tensor, respectively.

Note

Everything and nothing is batched in deep learning. Practically, data is always batched. Even if your data is not batched, the first axis input to a neural network is of unspecified dimension and called the batch axis. Many frameworks make this implicit, meaning if you say the output from one layer is shape \((4,5)\), it will be \((B, 4, 5)\) when you actually inspect data. So, all data is batched but often the math, frameworks, and documentation make it seem as if there is no batch axis.
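
To make the axis conventions concrete, here is a minimal NumPy sketch. The sizes are hypothetical and chosen only for illustration. It shows a single image, a batch of images, and how a single example gets a batch axis of size 1.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration
B, H, W, C = 16, 32, 32, 3

# one RGB image: rank 3 tensor (height, width, channels)
image = np.zeros((H, W, C))

# a batch of images: the batch axis comes first, channels come last
batch = np.zeros((B, H, W, C))

# a single example still needs a batch axis of size 1 before entering a network
single = image[np.newaxis, ...]

print(image.ndim, batch.shape, single.shape)
```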

2.1. Hyperparameters

We saw from the FC/Dense layer that we have to choose whether to use a bias, the activation, and the output shape. As we learn about more complex layers, there will be more choices. These choices begin to accumulate, and in a neural network you may have billions of possible combinations of them. These choices about shape, activation, initialization, and other layer arguments are called hyperparameters. They are parameters in the sense that they can be tuned, but they are not trained on our data, so we call them hyperparameters to distinguish them from the “regular” parameters like the values of weights and biases in the layers. The name is inherited from Bayesian statistics.

Choosing these hyperparameters is difficult, and we typically rely on the body of existing literature to understand ranges of reasonable parameters. In deep learning, we are usually in a regime of hyperparameters that yields many trainable parameters (deep networks), and thus our models can represent any function. Our models are expressive. However, optimizing hyperparameters can make training faster and/or require less data. For example, papers have shown that carefully choosing the initial value of weights can be more effective than complex architecture [GB10]. Another example found that convolutions, which are thought to be the most important layer for image recognition, are not necessary if hyperparameters are chosen correctly for dense neural networks [CMGS10]. Automated options for tuning hyperparameters are becoming available, but the current state of the art is to take hyperparameters from previous work as a starting guess and change them a little if you believe it is needed.

2.1.1. Validation

The number of hyperparameters is high enough that overfitting can actually occur by choosing hyperparameters that minimize error on the test set. This is surprising because we don’t explicitly train hyperparameters. Nevertheless, you will find in your own work that if you use the test data extensively in hyperparameter tuning and for assessing overfitting of the regular training parameters, your performance will be overfit to the testing data. To combat this, we split our data three ways in deep learning:

  1. Training data: used for trainable parameters.

  2. Validation data: used to choose hyperparameters or measure overfitting of training data.

  3. Test data: data not used for anything except final reported error.
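
The three-way split above can be sketched with NumPy as follows. The dataset here is synthetic and the 80/10/10 proportions are an assumption for illustration; the important parts are shuffling once and keeping the three subsets disjoint.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000                                  # hypothetical number of examples
features = rng.normal(size=(N, 8))        # synthetic features
labels = rng.integers(0, 2, size=N)       # synthetic binary labels

# shuffle once so each split is a random sample of the data
order = rng.permutation(N)
features, labels = features[order], labels[order]

n_test = n_val = int(0.1 * N)
test = (features[:n_test], labels[:n_test])
val = (features[n_test:n_test + n_val], labels[n_test:n_test + n_val])
train = (features[n_test + n_val:], labels[n_test + n_val:])

print(len(train[0]), len(val[0]), len(test[0]))
```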

To clean up our nomenclature here, we use the term generalization error to refer to performance on a hypothetical infinite stream of unseen data. So regardless of whether you split three ways or use other approaches, generalization error means error on unseen data.

2.1.2. Tuning

So how do you tune hyperparameters? The main answer is by hand, but this is an active area of research. Hyperparameters can be continuous (e.g., regularization strength), categorical (e.g., which activation), or discrete (e.g., number of layers). One category of ways to tune hyperparameters is a topic called meta-learning [FAL17], which aims to learn hyperparameters by looking at multiple related datasets. Another area is automated machine learning (auto-ML) [ZL17], where optimization strategies that do not require derivatives can tune hyperparameters. An important category of optimization related to hyperparameter tuning is multi-armed bandit optimization, where we explicitly treat the fact that we have a finite amount of computational resources for tuning hyperparameters [LJD+18]. Specific implementations of these methods can be found in many ML frameworks now, for example Keras-Tuner for Keras or Ray Tune for PyTorch.
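
As a rough illustration of derivative-free tuning, here is a minimal random search sketch over one continuous, one categorical, and one discrete hyperparameter. Random search is a simple stand-in and is not one of the specific methods cited above. The `validation_loss` function is a hypothetical toy; in practice it would train a model and return its loss on the validation data.

```python
import random

random.seed(0)

def validation_loss(reg_strength, activation, n_layers):
    # Toy stand-in (an assumption): a real version would train a model with
    # these hyperparameters and return its loss on the validation data.
    penalty = {'relu': 0.0, 'tanh': 0.1}[activation]
    return (reg_strength - 0.01) ** 2 + 0.05 * abs(n_layers - 3) + penalty

def sample_hyperparameters():
    return {
        'reg_strength': 10 ** random.uniform(-4, -1),   # continuous, log scale
        'activation': random.choice(['relu', 'tanh']),  # categorical
        'n_layers': random.randint(1, 6),               # discrete
    }

# try 50 random combinations and keep the one with the lowest validation loss
trials = [sample_hyperparameters() for _ in range(50)]
best = min(trials, key=lambda hp: validation_loss(**hp))
print(best)
```

Each trial here is cheap, but with a real model every call costs a full training run, which is why the budget-aware bandit methods mentioned above are attractive.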

2.2. Common Layers

Now that we have some understanding of hyperparameters and their role, let’s survey the common types of layers.

2.2.1. Convolutions

You can find a more thorough overview of convolutions here and here with more visuals. Here is a nice video on this. Convolutions are the most commonly used input layer when dealing with images or other data defined on a regular grid. In chemistry, you’ll see convolutions on protein or DNA sequences, on 2D imaging data, and occasionally on 3D spatial data like average density from a molecular simulation. What makes a convolution different from a dense layer is that the number of trainable weights is not tied to input grid shape \(\times\) output shape, which is what you would get with a dense layer. Since the trainable parameters don’t depend on the input grid shape, you don’t learn to depend on location in the image. This is important if you’re hoping to learn something independent of location on the input grid – like whether a specific object is present in the image independent of where it is located.

In a convolution, you specify a kernel shape that defines the size of the trainable parameters. The kernel shape defines a window over your input data in which a dense neural network is applied. The rank of the kernel shape is the rank of your grid + 1, where the extra axis accounts for channels. For example, for images you might define a kernel shape of \(5\times5\). The kernel shape will become \(5\times5\times{}C\), where \(C\) is the number of channels. When referring to a convolution as 1D, 2D, or 3D, we’re referring to the grid of the input data and thus the kernel shape. A 2D convolution actually takes rank 4 tensors as input, the extra 2 axes accounting for batch and channels. The kernel shape of \(5\times5\) means that the output at a specific position in the grid will depend on the input pixel at that position and its 24 nearest neighboring pixels (2 in each direction). Note that the kernel is used like a normal dense layer – it can have a bias (a single value per kernel), output activation, and regularization.
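
To see the sliding window in action, here is a minimal NumPy sketch of a single \(5\times5\) kernel applied to one small image with no batch axis. The image size, the random weights, and the ReLU activation are assumptions for illustration, and real frameworks use much faster implementations than these explicit loops.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 8, 8, 3                       # hypothetical small image
x = rng.normal(size=(H, W, C))

kernel = rng.normal(size=(5, 5, C))     # trainable weights: 5 x 5 x C
bias = 0.1                              # one scalar bias for this kernel

def conv2d_single(x, kernel, bias):
    kh, kw, _ = kernel.shape
    # without padding, the window only fits in (H - kh + 1) x (W - kw + 1) places
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = x[i:i + kh, j:j + kw, :]        # 5 x 5 x C patch
            out[i, j] = np.sum(window * kernel) + bias
    return np.maximum(out, 0)                         # ReLU activation

y = conv2d_single(x, kernel, bias)
print(y.shape, kernel.size + 1)   # output grid shape, number of trainable values
```

The same 76 trainable values are reused at every window position, so the parameter count would not change if the image were larger.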

Practically, convolutions are always grouped in parallel. You have a set of \(F\) kernels, where \(F\) is called the number of filters. Each of these filters is completely independent, and if you examine what they learn, some filters will learn to identify squares, some might learn to identify color, and others will learn textures. “Filters” is a term left over from image processing, which is the field where convolutions were first explored. Combining all of these together, a 2D convolution will have an input shape of \((B, H, W, C)\) and an output of \((B, \approx H, \approx W, F)\), where \(F\) is the number of filters chosen, and the \(\approx\) accounts for the fact that when you slide your kernel window over the input data, you’ll lose some values on the edge. This can either be treated by padding, so your input height and width match output height and width, or your dimensionality is reduced by a small amount (e.g., going from \(128\times128\) to \(124\times124\) with a \(5\times5\) kernel). A 1D convolution will have input shape \((B, L, C)\) and output shape \((B, \approx L, F)\). As a practical example, consider a convolution on DNA. \(L\) is the length of the sequence. \(C\), your channels, will be one-hot indicators for the base (T, C, A, G).
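
The DNA example can be sketched in NumPy as follows. The sequence, kernel size, and number of filters are hypothetical, and the kernels are random rather than trained; the goal is only to show the shapes \((B, L, C)\) going to \((B, \approx L, F)\) without padding.

```python
import numpy as np

rng = np.random.default_rng(0)
bases = 'TCAG'
seq = 'ATGCGTACGTTAGC'                      # hypothetical sequence, L = 14

# one-hot encode: shape (L, C) with C = 4 channels
x = np.zeros((len(seq), len(bases)))
x[np.arange(len(seq)), [bases.index(b) for b in seq]] = 1
x = x[np.newaxis]                           # add batch axis -> (B, L, C)

k, F = 5, 16                                # kernel size and number of filters
kernels = rng.normal(size=(k, x.shape[-1], F))

# slide the window along L; each window is contracted with all F kernels at once
L_out = x.shape[1] - k + 1
y = np.stack([np.einsum('bkc,kcf->bf', x[:, i:i + k], kernels)
              for i in range(L_out)], axis=1)

print(x.shape, y.shape)
```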

Two important properties we’ll begin to discuss are invariance and equivariance. An invariance means the output from a neural network (or a general function) is insensitive to changes in input. For example, a translational invariance means that the output does not change if the input is translated. Convolutions and pooling should be chosen when you want to have translation invariance. For example, if you are identifying whether a cat exists in an image, you want your network to give the same answer even if the cat is translated in the image to different regions. However, just because you use a convolution layer does not make a neural network automatically translationally invariant. You must include other layers to achieve this. Convolutions are actually translationally equivariant – if you translate all pixels in your input, the output will also be translated. People usually do not distinguish between equivariance and invariance. If you are trying to identify where a cat is located in an image, you would still use convolutions, but you want your neural network to be translationally equivariant, meaning your guess about where the cat is located is sensitive to where the cat is located in the input pixels. The reason convolutions have this property is that the trainable parameters, the kernel, are location independent. You use the same kernel on every region of the input.
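
The difference between equivariance and invariance can be checked numerically. This is a minimal sketch with a hypothetical 1D signal whose nonzero “feature” sits away from the edges, so that shifting it with `np.roll` does not wrap anything around. The global maximum stands in for the extra layers needed to get invariance.

```python
import numpy as np

rng = np.random.default_rng(0)
kernel = rng.normal(size=3)

def conv1d(x, kernel):
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

x = np.zeros(20)
x[5:10] = rng.normal(size=5)       # a "feature" away from the edges
x_shifted = np.roll(x, 4)          # translate the input by 4 positions

y, y_shifted = conv1d(x, kernel), conv1d(x_shifted, kernel)

# equivariance: the convolution output is translated by the same amount
equivariant = np.allclose(y_shifted, np.roll(y, 4))
# invariance: a global max over positions ignores where the feature is
invariant = np.isclose(y.max(), y_shifted.max())
print(equivariant, invariant)
```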

2.2.2. Pooling

Convolutions are commonly paired with pooling layers. If your goal is to produce a single number (regression) or class (classification) from an input image or sequence, you need to reduce the rank to 0, a scalar. After a convolution, you could use a reduction like average or maximum. It has been shown empirically that reducing the number of elements of your features more gradually is better. One way is through pooling. Pooling is similar to convolutions, in that you define a kernel shape (called window shape), but pooling has no trainable parameters. Instead, you run a window across your input grid and compute a reduction. Commonly an average or maximum is computed. If your pool window is \(2\times2\) on an input of \((B, H, W, F)\), then your output will be \((B, H / 2, W / 2, F)\). In convolutional neural networks, multiple blocks of convolutions and poolings are often combined. For example, you might use three rounds of convolutions and pooling to take an image from \(32 \times 32\) down to \(4 \times 4\). Pooling is translationally equivariant like convolutions. Read more about pooling here.
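
Here is a minimal NumPy sketch of \(2\times2\) max pooling on a \((B, H, W, F)\) tensor. It assumes even height and width and non-overlapping windows, and it leaves out the convolutions that would normally sit between the pooling steps. Three rounds take a \(32\times32\) grid down to \(4\times4\).

```python
import numpy as np

def max_pool_2x2(x):
    # x has shape (B, H, W, F) with even H and W
    B, H, W, F = x.shape
    # split each grid axis into (number of windows, position inside window)
    windows = x.reshape(B, H // 2, 2, W // 2, 2, F)
    return windows.max(axis=(2, 4))      # reduce over each 2 x 2 window

x = np.random.default_rng(0).normal(size=(4, 32, 32, 16))
shapes = [x.shape]
for _ in range(3):
    x = max_pool_2x2(x)
    shapes.append(x.shape)
print(shapes)
```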

2.2.3. Embedding

Another important type of input layer is the embedding. Embeddings convert integers into vectors. They are typically used to convert characters or words into numerical vectors. The characters or words are first converted into tokens separately as a pre-processing step, and then the input to the embedding layer is the indices of the tokens. The indices are integer values that index into a dictionary of all possible tokens. It sounds more complex than it is. For example, we might tokenize characters in the alphabet. There are 26 tokens (letters) in the alphabet (dictionary of tokens) and we could convert the word “hello” into the indices \([7, 4, 11, 11, 14]\), where the indices count from zero, so 7 means “h”, the 8th letter of the alphabet.

After converting into indices, an embedding layer converts these indices into dense vectors of a chosen dimension. The rationale behind embeddings is to go from a large discrete space (e.g., all words in the English language) into a much smaller space of real numbers (e.g., vectors of size 5). You might use embeddings for converting monomers in a polymer into dense vectors or atom identities in a molecule or DNA bases.
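
The tokenize-then-look-up procedure can be sketched in a few lines. The embedding table below is filled with random numbers as a stand-in for trainable weights, and the embedding dimension of 5 is just the example size mentioned above.

```python
import numpy as np
import string

vocab = string.ascii_lowercase                  # dictionary of 26 tokens
indices = [vocab.index(ch) for ch in 'hello']   # tokenize into integer indices
print(indices)

# the embedding layer is a trainable table with one row per token
embedding_dim = 5
table = np.random.default_rng(0).normal(size=(len(vocab), embedding_dim))
vectors = table[indices]                        # look up one vector per index
print(vectors.shape)
```

Both occurrences of “l” map to the same row of the table, so they receive the same vector.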

2.4. Example

At this point, we have enough common layers to try to build a neural network. We will combine these three layers to predict whether a protein is soluble. Our dataset comes from [CSTR14] and consists of proteins known to be soluble or insoluble. As usual, the code below sets up our imports.

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
import warnings
import urllib.request
warnings.filterwarnings('ignore')
sns.set_context('notebook')
sns.set_style('dark',  {'xtick.bottom':True, 'ytick.left':True, 'xtick.color': '#666666', 'ytick.color': '#666666',
                        'axes.edgecolor': '#666666', 'axes.linewidth':     0.8 , 'figure.dpi': 300})
color_cycle = ['#1BBC9B', '#F06060', '#5C4B51', '#F3B562', '#6e5687']
mpl.rcParams['axes.prop_cycle'] = mpl.cycler(color=color_cycle) 
np.random.seed(0)

Our task is binary classification. The data is split into two: positive and negative examples. We’ll need to rearrange it a little into a normal dataset with labels and a training/testing split. We also really need to shuffle our data, so the model doesn’t see all positives and then all negatives.

urllib.request.urlretrieve(
    'https://github.com/whitead/dmol-book/raw/master/data/solubility.npz', 'solubility.npz')
with np.load('solubility.npz') as r:
    pos_data, neg_data = r['positives'], r['negatives']

# create labels and stitch it all into one
# tensor
labels = np.concatenate(
    (np.ones((pos_data.shape[0], 1), dtype=pos_data.dtype),
     np.zeros((neg_data.shape[0], 1), dtype=pos_data.dtype)),
    axis=0)
features = np.concatenate((pos_data, neg_data), axis=0)
# we now need to shuffle before creating TF dataset
# so that our train/test/val splits are random
i = np.arange(len(labels))
np.random.shuffle(i)
labels = labels[i]
features = features[i]
full_data = tf.data.Dataset.from_tensor_slices((features, labels))

# now split into val, test, train
N = pos_data.shape[0] + neg_data.shape[0]
print(N, 'examples')
split = int(0.1 * N)
test_data = full_data.take(split).batch(16)
nontest = full_data.skip(split)
val_data, train_data = nontest.take(split).batch(16), nontest.skip(split).shuffle(1000).batch(16)
18453 examples

Before getting to modeling, let’s examine our data. The protein sequences have already been tokenized. There are 20 possible values at each position because there are 20 amino acids possible in proteins. Let’s look at a soluble protein:

pos_data[0]
array([13, 17, 15, 16,  1,  1,  1, 17,  8,  9,  7,  1,  1,  4,  7,  6,  2,
       11,  2,  7, 11,  2,  8, 11, 17,  2,  6, 11, 15, 17,  8, 20,  1, 20,
       20, 17,  1,  6,  4,  8,  7, 20,  1,  9,  8,  1, 17, 20, 16, 17, 20,
       16, 20, 11, 16,  6,  6, 15, 11,  2, 10,  8, 20, 16, 11,  2,  2,  8,
       16, 19, 11, 17,  8, 11, 10,  2,  6,  2,  2, 20, 14,  1, 11,  3, 20,
       11, 16, 16,  2,  6, 16,  1, 20,  1,  4, 18, 14,  1,  3, 15,  7,  2,
       15,  2,  8, 18,  2,  6, 14,  4, 19, 20,  2, 18, 17,  1,  9, 15, 12,
        1,  8, 13, 15, 20, 11,  7,  4,  1, 11,  1,  6, 11,  9,  5,  2, 11,
       17,  4, 11, 10, 15, 11,  8,  1, 16,  4,  4, 11, 11, 20,  1,  7, 20,
       11,  4,  8,  2,  8,  2,  3,  8,  2, 15, 11, 20,  3, 14,  3,  8,  2,
       11,  9,  4, 20,  7, 14,  2,  8, 20, 20,  2, 20, 16,  2,  4,  6, 15,
       16,  1, 20, 17, 16, 11,  7,  0,  0,  0,  0,  0,  0])

Notice that integers/indices are used because our data is tokenized already. To make all of our data the same input shape, a special token (0) is inserted at the end indicating no amino acid is present. This needs to be treated carefully, because it should be zeroed throughout the network. Keras supports this through masking, which we request with mask_zero=True in the embedding layer, although not every layer respects the mask.

This data is perfect for an embedding because we need to convert token indices to real vectors. Then we will use 1D convolutions to look for sequence patterns with pooling. We need to then make sure our final layer is a sigmoid. This architecture is inspired by the original work on pooling with convolutions [LecunBottouBengioHaffner98]. The number of layers and kernel sizes below are hyperparameters. You are encouraged to experiment with these or find improvements!

We begin with an embedding. We’ll use a 16-dimensional embedding, which gives us 16 channels for our sequence. Our kernel size for the first 1D convolution will be 5 and we’ll use 16 filters. Beyond that, the rest of the network is about distilling gradually into a final class.

model = tf.keras.Sequential()

# make embedding and indicate that 0 should be treated specially
model.add(tf.keras.layers.Embedding(input_dim=21, output_dim=16, mask_zero=True, input_length=pos_data.shape[-1]))

# now we move to convolutions and pooling
model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=5, activation='relu'))
model.add(tf.keras.layers.MaxPooling1D(pool_size=4))

model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=3, activation='relu'))
model.add(tf.keras.layers.MaxPooling1D(pool_size=2))

model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=3,activation='relu'))
model.add(tf.keras.layers.MaxPooling1D(pool_size=2))

# now we flatten to move to hidden dense layers.
# Flattening just removes all axes except 1 (and implicit batch is still in there as always!)

model.add(tf.keras.layers.Flatten())

model.add(tf.keras.layers.Dense(256, activation='relu'))
model.add(tf.keras.layers.Dense(64, activation='tanh'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.summary()
tf.keras.utils.plot_model(model)
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 200, 16)           336       
_________________________________________________________________
conv1d (Conv1D)              (None, 196, 16)           1296      
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 49, 16)            0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 47, 16)            784       
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 23, 16)            0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 21, 16)            784       
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 10, 16)            0         
_________________________________________________________________
flatten (Flatten)            (None, 160)               0         
_________________________________________________________________
dense (Dense)                (None, 256)               41216     
_________________________________________________________________
dense_1 (Dense)              (None, 64)                16448     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 65        
=================================================================
Total params: 60,929
Trainable params: 60,929
Non-trainable params: 0
_________________________________________________________________
[Figure: diagram of the model layers produced by plot_model]
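
The output shapes and parameter counts in the summary above can be reproduced by hand. The sketch below assumes unpadded convolutions and pooling that drops incomplete windows, which matches the lengths printed in the summary. The helper names are hypothetical and only for illustration.

```python
def conv1d_shape(length, channels, filters, kernel_size):
    # unpadded convolution: the window must fit entirely inside the sequence
    params = kernel_size * channels * filters + filters   # weights + biases
    return length - kernel_size + 1, filters, params

def pool1d_shape(length, pool_size):
    return length // pool_size     # incomplete windows at the end are dropped

length, channels = 200, 16         # output of the embedding layer
total = 21 * 16                    # embedding table: 21 tokens x 16 dimensions
for kernel_size, pool_size in [(5, 4), (3, 2), (3, 2)]:
    length, channels, p = conv1d_shape(length, channels, 16, kernel_size)
    total += p
    length = pool1d_shape(length, pool_size)

flat = length * channels           # flatten: 10 positions x 16 channels
for units in [256, 64, 1]:
    total += flat * units + units  # dense weights + biases
    flat = units

print(length, channels, total)
```

Most of the parameters are in the first dense layer after flattening, not in the convolutions.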

Take a moment to look at the model summary (shapes) and model diagram. This is a fairly complex and modern neural network. If you can understand this, you’ll have a grasp on most current networks used in deep learning. Now we’ll begin training. Since we are doing classification, we’ll also examine accuracy on validation data as we train.

model.compile('adam', loss='binary_crossentropy', metrics=['accuracy'])
result = model.fit(train_data, validation_data=val_data, epochs=10, verbose=0)
plt.plot(result.history['loss'], label='training')
plt.plot(result.history['val_loss'], label='validation')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
[Figure: training and validation loss versus epoch]

You can see this is a classic case of overfitting, with the validation loss rising quickly as we improve our loss on the training data. Indeed, our model is quite expressive in its capability to fit the training data, but it is incidentally fitting the noise. We have about 61,000 trainable parameters and about 15,000 training examples, so this is not a surprise. However, we are still able to learn a little bit – our accuracy is above 50%. This is actually a challenging dataset and the state-of-the-art result is 77% accuracy [KRK+18]. We need to expand our tools to include layers that can address overfitting.

2.5. Regularization

As we saw in the ML chapters, regularization is a strategy that changes your training procedure (often by adding loss terms) to prevent overfitting. There is a nice argument for it in the bias-variance trade-off regarding model complexity; however, this doesn’t seem to hold in practice [NMB+18]. Thus we view regularization as an empirical process. Regularization, like other hyperparameter tuning, is dependent on the layers, how complex your model is, your data, and especially whether your model is underfit or overfit. Underfitting means you could train longer to improve validation loss. Adding regularization if your model is underfit will usually reduce performance. Consider training longer or adjusting learning rates if you observe this.

2.5.1. Early Stopping

The most commonly used and simplest form of regularization is early stopping. Early stopping means monitoring the loss on your validation data and stopping training once it begins to rise. Normally, training is done until converged – meaning the loss stops decreasing. Early stopping tries to prevent overfitting by looking at the loss on unseen data (validation data) and stopping once that begins to rise. This is an example of regularization because it limits how far the weights can move from their initial values. Just like in L2 regularization, we’re squeezing our trainable weights.
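
The stopping rule can be sketched as a small function over per-epoch validation losses. The `patience` argument is an assumption that goes slightly beyond the description above: rather than stopping at the very first rise, it waits a few epochs in case the rise is just noise. The losses are made-up numbers.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the best validation loss, stopping once the
    loss has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_loss, waited = 0, float('inf'), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break              # stop training here
    return best_epoch

# hypothetical validation losses, one per epoch
val_losses = [0.69, 0.66, 0.64, 0.65, 0.67, 0.70, 0.74]
print(early_stopping_epoch(val_losses))
```

In Keras, the same idea is available as the tf.keras.callbacks.EarlyStopping callback, which can be passed to model.fit.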

2.5.2. Weight

Weight regularization is the addition of terms to the loss that depend on the trainable weights in the model, like we saw before. These can be L2 (\(\sum w_i^2\), the squared L2 norm) or L1 (\(\sum \left|w_i\right|\)). You must choose the strength, which is expressed as a parameter (often denoted \(\lambda\)) that should be much less than \(1\). Typically values of \(0.1\) to \(1\times10^{-4}\) are chosen. This may be broken into kernel regularization, which affects the multiplicative weights in a dense or convolutional neural network, and bias regularization. Bias regularization is rarely seen in practice.
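
As a small worked example, here is how the penalty terms add to the loss. The weights, the data loss, and the strength are hypothetical numbers, and the L2 term uses the sum-of-squares form, which is the form the Keras 'l2' regularizer uses.

```python
import numpy as np

def l1_penalty(weights, strength):
    return strength * np.sum(np.abs(weights))

def l2_penalty(weights, strength):
    return strength * np.sum(weights ** 2)

w = np.array([0.5, -2.0, 0.0, 1.5])   # hypothetical kernel weights
data_loss = 0.40                       # hypothetical loss on a batch
lam = 1e-2                             # regularization strength (lambda)

total_l1 = data_loss + l1_penalty(w, lam)
total_l2 = data_loss + l2_penalty(w, lam)
print(total_l1, total_l2)
```

The L2 penalty is dominated by the largest weight, while L1 penalizes all nonzero weights in proportion to their magnitude.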

2.5.3. Activity

Activity regularization is the addition of terms to the loss that depend on the output from a layer. Activity regularization ultimately leads to minimizing weight magnitudes, but it makes the strength of that effect depend on the output from the layers. Weight regularization has the strongest effect on weights that have little effect on layer output, because they have no gradient if they have little effect on the output. In contrast, activity regularization has the strongest effect on weights that greatly affect layer output. Conceptually, weight regularization reduces weights that are unimportant but could harm generalization error if there is a shift in the type of features seen in testing. Activity regularization reduces weights that affect layer output and is more akin to early stopping by reducing how far those weights can move in training.

2.5.4. Batch Normalization

It is arguable if batch normalization is a regularization technique or itself a layer type. Batch normalization is a layer that is added to a neural network with trainable weights, but its trainable weights are not updated via gradient descent of the loss. Batch normalization has a layer equation of:

(2.16)\[\begin{equation} f(X) = \frac{X - \bar{X}}{S} \end{equation}\]

where \(\bar{X}\) and \(S\) are the sample mean and standard deviation taken across the batch axis (zeroth axis of \(X\)). This has the effect of “smoothing” out the magnitudes of values seen between batches. You may think this will have the same effect as randomizing your training data, so there is no batch to batch variation. Surprisingly, batch normalization is an effective layer and common even if there is no batch to batch variation. At inference time, \(\bar{X}\) and \(S\) are set to the average across all batches seen in training data, although this is not always the case.
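
A minimal NumPy sketch of the training-time computation is shown below, taking \(S\) as the sample standard deviation over the batch axis. The small `eps` constant is an assumption added to avoid dividing by zero, and the running averages used at inference time are left out.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    mean = x.mean(axis=0)                # average over the batch axis
    std = np.sqrt(x.var(axis=0) + eps)   # eps avoids division by zero
    return (x - mean) / std

# hypothetical batch of 16 examples with 4 features, far from zero mean / unit scale
x = np.random.default_rng(0).normal(loc=3.0, scale=10.0, size=(16, 4))
out = batch_norm(x)
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))
```

After normalization, each feature has approximately zero mean and unit standard deviation within the batch, regardless of its original scale.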

2.5.5. Dropout

The last regularization type is dropout. Like batch normalization, dropout is typically viewed as a layer, and it has no trainable parameters. In dropout, we randomly zero out specific elements of the input and then rescale the output so its average magnitude is unchanged. You can think of it like masking. There is a mask tensor \(M\) which contains 1s and 0s and is multiplied by the input. Then the output is multiplied by \(|M| / \sum M\) where \(|M|\) is the number of elements in \(M\). Dropout forces your neural network to learn to use different features or “pathways” by zeroing out elements. Weight regularization squeezes unused trainable weights through minimization. Dropout tries to force all trainable weights to be used by randomly zeroing out the features that other weights rely on. Dropout is more common than weight or activity regularization but has arguable theoretical merit. Some have proposed it is a kind of sampling mechanism for exploring model variations [GG16]. Despite appearing ad hoc, it is effective. Note that dropout is only used during training, not for inference. You need to choose the dropout rate when using it, another hyperparameter. Usually, you will want to choose a rate of 0.05-0.35. 0.2 is common.
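
The masking and rescaling can be sketched directly from the description above. This version rescales by the realized fraction of kept elements, \(|M| / \sum M\), as in the text; frameworks such as Keras instead scale by the expected value \(1 / (1 - \text{rate})\), which is nearly the same for large tensors.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate, rng):
    mask = (rng.random(x.shape) >= rate).astype(x.dtype)   # 1 = keep, 0 = drop
    scale = mask.size / mask.sum()                          # |M| / sum(M)
    return x * mask * scale, mask

x = np.ones((1000,))                 # hypothetical layer input
out, mask = dropout(x, rate=0.2, rng=rng)
print(mask.mean(), out.mean())       # fraction kept, average after rescaling
```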

2.6. Residues

One last “layer” note to mention is residues (more commonly called residual or skip connections). One of the classic problems in neural network training is vanishing gradients. If your neural network is deep and many features contribute to the label, you can have very small gradients during training that make it difficult to train. This is visible as underfitting. One way this can be addressed is through careful choice of optimization and learning rates. Another way is to add “residue” connections in the neural network. Residue connections are a fancy way of saying “adding” or “concatenating” later layers with early layers. The most common way to do this is:

(2.17)\[\begin{equation} X^{i + 1} = \sigma(W^iX^i + b^i) + X^i \end{equation}\]

This is the usual equation for a dense neural network, but we’ve added the previous layer output (\(X^i\)) to our output. Now when you take a gradient of earlier weights from layer \(i - 1\), they will appear through both the \(\sigma(W^iX^i + b^i)\) term via the chain rule and the \(X^i\) term. This goes around the activation \(\sigma\) and the effect of \(W^i\). Note this continues at all layers, and then a gradient can propagate back to earlier layers via either term. You can add the “residue” connection to the previous layer as shown here or go back even earlier. You can also be more complex and use a trainable function for how the residue term (\(X^i\)) is treated. For example:

(2.18)\[\begin{equation} X^{i + 1} = \sigma(W^iX^i + b^i) + W'^i X^i \end{equation}\]

where \(W'^i\) is a set of new trainable parameters. We have seen that there are many hyperparameters for tuning, and adjusting residue connections is one of the least effective things to adjust. So don’t expect much of an improvement. However, if you’re seeing underfitting and inefficient training, perhaps it’s worth investigating.
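
Both forms, Eq. (2.17) and Eq. (2.18), can be sketched as a NumPy forward pass. The feature dimension, the random weights, and the choice of tanh for \(\sigma\) are assumptions for illustration. The code uses the batched row convention `x @ W` rather than the \(W X\) ordering of the equations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # hypothetical feature dimension

def dense(x, W, b):
    return np.tanh(x @ W + b)           # sigma(W x + b), with tanh as sigma

def residual_dense(x, W, b, W_skip=None):
    # identity skip (Eq. 2.17) or trainable skip (Eq. 2.18)
    skip = x if W_skip is None else x @ W_skip
    return dense(x, W, b) + skip

x = rng.normal(size=(4, d))             # batch of 4 examples
W, b = rng.normal(size=(d, d)) * 0.1, np.zeros(d)
W_skip = rng.normal(size=(d, d)) * 0.1

out_identity = residual_dense(x, W, b)
out_trainable = residual_dense(x, W, b, W_skip)
print(out_identity.shape, out_trainable.shape)
```

Note that the identity form requires the layer output to have the same shape as its input, whereas the trainable form can also change the dimension through the shape of `W_skip`.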

2.7. Blocks

You can imagine that we might join a dense layer with dropout, batch normalization, and maybe a residue. When you group multiple layers together, this can be called a block for simplicity. For example, you might use the term “convolution block” to describe a sequence of convolution, pooling, and dropout layers.

2.8. Dropout Regularization Example

Now let’s try to add a few dropout layers to see if we can do better on our example above.

model = tf.keras.Sequential()

# make embedding and indicate that 0 should be treated specially
model.add(tf.keras.layers.Embedding(input_dim=21, output_dim=16, mask_zero=True, input_length=pos_data.shape[-1]))

# now we move to convolutions and pooling
# NOTE: Keras doesn't respect masking here
# I should switch to PyTorch.
model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=5, activation='relu'))
model.add(tf.keras.layers.MaxPooling1D(pool_size=4))

model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=3, activation='relu'))
model.add(tf.keras.layers.MaxPooling1D(pool_size=2))

model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=3,activation='relu'))
model.add(tf.keras.layers.MaxPooling1D(pool_size=2))

# now we flatten to move to hidden dense layers.
# Flattening just removes all axes except 1 (and implicit batch is still in there as always!)

model.add(tf.keras.layers.Flatten())

#HERE IS THE DROPOUT
model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.Dense(256, activation='relu'))
model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.Dense(64, activation='tanh'))
model.add(tf.keras.layers.Dropout(0.3))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.summary()
tf.keras.utils.plot_model(model)
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 200, 16)           336       
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 196, 16)           1296      
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 49, 16)            0         
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 47, 16)            784       
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 23, 16)            0         
_________________________________________________________________
conv1d_5 (Conv1D)            (None, 21, 16)            784       
_________________________________________________________________
max_pooling1d_5 (MaxPooling1 (None, 10, 16)            0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 160)               0         
_________________________________________________________________
dropout (Dropout)            (None, 160)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 256)               41216     
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 64)                16448     
_________________________________________________________________
dropout_2 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 65        
=================================================================
Total params: 60,929
Trainable params: 60,929
Non-trainable params: 0
_________________________________________________________________
../_images/layers_19_1.png
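The output shapes and parameter counts in the summary above can be checked by hand. The following is a minimal sketch in plain Python, assuming the Keras defaults for these layers (convolutions with 'valid' padding and stride 1, pooling with stride equal to the pool size) and the kernel and pool sizes used for this architecture in the chapter. The helper names are made up for illustration and are not part of Keras.

```python
def conv1d_valid(length, channels_in, filters, kernel_size):
    """Output length and parameter count of a 1D convolution ('valid' padding, stride 1)."""
    out_length = length - kernel_size + 1
    params = kernel_size * channels_in * filters + filters  # kernel weights + biases
    return out_length, params


def max_pool1d(length, pool_size):
    """Output length of 1D max pooling with stride equal to pool_size (leftovers are dropped)."""
    return length // pool_size


def dense(units_in, units_out):
    """Parameter count of a dense layer with bias."""
    return units_in * units_out + units_out


# values read from the summary: sequence length 200, embedding size 16
length, channels = 200, 16
total = 21 * 16  # embedding table: 21 possible indices, 16 values each

shapes = []
for kernel_size, pool_size in [(5, 4), (3, 2), (3, 2)]:
    length, params = conv1d_valid(length, channels, 16, kernel_size)
    total += params
    shapes.append(length)
    length = max_pool1d(length, pool_size)
    shapes.append(length)

flat = length * channels  # Flatten merges the length and channel axes
for units_in, units_out in [(flat, 256), (256, 64), (64, 1)]:
    total += dense(units_in, units_out)

print(shapes)       # sequence lengths after each conv / pool layer
print(flat, total)  # flattened size and total parameter count
```

The dropout, pooling, and flatten layers contribute no parameters, which is why they show 0 in the summary.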
model.compile('adam', loss='binary_crossentropy', metrics=['accuracy'])
result = model.fit(train_data, validation_data=val_data, epochs=10, verbose=0)
plt.plot(result.history['loss'], label='training')
plt.plot(result.history['val_loss'], label='validation')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
[Figure: training and validation loss versus epoch for the model with dropout]

We added a few dropout layers, and now the validation loss is a little better. Additional training will still cause it to rise, though. Feel free to try the other ideas above to see if you can get the validation loss to decrease like the training loss.
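To make concrete what a dropout layer does to its input, here is a minimal sketch in plain Python of "inverted" dropout, the variant Keras uses: during training each value is zeroed with probability `rate` and the survivors are scaled by 1 / (1 - rate), while at inference the input passes through unchanged. The function name, the rate of 0.3, and the test activations are assumptions chosen for illustration, not values taken from the model above.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout on a list of activations (illustrative, not the Keras code)."""
    if not training or rate == 0.0:
        # at inference time dropout is the identity
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    # zero each value with probability `rate`, scale the rest to keep the expected sum
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)
activations = [1.0] * 10000  # hypothetical activations
dropped = dropout(activations, rate=0.3, training=True, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(zero_fraction, mean_after)  # fraction zeroed is near the rate, mean stays near 1
```

Because the surviving values are scaled up, the expected activation is the same with and without dropout, so no extra rescaling is needed when the model is used for prediction.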

2.9. L2 Weight Regularization Example

Now we’ll demonstrate adding weight regularization.

model = tf.keras.Sequential()

# make embedding and indicate that 0 should be treated specially
model.add(tf.keras.layers.Embedding(input_dim=21, output_dim=16, mask_zero=True, input_length=pos_data.shape[-1]))

# now we move to convolutions and pooling
model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=5, activation='relu'))
model.add(tf.keras.layers.MaxPooling1D(pool_size=4))

model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=3, activation='relu'))
model.add(tf.keras.layers.MaxPooling1D(pool_size=2))

model.add(tf.keras.layers.Conv1D(filters=16, kernel_size=3, activation='relu'))
model.add(tf.keras.layers.MaxPooling1D(pool_size=2))

# now we flatten to move to hidden dense layers.
# Flattening collapses all non-batch axes into a single axis (the implicit batch axis is still there, as always!)

model.add(tf.keras.layers.Flatten())

# HERE IS THE REGULARIZATION:
# the string 'l2' selects L2 weight regularization with the Keras default strength (0.01)
model.add(tf.keras.layers.Dense(256, activation='relu', kernel_regularizer='l2'))
model.add(tf.keras.layers.Dense(64, activation='tanh', kernel_regularizer='l2'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))


model.compile('adam', loss='binary_crossentropy', metrics=['accuracy'])
result = model.fit(train_data, validation_data=val_data, epochs=10, verbose=0)

plt.plot(result.history['loss'], label='training')
plt.plot(result.history['val_loss'], label='validation')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
[Figure: training and validation loss versus epoch for the model with L2 weight regularization]

The L2 regularization appears to be too strong: it prevents the model from learning. You could go back and reduce the regularization strength, but this is just a demonstration.
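To get a feel for the regularization strength, recall that an L2 weight penalty adds strength × (sum of squared weights) to the loss. The string 'l2' uses the Keras default strength of 0.01; a smaller value can be set by passing, for example, `tf.keras.regularizers.L2(1e-4)` as the `kernel_regularizer`. The sketch below, in plain Python, estimates the penalty for a weight matrix the size of the first dense layer's kernel (160 × 256), assuming Glorot-uniform initial weights (the Keras default for dense layers). The strengths other than 0.01 are arbitrary choices for illustration.

```python
import math
import random


def l2_penalty(weights, strength):
    """L2 weight penalty: strength * sum of squared weights."""
    return strength * sum(w * w for w in weights)


# hypothetical kernel with the shape of the first dense layer
fan_in, fan_out = 160, 256
limit = math.sqrt(6.0 / (fan_in + fan_out))  # Glorot-uniform range
rng = random.Random(0)
weights = [rng.uniform(-limit, limit) for _ in range(fan_in * fan_out)]

for strength in (1e-2, 1e-3, 1e-4):
    print(strength, l2_penalty(weights, strength))
```

With the default strength, the penalty from this one layer at initialization is roughly 2, which is larger than the binary cross-entropy of a model that guesses 50/50 (ln 2 ≈ 0.69). This suggests why the optimizer may spend its effort shrinking weights rather than fitting the data, and why a strength a few orders of magnitude smaller could be a reasonable thing to try.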

2.10. Discussion

Designing and training neural networks is a complex task. The best approach is to always start simple and work your way up in complexity. Remember, you have to write correct code, create a competent model, and have clean data. If you start with a complex model it can be hard to discern if learning problems are due to bugs, the model, or the data. My advice is to always start with a pre-trained or simple baseline network from a previous paper. If you find yourself designing and training your own neural network, read through Andrej Karpathy’s excellent guide on how to approach this task.

2.11. Chapter Summary

  • Layers are created for specific tasks, and given the variety of layers, there are a vast number of permutations of layers in a deep neural network.

  • Convolution layers are used for data defined on a regular grid (such as images). In a convolution, one defines the size of the trainable parameters through the kernel shape.

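  • An invariance is when the output from a neural network is insensitive to a particular transformation of the input, such as translating or rotating an image.

  • An equivariance is when a transformation of the input produces the corresponding transformation of the output. For example, translating the input to a convolution layer translates its output by the same amount.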

  • Convolution layers are often paired with pooling layers. A pooling layer behaves similarly to a convolution layer, except a reduction is computed and the output is a smaller shape (same rank) than the input.

  • Embedding layers convert indices into vectors, and are typically used as pre-processing steps.

  • Hyperparameters are choices regarding the shape of the layers, the activation function, initialization parameters, and other layer arguments. They can be tuned but are not trained on the data.

  • Hyperparameters are usually tuned by hand because they can be continuous, categorical, or discrete variables and are not trained by gradient descent, although algorithms that tune hyperparameters automatically are an active area of research.

  • Tuning the hyperparameters can make training faster or require less training data.

  • A validation data set can be used to measure overfitting to the training data, and it is used to help choose the hyperparameters.

  • Regularization is an empirical technique used to change training procedures to prevent overfitting. There are five common types of regularization: early stopping, weight regularization, activity regularization, batch normalization, and dropout.

  • Vanishing gradient problems can be addressed by adding residual connections, which add the output of an earlier layer to the output of a later layer in the neural network.
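The invariance and equivariance points in the summary can be illustrated with a small plain-Python sketch. It applies a 1D convolution to a signal and to a shifted copy of that signal, then compares the results. To make the comparison exact, the sketch assumes wrap-around (circular) boundaries, which is not what a Keras convolution with 'valid' padding does at the edges. The signal, kernel, and function names are invented for illustration.

```python
def circular_conv1d(signal, kernel):
    """1D convolution (cross-correlation) with wrap-around boundaries."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def shift(values, steps):
    """Circularly shift a list to the right by `steps` positions."""
    steps %= len(values)
    return values[-steps:] + values[:-steps] if steps else list(values)


signal = [0, 1, 3, 2, 0, 0, 5, 0]  # hypothetical input
kernel = [1, -2, 1]                # hypothetical kernel

out = circular_conv1d(signal, kernel)
out_shifted = circular_conv1d(shift(signal, 3), kernel)

# equivariance: shifting the input shifts the convolution output the same way
equivariant = out_shifted == shift(out, 3)
# invariance: a global max over the output does not change when the input is shifted
invariant = max(out_shifted) == max(out)
print(equivariant, invariant)
```

The convolution alone is translation equivariant; following it with a global reduction such as a maximum makes the final result translation invariant.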

2.12. Cited References

NMB+18

Brady Neal, Sarthak Mittal, Aristide Baratin, Vinayak Tantia, Matthew Scicluna, Simon Lacoste-Julien, and Ioannis Mitliagkas. A modern take on the bias-variance tradeoff in neural networks. arXiv preprint arXiv:1810.08591, 2018.

GB10

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256. 2010.

CMGS10

Dan Claudiu Cireşan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010. doi:10.1162/NECO_a_00052.

FAL17

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.

ZL17

Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2017. URL: https://arxiv.org/abs/1611.01578.

LJD+18

Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: a novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52, 2018. URL: http://jmlr.org/papers/v18/16-558.html.

CSTR14

Catherine Ching Han Chang, Jiangning Song, Beng Ti Tey, and Ramakrishnan Nagasundara Ramanan. Bioinformatics approaches for improved recombinant protein production in Escherichia coli: protein solubility prediction. Briefings in Bioinformatics, 15(6):953–962, 2014.

LecunBottouBengioHaffner98

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

KRK+18

Sameer Khurana, Reda Rawi, Khalid Kunji, Gwo-Yu Chuang, Halima Bensmail, and Raghvendra Mall. DeepSol: a deep learning framework for sequence-based protein solubility prediction. Bioinformatics, 34(15):2605–2613, 2018. URL: https://doi.org/10.1093/bioinformatics/bty166, doi:10.1093/bioinformatics/bty166.

GG16

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, 1050–1059. 2016.