# 5. Input Data & Equivariances¶

Molecular graphs and structures (xyz coordinates) are the fundamental features in molecules and materials. As discussed, often these are converted into *molecular descriptors* or some other representation. Why is that? Why can we not work with the data directly? For example, let’s say we have a butane molecule and would like to predict its potential energy from its position. You could train a linear model \(\hat{E}\) that predicts energy

where \(\textbf{X}\) is \(14\) (atoms) \(\times\) \(3\) (xyz coordinates) matrix containing positions and \(\textbf{W}, b\) are trainable parameters. So far, this is all reasonable. Now what if I translate all the coordinates by -10:

We know the energy should not change if we translate all the coordinates equally – the molecule is not changing conformations. However, our linear regression will change by \( - 10 |\textbf{W}|\). We have accidentally made our model sensitive to the origin of our coordinate system, which is not physical. This is **translational variance** – our model changes when we translate the coordinates.

Consider another example from our butane molecule. What if we swapped the order of the atoms in our \(\textbf{X}\) matrix. There is no such thing as a “left” or “right” side of our molecule, so it should not matter. However, you’ll see that changing the order of \(\textbf{X}\) changes which weights are multiplied and thus the predicted energy will change. This is called a **permutation variance**. Our model changes if we re-order or inputs, even though from our knowledge of chemistry this should not matter. Similarly, our output energy should not be sensitive to a rotation of the molecular coordinates.

## 5.1. Equivariances¶

Models that work with molecules must be permutation equivariant. Permutation equivariant means that if you rearrange the order of atoms, the output changes in the same way. For example, if you’re predicting partial charge per atom \(\vec{y} = f(\textbf{X})\) and you rearrange \(\textbf{X}\), you expect the \(\vec{y}\) to rearrange to match that. Let’s try to state this with an equation. Consider \(\mathcal{P}_{34}\) to be the *permutation operator*. It swaps indices 3 and 4 in axis 0 of a tensor. Then a permutation equivariant model equation should have:

where \(i,j\) are indices of \(\textbf{X}\) (atom indices). This is a bit of a narrow definition, you can find a more general form in [TSK+18]. Now what happens if we output a scalar like energy? Then our permutation operator does nothing:

we call this case a **permutation invariance**.

Note

An invariance is special type of equivariance. If something is equivariant, you can easily make it invariant (e.g., averaging over your equivariant axes).

## 5.2. Equivariances of Coordinates¶

When we work with molecular coordinates as features we need to be a bit more careful in distinguishing between the “features” that might be element identity and those which specify the location in space. This kind of data is referred to as **point clouds** in computer science. Let’s break our features into \((\vec{r}, \vec{x})\) where \(\vec{r}\) is the location of the atom/point and \(\vec{x}\) are its features (e.g., element, charge, spin, etc.). Similarly, we may have labels at each point so that we write our labels as \((\vec{r}', \vec{y})\). The label might be direction \(\vec{r}'\) and force magnitude \(y\) or perhaps a field at \(\vec{r}'\) with vectors \(\vec{y}\). You may not have \(\vec{r}'\) like if you’re predicting energy. It may be that \(\vec{r} = \vec{r}'\). With this notation, we can write out **translation equivariance** as:

You can also write this out if you have a matrix of atom positions (like a molecule):

In the case that you do not have output coordinates \(\vec{r}'\), then it becomes:

which we call **translation invariance.** It’s important to note that these equations do not apply to some specific \(\vec{t}\), but any \(\vec{t}\).

**Rotational equivariance** can be similarly defined. Consider \(\mathcal{R}\) to be a rotation operator (e.g., a quaternion). Then our rotation equivariance equation is

an example might be again that \((\textbf{R}', \vec{y}_i)\) defines some field and our equivariance says that if we rotate our input points, our output points will obey the same rotation. Again, we can also have **rotation invariance** if our model does not output \(\textbf{R}'\).

## 5.3. Constructing Equivariant Models¶

There has been some work to unify equivariances into a single “layer” type so that you can just pick what equivariances you want like you would a hyperparameter [TSK+18][WGW+18]. However, these are still in their early stages and suffer from uneven performance and low speed. A recent review may be found in [Est20] and a popular implementation is available here. A recent application in molecules was in predicting dipole moments, a great example of where rotation invariance would fail because dipole moments should be affected when the molecular rotates[MGSNoe20].

Instead of using specific equivariant neural network architectures, you can use a few standard data transformations to make the model equivariant regardless of its architecture. Some of the common transforms are summarized in the table below and you can find a much more detailed discussion of data transformations in Musil et al. [MGB+21]. The list below omits data augmentation where you try to teach your model these equivariances through training. Alternatively, if your invariants are finite (e.g., only 90 degree rotations or you have small permutation invariant sets) you can just apply each possible transformation (rotation/permutation) and average the results to make a quick and easy invariant network[ZKR+17]. That is sometimes called **test augmentation**.

Method |
Equivariance |
---|---|

Matrix Determinant |
Permutation Invariance |

Eigendecomposition |
Permutation Invariance |

Reduction (sum, mean) |
Permutation Invariance |

Self-attention |
Permutation Equivariant |

Graph Neural Networks |
Permutation Equivariant |

Pairwise Distance/Vector |
Translation/Rotation Invariance |

Angles |
Translation/Rotation Invariance |

Convolutions |
Translation Equivariance |

Atom-centered Symmetry Functions |
Rotation/Translation Invariance |

Trajectory Alignment |
Rotation/Translation Invariance |

Molecular Descriptors |
All invariant |

### 5.3.1. Matrix Determinant¶

A matrix determinant is not quite permutation invariant. If you swap two rows in a matrix, it makes the determinant change signs. However, you can easily just square the determinant to remove the sign change. How would you use a determinant in a model? It’s just a building block. You could build a matrix of all neighbor features in a molecular graph and then do a determinant to arrive at a permutation invariant output. The determinant has two major disadvantages: (i) it outputs a single number (not that expressive) and (ii) it’s really expensive \(O(n^3)\). Thus you won’t see a determinant too frequently in deep learning.

### 5.3.2. Eigendecomposition¶

The eigendecomposition is when you factorize a matrix into its eignvalues and eignevectors. The (sorted) eignvalues of a matrix are permutation invariant. Compared with the determinant, you get more eigenvalues from a matrix than determinants so there is less loss of information. Computing eignvalues is still expensive at \(O(n^3)\) and they are not differentiable. Nevertheless, you will see this strategy in kernel learning where you do not need to propogate derivatives through the kernel. One important application of this is in some of the early work on quantum machine learning [RTMullerVL12].

### 5.3.3. Reductions¶

An obvious way to remove permutation variance is to just sum over the points or atoms. More generally, you can use any kind of reduction like a mean or product. Like a determinant, this results in a single number or at least removal of an axis. Reductions are fast and differentiable.

### 5.3.4. Self-attention¶

Attention with careful choice of keys/values and self-attention (where queries, keys, values are equal) are similar to reductions and indeed are permutation equivariant. They are not *invariant* because your query will be each point/atom and thus you should have one output for each atom/point. You must be careful though if you add weights to the self-attention, for example by multiplying the keys by a weight matrix. You can remove the equivariance by making unique weights for each point/atom index. Attention is differentiable and fast.

### 5.3.5. Graph Neural Networks¶

Graph neural networks (GNNs) are constructed to be permutation equivariant. Also, if using the general Battaglia formulation of GNNs [BHB+18] there is an additional graph level feature that is permutation invariant, making it possible to have either permutation equivariance or invariance.

### 5.3.6. Pairwise Distance¶

Using pairwise distances or vectors is the standard solution to translation invariance. Rarely are we actually that concerned with translational equivariance. If using pairwise vectors between atoms instead of the xyz coordinates, this naturally removes the choice of origin and thus makes the model translation invariant. If we go further and use pairwise distance instead of vectors, this also removes the effect of rotations giving a rotation invariance. This is fast, differentiable, and the usual approach to add translation and rotation invariance.

### 5.3.7. Angles¶

Angles are rotation, translation, and scale invariant. You can define angles any way, but often they are done between consecutive bonds (3 atoms total). Combining angles and pairwise distances (along bonds only) are called **internal coordinates.**

### 5.3.8. Convolutional Layers¶

Convolutional layers are well-known to be translationally equivariant. Remember that they work in 3D as well, so they can be an input layer for if translational equivariance is desired. However, convolutions work on pixels or voxels in 3D, so you must first bin your coordinates into a voxel grid. See Chew et al. for an example [CJZ+20].

### 5.3.9. Atom-centered Symmetry Functions¶

These are a large class of functions described by Behler[Beh11] that transform from the input coordinate/features \((\mathbf{R}, \mathbf{X})\) to a new set of features \(\mathbf{X}'\) that obey rotational and translational symmetry. This makes them translational and rotationally invariant. Behler didn’t propose a single function to get these features, but instead explored the choices and theory. Bartók et al. [BartokKCsanyi13] provided a specific recipe which is called a SOAP descriptor. These are drop-in replacements for \((\mathbf{R}, \mathbf{X})\) that are translation and rotation invariant but do not lose much information. They are differentiable, although complex to implement.

### 5.3.10. Trajectory Alignment¶

One special case is when your points are part of a time-dependent trajectory. This usually implies that the order of the points does not change. Thus, we do not need to worry about permutation invariance. This is the case when analyzing results from a molecular dynamics trajectory, like a dynamic simulation of a protein. One way to make our data translation and rotation invariant in this case is to align it to some reference coordinates. For example, you could always make the center of mass of the trajectory be at the origin. This will make the points translation invariant, because you always center a new set of points. A natural question is if it must be the center of mass? No, because your points are permutation invariant, you could just pick point 0 as the origin. Then if the points are translated, this will be undone when you move the points such that point 0 is the origin. To be more precise, we always apply a centering function \(c\left(\mathbf{R}\right)\) that redefines the origin before processing a set of points. We can also define a rotation function \(r\left(\mathbf{R}\right)\) that will be applied where we align to some definite set of three Euler angles (two vectors). See the example below for this.

### 5.3.11. Molecular Descriptors¶

Molecular descriptors are the classic way to convert molecules into translation/rotation/permutation invariant features. There exists 3D descriptors as well that can treat structure. They are also called **fingerprints**. We won’t focus on these in this course because they are untrainable and choosing the correct combination of descriptors is a unsolved problem which has no clear process.

## 5.4. Examples¶

Let’s demonstrate with some code how to go about creating functions that obey these equivariances. We won’t be training these models because training has no effect on equivariances

I’ll define my butane molecule as a set of coordinates \(\mathbf{R}_i\) and features \(\mathbf{X}_i\). My features are just one-hot vectors indicating if a particular point is a carbon atom \([1,0]\) or a hydrogen atom \([0,1]\). In our example, we will just be interested in predicting energy. We will not train our models, so the enegry will not be accurate.

## 5.5. Running This Notebook¶

Click the above to launch this page as an interactive Google Colab.

```
import numpy as np
R_i = np.array([
-0.5630, 0.5160, 0.0071,
0.5630, -0.5159, 0.0071,
-1.9293, -0.1506, -0.0071,
1.9294, 0.1505, -0.0071,
-0.4724, 1.1666, -0.8706,
-0.4825, 1.1551, 0.8940,
0.4825, -1.1551, 0.8940,
0.4723, -1.1665, -0.8706,
-2.0542, -0.7710, -0.9003,
-2.0651, -0.7856, 0.8742,
-2.7203, 0.6060, -0.0058,
2.0542, 0.7709, -0.9003,
2.7202, -0.6062, -0.0059,
2.0652, 0.7854, 0.8743
]).reshape(-1, 3)
N = R_i.shape[0]
X_i = np.zeros((N, 2))
X_i[:4, 0] = 1
X_i[4:, 1] = 1
```

### 5.5.1. No equivariances¶

A one-hidden layer dense neural network is an example of a model with no equivariances. To fit our data into this dense layer, we’ll just stack the positions and features into a large input tensor and output energy. We’ll use a tanh as activation, 16 hidden layer dimension, and our output layer has no activation because we’re doing regression to energy. Our weights will always be randomly initialized.

```
# our 1-hidden layer model
def hidden_model(r, x, w1, w2, b1, b2):
# stack into one large input
i = np.concatenate((r, x), axis=1).flatten()
v = np.tanh(i @ w1 + b1)
v = v @ w2 + b2
return v
# initialize our weights
w1 = np.random.normal(size=(N * 5, 16)) # 5 -> 3 xyz + 2 features
b1 = np.random.normal(size=(16,))
w2 = np.random.normal(size=(16,))
b2 = np.random.normal()
```

Let’s see what the predicted energy is with our coordinates.

```
hidden_model(R_i, X_i, w1, w2, b1, b2)
```

```
-8.25600129780722
```

This is not trained, so we aren’t that concerned about the value. Now let’s see if our model is sensitive to translation.

```
hidden_model(R_i + np.array([1, 2, 3]), X_i, w1, w2, b1, b2)
```

```
-5.577887693616731
```

As expected, it is sensitive to translation. I added the vector \((1, 2, 3)\) to all input points and the energy changed. This model is not translation invariant. The choice of \((1,2,3)\) is arbitrary, the model should not change output regardless of the choice of the translation vector if the model is translation invariant. Rotations can be done using the `scipy.transformation`

library which takes care of some of the messiness of working with quaternions, which are the operators that perform rotations.

```
import scipy.spatial.transform as trans
# rotate around x coord by 45 degrees
rot = trans.Rotation .from_euler('x', 45, degrees=True)
print('No rotation', hidden_model(R_i, X_i, w1, w2, b1, b2))
print('Rotated', hidden_model(rot.apply(R_i), X_i, w1, w2, b1, b2))
```

```
No rotation -8.25600129780722
Rotated -8.551914894325305
```

Our model is affected by the rotation, meaning it is not rotation invariant. Permutation invariance comes from swapping indices.

```
# swap 0, 1 rows
perm_R_i = np.copy(R_i)
perm_R_i[0], perm_R_i[1] = R_i[1], R_i[0]
# we do not need to swap X_i 0,1 because they are identical
print('original', hidden_model(R_i, X_i, w1, w2, b1, b2))
print('permuted', hidden_model(perm_R_i, X_i, w1, w2, b1, b2))
```

```
original -8.25600129780722
permuted -5.554268946535305
```

Our model is not permutation invariant!

### 5.5.2. Permutation Invariant¶

We will use a reduction to achieve permutation invariance. All that is needed is to ensure that weights are not a function of our atom number axis and then do a reduction (sum) prior to the output layer. Here is the implementation.

```
# our 1-hidden layer model with perm inv
def hidden_model_pi(r, x, w1, w2, b1, b2):
# stack into one large input
i = np.concatenate((r, x), axis=1)
v = np.tanh(i @ w1 + b1)
# reduction
v = np.sum(v, axis=0)
v = v @ w2 + b2
return v
# initialize our weights
w1 = np.random.normal(size=(5, 16)) # note it no lonegr has N!
b1 = np.random.normal(size=(16,))
w2 = np.random.normal(size=(16,))
b2 = np.random.normal()
```

We made three changes: we kept the atom axis (no more `flatten`

), we removed the atom axis after the hidden layer (`sum`

), and we made our weights not depend on the atom axis. Now let’s observe if this is indeed permutation invariant.

```
print('original', hidden_model_pi(R_i, X_i, w1, w2, b1, b2))
print('permuted', hidden_model_pi(perm_R_i, X_i, w1, w2, b1, b2))
```

```
original 30.669697848057446
permuted 30.669697848057446
```

It is!

### 5.5.3. Translation Invariant¶

The next change we will make is to convert our \(N\times3\) shaped coordinates into \(N\times N\times 3\) pairwise distance vectors. This gives us translation invariance. This causes an issue because our distance features went from being \(3\) per atom to \(N \times 3\) per atom. Thus we’ve introduced a dependence on atom number in our distance features and that means it’s easy to accidentally break our permutation invariance. We can just sum over this new axis though. Let’s see an implementation:

```
# our 1-hidden layer model with perm inv, trans inv
def hidden_model_pti(r, x, w1, w2, b1, b2):
# compute pairwise distances using broadcasting
d = r - r[:,np.newaxis]
# stack into one large input of N x N x 5
# concatenate doesn't broadcast, so I manually broadcast the Nx2 x matrix
# into N x N x 2
i = np.concatenate((d, np.broadcast_to(x, (d.shape[:-1] + x.shape[-1:]))), axis=-1)
v = np.tanh(i @ w1 + b1)
# reduction over both axes
v = np.sum(v, axis=(0,1))
v = v @ w2 + b2
return v
```

```
print('original', hidden_model_pti(R_i, X_i, w1, w2, b1, b2))
print('permuted', hidden_model_pti(perm_R_i, X_i, w1, w2, b1, b2))
print('translated', hidden_model_pti(R_i + np.array([-2, 3, 4]), X_i, w1, w2, b1, b2))
print('rotated', hidden_model_pti(rot.apply(R_i), X_i, w1, w2, b1, b2))
```

```
original 355.8549377572814
permuted 355.85493775728145
translated 355.85493775728145
rotated 345.7658165778177
```

It is now translation and permutation invariant. But not yet rotation invariant.

### 5.5.4. Rotation Invariant¶

It is a simple change to make it rotationally invariant. We just convert the pairwise distance vectors into pairwise distances. We’ll use squared distances for simplicity.

```
# our 1-hidden layer model with perm, trans, rot inv.
def hidden_model_ptri(r, x, w1, w2, b1, b2):
# compute pairwise distances using broadcasting
d = r - r[:,np.newaxis]
# x^2 + y^2 + z^2 of pairwise distance vectors
# keepdims so we get an N x N x 1 output
d2 = np.sum(d**2, axis=-1, keepdims=True)
# stack into one large input of N x N x 3
# concatenate doesn't broadcast, so I manually broadcast the Nx2 x matrix
# into N x N x 2
i = np.concatenate((d2, np.broadcast_to(x, (d2.shape[:-1] + x.shape[-1:]))), axis=-1)
v = np.tanh(i @ w1 + b1)
# reduction over both axes
v = np.sum(v, axis=(0,1))
v = v @ w2 + b2
return v
# initialize our weights
w1 = np.random.normal(size=(3, 16)) # now just 1 dist feature
b1 = np.random.normal(size=(16,))
w2 = np.random.normal(size=(16,))
b2 = np.random.normal()
# test it
print('original', hidden_model_ptri(R_i, X_i, w1, w2, b1, b2))
print('permuted', hidden_model_ptri(perm_R_i, X_i, w1, w2, b1, b2))
print('translated', hidden_model_ptri(R_i + np.array([-2, 3, 4]), X_i, w1, w2, b1, b2))
print('rotated', hidden_model_ptri(rot.apply(R_i), X_i, w1, w2, b1, b2))
```

```
original -226.96608366775686
permuted -226.96608366775683
translated -226.96608366775686
rotated -226.96608366775686
```

We have achieved our invariances! Remember that you could use different choices to achieve these invariances. Also, you may not want an invariance sometimes. You may want equivariances or are not concerned at all. For example, if you’re always working with one molecule you may never need to switch around atom orders.

Finally as a sanity check, let’s make sure that if we change the coordinates our predicted energy changes.

```
R_i[0] = 2.
print('changed', hidden_model_ptri(R_i, X_i, w1, w2, b1, b2))
```

```
changed -244.723501248453
```

Our model is still sensitive to the input features

## 5.6. Trajectory Alignment Example¶

As described above, if you’re working with a trajectory there is no requirement to have permutation equivariance. To achieve translation and rotation invariance, we can align to some fixed points that will be present in all structures (like center of mass). As we’ll see below, trajectory alignment is fraught with issues like rotation ambiguities, unphysical rotations, and creating fictitious covariances between far away points due to alignment. Internal coordinates (pairwise dist, angles) are almost always better. However, trajectory alignment has good scaling properties.

The movie below shows an example trajectory that we’ll examine. I’ve made it 2D to make it simple to visualize, but the same principles apply to 3D.

Let’s start by loading the trajectory and defining/testing our centering function. The trajectory is a tensor that is shape `time, point number, xy -> (2048, 12, 2)`

. I’ll examine two centering functions: align to center of mass and align to point 0 to see what the two look like.

```
# new imports
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('notebook')
sns.set_style('dark', {'xtick.bottom':True, 'ytick.left':True, 'xtick.color': '#666666', 'ytick.color': '#666666',
'axes.edgecolor': '#666666', 'axes.linewidth': 0.8 , 'figure.dpi': 300})
color_cycle = ['#1BBC9B', '#F06060', '#5C4B51', '#F3B562', '#6e5687']
mpl.rcParams['axes.prop_cycle'] = mpl.cycler(color=color_cycle)
```

```
#TODO change to github link
paths = np.load('../data/paths.npz')['arr']
# plot the first point
plt.title('First Frame')
plt.plot(paths[0, :, 0], paths[0, :, 1], 'o-')
plt.xticks([])
plt.yticks([])
plt.show()
```

```
def center_com(paths):
'''Align paths to COM at each frame'''
coms = np.mean(paths, axis=-2, keepdims=True)
return paths - coms
def center_point(paths):
'''Align paths to particle 0'''
return paths - paths[:, :1]
ccpaths = center_com(paths)
cppaths = center_point(paths)
```

To compare, we’ll draw a sample of frames on top of one another to see the now translationally invariant incoordinates.

```
fig, axs = plt.subplots(ncols=3, squeeze=True, figsize=(16, 4))
axs[0].set_title('No Center')
axs[1].set_title('COM Center')
axs[2].set_title('Point 0 Center')
cmap = plt.get_cmap('cool')
for i in range(0, 2048, 16):
axs[0].plot(paths[i, :, 0], paths[i, :, 1], '.-', alpha=0.2, color=cmap(i / 2048))
axs[1].plot(ccpaths[i, :, 0], ccpaths[i, :, 1], '.-', alpha=0.2, color=cmap(i / 2048))
axs[2].plot(cppaths[i, :, 0], cppaths[i, :, 1], '.-', alpha=0.2, color=cmap(i / 2048))
for i in range(3):
axs[i].set_xticks([])
axs[i].set_yticks([])
```

The color indicates time. You can see that having no alignment makes the spatial coordinates depend on time implicitly because the points drift over time. Both aligning to COM or point 0 removes this effect. The COM implicitly removes 2 degrees of freedom. Point 0 alignment makes point 0 have no variance (also remove degrees of freedom), which could affect how you design or interpret your model.

Now we will align by rotation. We need to define a unique rotation. A simple way is to choose 1 (or 2 in 3D) vectors that define our coordinate system directions. For example, we could choose that the vector from point 0 to point 1 defines the positive direction of the x-axis. A more sophisticated way is to find the principal axes of our points and align along these. For 2D, we only need to align to one of them. Again, this implicitly removes a degree of freedom. We will examine both. Computing principle axes requires an eigenvalue decomposition, so it’s a bit more numerically intense [FR00].

Warning

For all rotation alignment methods, you must have already centered the points.

```
def make_2drot(angle):
mats = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
# swap so batch axis is first
return np.swapaxes(mats, 0, -1)
def align_point(paths):
'''Align to 0-1 vector assuming 2D data'''
vecs = paths[:, 0, :] - paths[:, 1, :]
# find angle to rotate so these are pointed towards pos x
cur_angle = np.arctan2(vecs[:,1], vecs[:,0])
rot_angle = -cur_angle
rot_mat = make_2drot(rot_angle)
# to mat mult at each frame
return paths @ rot_mat
def find_principle_axis(points, naxis=2):
'''Compute single principle axis for points'''
inertia = points.T @ points
evals, evecs = np.linalg.eig(inertia)
order = np.argsort(evals)[::-1]
# return largest vectors
return evecs[:, order[:naxis]].T
def align_principle(paths, axis_finder=find_principle_axis):
# someone should tell me how to vectorize this in numpy
vecs = [axis_finder(p) for p in paths]
vecs = np.array(vecs)
# find angle to rotate so these are pointed towards pos x
cur_angle = np.arctan2(vecs[:,0,1], vecs[:,0,0])
cross = np.cross(vecs[:,0], vecs[:,1])
rot_angle = -cur_angle - (cross < 0) * np.pi
rot_mat = make_2drot(rot_angle)
# rotate at each frame
rpaths = paths @ rot_mat
return rpaths
appaths = align_point(cppaths)
apapaths = align_principle(ccpaths)
```

```
fig, axs = plt.subplots(ncols=3, squeeze=True, figsize=(16, 4))
axs[0].set_title('No Align')
axs[1].set_title('COM Align')
axs[2].set_title('Point 0 Align')
for i in range(0, 2048, 16):
axs[0].plot(paths[i, :, 0], paths[i, :, 1], '.-', alpha=0.2, color=cmap(i / 2048))
axs[1].plot(apapaths[i, :, 0], apapaths[i, :, 1], '.-', alpha=0.2, color=cmap(i / 2048))
axs[2].plot(appaths[i, :, 0], appaths[i, :, 1], '.-', alpha=0.2, color=cmap(i / 2048))
for i in range(3):
axs[i].set_xticks([])
axs[i].set_yticks([])
```

You can see how points far away on the chain from 0 have much more variance in the point 0 align, whereas the COM alignment looks better spread. Remember, to apply these methods you must do them to your both your training data and any prediction points. Thus, they should be viewed as part of your neural network. We can now check that rotating has no effect on these. The plots below have the trajectory rotated by 1 radian and you can see that both alignment methods have no change (the lines are overlapping).

```
rot_paths = paths @ make_2drot(1)
rot_appaths = align_point(center_point(rot_paths))
rot_apapaths = align_principle(center_com(rot_paths))
fig, axs = plt.subplots(ncols=3, squeeze=True, figsize=(16, 4))
axs[0].set_title('No Align')
axs[1].set_title('COM Align')
axs[2].set_title('Point 0 Align')
for i in range(0, 500, 20):
axs[0].plot(paths[i, :, 0], paths[i, :, 1], '.-', alpha=1, color='C1')
axs[1].plot(apapaths[i, :, 0], apapaths[i, :, 1], '.-', alpha=1, color='C1')
axs[2].plot(appaths[i, :, 0], appaths[i, :, 1], '.-', alpha=1, color='C1')
axs[0].plot(rot_paths[i, :, 0], rot_paths[i, :, 1], '.-', alpha=0.2, color='C2')
axs[1].plot(rot_apapaths[i, :, 0], rot_apapaths[i, :, 1], '.-', alpha=0.2, color='C2')
axs[2].plot(rot_appaths[i, :, 0], rot_appaths[i, :, 1], '.-', alpha=0.2, color='C2')
# plot again to get handles
axs[0].plot(np.nan, np.nan, '.-', alpha=1, color='C2', label='rotated')
axs[0].plot(np.nan, np.nan, '.-', alpha=1, color='C1', label='non-rotated')
axs[0].legend()
for i in range(3):
axs[i].set_xticks([])
axs[i].set_yticks([])
```

Now which method is better? Aligning based on arbitrary points is indeed easier, but it creates an unusual new variance in your features. For example, let’s see what happens if we make a small perturbation to one conformation. The code is hidden for simplicity. We try changing point 1, then point 0, then point 11 to see the effects of perturbations along the chain.

```
NP = 16
def perturb_paths(perturb_point):
perturbation = np.zeros_like(paths[:NP])
perturbation[:, perturb_point, 0] = np.linspace(0,0.2, NP)
test_paths = paths[0:1] - perturbation
# compute aligned trajs
appaths = align_point(center_point(test_paths))
apapaths = align_principle(center_com(test_paths))
fig, axs = plt.subplots(ncols=3, squeeze=True, figsize=(16, 4))
axs[0].set_title(f'Perturb {perturb_point} - No Align')
axs[1].set_title(f'Perturb {perturb_point} - COM Align')
axs[2].set_title(f'Perturb {perturb_point} - Point 0 Align')
for i in range(NP):
axs[0].plot(test_paths[i, :, 0], test_paths[i, :, 1], '.-', alpha=0.2, color=cmap(i / NP))
axs[1].plot(apapaths[i, :, 0], apapaths[i, :, 1], '.-', alpha=0.2, color=cmap(i / NP))
axs[2].plot(appaths[i, :, 0], appaths[i, :, 1], '.-', alpha=0.2, color=cmap(i / NP))
for i in range(3):
axs[i].set_xticks([])
axs[i].set_yticks([])
perturb_paths(0)
perturb_paths(1)
perturb_paths(11)
```

As you can see, perturbing one point alters all others after alignment. This makes these transformed features sensitive to noise, especially aligning to point 0 or 1. More importantly, this effect is uneven in the alignment to point 0. This can in-turn make training quite difficult. Of course, neural networks are universal approximators so in theory this should not matter. However, I expect that using the COM alignment approach will give better training because the network will not need to account for this unusual variance structure.

The final analysis shows a video of the two frames. One thing you’ll note are the jumps, when the principle axes swap direction. You can see that the ambiguity caused by these can create artifacts.

### 5.6.1. Using Unsupervised Methods for Alignment¶

There are additional methods for aligning trajectories. You could define one frame as the “reference” and find the translation and rotations that best align with that reference. This could give some interpretability to your rotation and translation alignment. A tempting option is to use dimensionality reduction (like PCA), but these are not rotation invariant. This is confusing at first, because remember PCA should remove translation and rotation. It removes it though from the training data and not an arbitrary frame because it examines motion along a trajectory. You can easily see this getting principle components and then trying to align a new frame to them and a rotated version of the new frame. You’ll get different results. Another important consideration is if the unsupervised method can handle new data. Mannifold embeddings do not provide a linear transform that can handle new data for inference. Mannifold embeddings are also not necessarily rotation invariant or equivariant.

## 5.7. Distance Features¶

One more topic on parsing data is treating distances. As we saw above, pairwise distance is a wise transformation because it is translation and rotation invariant. However, we may want to sometimes transform this further. One obvious choice is to use \(1 / r\) as the input to our neural network. This is because most properties of atoms (and thus molecules) are affected by their closest nearby atoms. An oxygen nearby a carbon is much more important than an oxygen that is 100 nanometers away. Choosing \(1 / r\) as an input makes it easier for a neural network to train because it encodes this physical insight about local interactions being the most important. Of course, a neural network could learn to turn \(r\) into \(1/r\) because they are universal approximators. Yet this approach means we do not need to waste training data and weights on learning to change \(r\) into \(1 / r\).

Another detail on distances is that we often want to “featurize” them; we’d like to go from one a single number like \(r = 3.2\) to a vector of reals. Why? Well that’s just how neural networks learn. Hidden-layers need to have more than 1 dimension to be expressive enough to model any function. This seems like an obvious point though: if you used \(r\) in a neural network it would obviously get larger as it goes through hidden layers. However, there are a few “standard” ways that people like to do this process. There are valid reasons, like making it smoothly differentiable, that you might choose one of these special “featurizing” functions.

### 5.7.1. Repeating¶

The first approach is to just repeat \(r\) up to the desired hidden dimension. This is representable as a dense neural network with no activation. Repeating \(r\) is just sometimes mentioned as a way to featurize distances.

### 5.7.2. Binning¶

As explored in Gilmer et al. [GSR+17] and others, you can bin your distances to be a one-hot vector. Essentially, you histogram \(r\) into fixed bins so that you only have one bin being “hot”. Each bin will represent a segment of positions (e.g., 4.5-5.0). This has discontinuous derivatives wrt distance and rarely used, though discussed frequently.

### 5.7.3. Radial Basis Functions¶

I’ve seen these used in a number of papers and I’m not sure who to credit for it. Radial basis functions use the following equation:

where \(d_i\) is an equally spaced vector of distances (e.g., \([1, 2, 3, 4]\)) and \(w\) is a trainable (or hyper) parameter. This computes a Gaussian kernel between \(r\) and all distances \(d_i\). What is nice about this expression is that has smooth well-behaved derivatives with respect to \(r\) and no trainable parameters. You can represent this with a dense layer and a softmax activation, but that is slightly more complex and has trainable parameters.

### 5.7.4. Sub NN¶

Another strategy used in Gilmer et al. [GSR+17] is to just put your distances through a series of dense layers prior to get features. For example, if you’re going to use the distance in a graph neural network you could run it through three dense layers first to get a larger feature dimension. Remember that repeating and radial basis functions are equivalent to dense layers (depending on activation), so this strategy can be a simple solution to the above choice.

## 5.9. Chapter Summary¶

Machine learning models that work with molecules must be permutation invariant, such that if the atoms are rearranged, the output will not change.

Translational invariance of molecular coordinates is when the coordinates are shifted and the resulting output does not change.

Rotational invariance is similar, except the molecular coordinates are rotated.

Data augmentation is when you try to teach your model the various types of equivariances by rotating and translating your training data to create additional examples.

There are various techniques, such as eigendecomposition or pairwise distance to make molecular coordinates invariant.

A one-hidden layer dense neural network is an example of a model with no equivariances.

You can try alignment for trajectories, where each training example has the same ordering and number of atoms.

## 5.10. Cited References¶

- BP97
Shumeet Baluja and Dean A. Pomerleau. Expectation-based selective attention for visual monitoring and control of a robot vehicle.

*Robotics and Autonomous Systems*, 22(3):329 – 344, 1997. Robot Learning: The New Wave. URL: http://www.sciencedirect.com/science/article/pii/S0921889097000468, doi:https://doi.org/10.1016/S0921-8890(97)00046-8.- BartokKCsanyi13
Albert P. Bartók, Risi Kondor, and Gábor Csányi. On representing chemical environments.

*Phys. Rev. B*, 87:184115, May 2013. URL: https://link.aps.org/doi/10.1103/PhysRevB.87.184115, doi:10.1103/PhysRevB.87.184115.- BHB+18
Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, and others. Relational inductive biases, deep learning, and graph networks.

*arXiv preprint arXiv:1806.01261*, 2018.- BSS+21
Simon Batzner, Tess E. Smidt, Lixin Sun, Jonathan P. Mailoa, Mordechai Kornbluth, Nicola Molinari, and Boris Kozinsky. Se(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials.

*arXiv preprint arXiv:2101.03164*, 2021.- Beh11
Jörg Behler. Atom-centered symmetry functions for constructing high-dimensional neural network potentials.

*The Journal of chemical physics*, 134(7):074106, 2011.- CJZ+20
Alex K Chew, Shengli Jiang, Weiqi Zhang, Victor M Zavala, and Reid C Van Lehn. Fast predictions of liquid-phase acid-catalyzed reaction rates using molecular dynamics simulations and convolutional neural networks.

*Chemical Science*, 11(46):12464–12476, 2020.- CW16
Taco Cohen and Max Welling. Group equivariant convolutional networks. In

*International conference on machine learning*, 2990–2999. 2016.- CGW19
Taco S Cohen, Mario Geiger, and Maurice Weiler. A general theory of equivariant cnns on homogeneous spaces.

*Advances in neural information processing systems*, 32:9145–9156, 2019.- Est20
Carlos Esteves. Theoretical aspects of group equivariant neural networks.

*arXiv preprint arXiv:2004.05154*, 2020.- FSIW20
Marc Finzi, Samuel Stanton, Pavel Izmailov, and Andrew Gordon Wilson. Generalizing convolutional neural networks for equivariance to lie groups on arbitrary continuous data.

*arXiv preprint arXiv:2002.12880*, 2020.- FR00
Jefferson Foote and Anandi Raman. A relation between the principal axes of inertia and ligand binding.

*Proceedings of the National Academy of Sciences*, 97(3):978–983, 2000.- GSR+17(1,2)
Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry.

*arXiv preprint arXiv:1704.01212*, 2017.- KGrossGunnemann20
Johannes Klicpera, Janek Groß, and Stephan Günnemann. Directional message passing for molecular graphs.

*arXiv preprint arXiv:2003.03123*, 2020.- KT18
Risi Kondor and Shubhendu Trivedi. On the generalization of equivariance and convolution in neural networks to the action of compact groups. In

*International Conference on Machine Learning*, 2747–2755. 2018.- LPM15
Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation.

*arXiv preprint arXiv:1508.04025*, 2015.- MRST19
Emile Mathieu, Tom Rainforth, N Siddharth, and Yee Whye Teh. Disentangling disentanglement in variational autoencoders. In

*International Conference on Machine Learning*, 4402–4412. PMLR, 2019.- MDM+20
Łukasz Maziarka, Tomasz Danel, Sławomir Mucha, Krzysztof Rataj, Jacek Tabor, and Stanisław Jastrzębski. Molecule attention transformer.

*arXiv preprint arXiv:2002.08264*, 2020.- MGSNoe20
Benjamin Kurt Miller, Mario Geiger, Tess E Smidt, and Frank Noé. Relevance of rotationally equivariant convolutions for predicting molecular properties.

*arXiv preprint arXiv:2008.08461*, 2020.- MGB+21
Felix Musil, Andrea Grisafi, Albert P. Bartók, Christoph Ortner, Gábor Csányi, and Michele Ceriotti. Physics-inspired structural representations for molecules and materials.

*arXiv preprint arXiv:2101.04673*, 2021.- RSPoczos17
Siamak Ravanbakhsh, Jeff Schneider, and Barnabás Póczos. Equivariance through parameter-sharing. In

*International Conference on Machine Learning*, 2892–2901. 2017.- RBTH20
David W Romero, Erik J Bekkers, Jakub M Tomczak, and Mark Hoogendoorn. Attentive group equivariant convolutional networks.

*arXiv*, pages arXiv–2002, 2020.- RTMullerVL12
Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O Anatole Von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning.

*Physical review letters*, 108(5):058301, 2012.- Ser77
Jean-Pierre Serre.

*Linear representations of finite groups*. Volume 42. Springer, 1977.- TSK+18(1,2)
Nathaniel Thomas, Tess Smidt, Steven Kearnes, Lusann Yang, Li Li, Kai Kohlhoff, and Patrick Riley. Tensor field networks: rotation-and translation-equivariant neural networks for 3d point clouds.

*arXiv preprint arXiv:1802.08219*, 2018.- TG80
Anne M Treisman and Garry Gelade. A feature-integration theory of attention.

*Cognitive psychology*, 12(1):97–136, 1980.- VSP+17
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In

*Advances in neural information processing systems*, 5998–6008. 2017.- WAR20
Renhao Wang, Marjan Albooyeh, and Siamak Ravanbakhsh. Equivariant maps for hierarchical structures.

*arXiv preprint arXiv:2006.03627*, 2020.- WGW+18
Maurice Weiler, Mario Geiger, Max Welling, Wouter Boomsma, and Taco S Cohen. 3d steerable cnns: learning rotationally equivariant features in volumetric data. In

*Advances in Neural Information Processing Systems*, 10381–10392. 2018.- WNoeC21
Robin Winter, Frank Noé, and Djork-Arné Clevert. Auto-encoding molecular conformations.

*arXiv preprint arXiv:2101.01618*, 2021.- ZKR+17
Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. In

*Advances in neural information processing systems*, 3391–3401. 2017.- Zee16
Anthony Zee.

*Group theory in a nutshell for physicists*. Princeton University Press, 2016.- ZSX+18
Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. Gaan: gated attention networks for learning on large and spatiotemporal graphs.

*arXiv preprint arXiv:1803.07294*, 2018.