Predicting Molecule-Protein Interactions with Graph Convolutional Networks

I recently caught up with someone I knew from high school. As is societal convention, we decided to reconnect by teaming up for a Kaggle competition where the goal is to predict molecule-protein binding affinities. I’ll admit that I know quite little about biological systems and interactions but it seemed fun so I decided to give it a go. And hey, maybe this will encourage me to actually post on my blog more often.

The goal is to build an ML model to predict the binding affinity between small molecules and protein targets. Binding affinity simply refers to the strength of the binding interaction between two entities. Predicting affinity is a crucial part of drug discovery because molecules that bind strongly can be used to enhance or inhibit the function of a target protein. The dataset I am working with (provided generously by Leash Biosciences) consists of 133M molecules and their interactions with three protein targets. The dataset is called the Big Encoded Library for Chemical Assessment (BELKA), and its data was gathered using DNA-encoded chemical library technology. In this notebook, we will explore training a Graph Convolutional Network. Let’s get started!

The first step is to set up the notebook. I am using networkx to build the graphs that we feed into PyTorch Geometric. It should be noted that this notebook is hosted on Kaggle, since the dataset is already available there.
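For reference, the setup cell looks roughly like this (a sketch - the actual notebook may import more or in a different order):

```python
import pandas as pd
import networkx as nx
import torch
import torch.nn as nn
import torch.nn.functional as F
from rdkit import Chem
from sklearn.preprocessing import OneHotEncoder
from torch_geometric.loader import DataLoader
from torch_geometric.nn import GCNConv, global_mean_pool
from torch_geometric.utils import from_networkx
```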

We import the data and create an initial dataframe. I’m dropping the building block columns of the dataframe. From my limited understanding, the building blocks are the fragments that are combined to form the full SMILES string. I decided to try the full string for now - I will probably test training on each building block later, but the priority is to build an initial model.
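In code, that looks something like the following. The file path and column names follow the competition schema as I remember it, so treat them as assumptions:

```python
# The BELKA training set ships as parquet on Kaggle.
df = pd.read_parquet("/kaggle/input/leash-BELKA/train.parquet")

# Drop the per-building-block SMILES columns and keep only the full molecule.
df = df.drop(columns=[
    "buildingblock1_smiles",
    "buildingblock2_smiles",
    "buildingblock3_smiles",
])
```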

Here is what our dataframe looks like now.

Now that we have our dataframe, let’s perform some data pre-processing. Since a SMILES entry is just a string, we need to convert it into a molecular graph. We use RDKit, a cheminformatics Python library, to parse the strings.
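A minimal version of that conversion, going RDKit → networkx → PyTorch Geometric, might look like this. I’m using the atomic number as the only node feature to keep the sketch short; a real feature set would be richer:

```python
def smiles_to_graph(smiles: str):
    """Parse a SMILES string and return a PyTorch Geometric graph, or None."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # RDKit could not parse this string

    g = nx.Graph()
    # One node per atom; the atomic number serves as a minimal node feature.
    for atom in mol.GetAtoms():
        g.add_node(atom.GetIdx(), x=[float(atom.GetAtomicNum())])
    # One undirected edge per bond.
    for bond in mol.GetBonds():
        g.add_edge(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx())

    # from_networkx collects the 'x' attribute into a [num_atoms, 1] tensor
    # and expands each undirected edge into both directions of edge_index.
    return from_networkx(g)
```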

Now that the SMILES data has been converted, I need to add the protein data. For now, I am going to keep it simple and add the protein by name only, using OneHotEncoder to encode that information as a matrix. We have 3 proteins (HSA, sEH, and BRD4), each represented as a one-hot vector like [1, 0, 0]. In the future, I might test adding the full SMILES string of the proteins or performing docking simulations to find where the molecules and proteins bind. My friend suggested this and I really want to try it because it makes sense, but it is computationally expensive and we want to experiment with a basic model first. Once we have the proteins encoded, we normalize our data, combine everything into graph objects, and collect them into a list for our model.
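Sketched out, that step could look like this. The column names "protein_name" and "binds" are assumptions from the competition schema, and feature normalization is omitted for brevity:

```python
# One-hot encode the three protein names, e.g. BRD4 -> [1, 0, 0].
encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on older scikit-learn
protein_onehot = encoder.fit_transform(df[["protein_name"]])

# Attach the protein vector and the binary label to each molecular graph.
data_list = []
for (_, row), protein_vec in zip(df.iterrows(), protein_onehot):
    graph = smiles_to_graph(row["molecule_smiles"])
    if graph is None:
        continue  # skip unparseable SMILES
    graph.protein = torch.tensor(protein_vec, dtype=torch.float).unsqueeze(0)
    graph.y = torch.tensor([row["binds"]], dtype=torch.float)
    data_list.append(graph)
```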

Finally, we take 80% of our data for training. Let’s get started with our model!
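With the list in hand, the split and batching are straightforward (the batch size here is a placeholder):

```python
# 80/20 split of the graph list, then PyG DataLoaders handle batching.
split = int(0.8 * len(data_list))
train_loader = DataLoader(data_list[:split], batch_size=64, shuffle=True)
test_loader = DataLoader(data_list[split:], batch_size=64)
```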

In our model, I started by defining three graph convolutional layers. These layers process the graph data by taking in node features and edge indices and computing updated node features. I added batch normalization layers to help stabilize the training process. Then I added two fully connected layers to combine the graph features with the protein features. An output layer, "torch.nn.Linear(256, 1)", produces the final output, and a dropout layer helps prevent overfitting.
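Here is a sketch of that architecture. Apart from the final Linear(256, 1) mentioned above, the layer widths are my guesses, since the post doesn’t list them:

```python
class GCN(nn.Module):
    """Three GCNConv layers with batch norm, two fully connected layers
    mixing graph and protein features, and a final Linear(256, 1) output.
    Hidden sizes are illustrative."""

    def __init__(self, num_node_features: int, num_proteins: int = 3):
        super().__init__()
        # Three graph convolution layers that update node features.
        self.conv1 = GCNConv(num_node_features, 64)
        self.conv2 = GCNConv(64, 128)
        self.conv3 = GCNConv(128, 256)

        # Batch normalization to stabilize training.
        self.bn1 = nn.BatchNorm1d(64)
        self.bn2 = nn.BatchNorm1d(128)
        self.bn3 = nn.BatchNorm1d(256)

        # Fully connected layers that mix the pooled graph embedding
        # with the one-hot protein vector.
        self.fc1 = nn.Linear(256 + num_proteins, 512)
        self.fc2 = nn.Linear(512, 256)
        self.out = nn.Linear(256, 1)

        self.dropout = nn.Dropout(p=0.5)
```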

After that, I defined a forward pass method to determine how input flows through the network. I first extract the input data: x, edge_index, batch, and protein. The node features and edge indices are then passed through each convolutional layer to compute updated node features. I chose LeakyReLU because ReLU caused issues when I ran it: while debugging, I saw plenty of evidence that neurons were dying, and LeakyReLU avoids that by scaling negative activations by a small slope instead of zeroing them out. I then perform global mean pooling, averaging the node features into a single graph-level feature vector. Finally, I concatenate the graph and protein features, pass them through the fully connected layers, and compute the output logits.
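Continuing the class above, the forward pass might look like:

```python
    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch

        # Message passing: each conv layer updates node features from neighbors.
        # LeakyReLU keeps a small gradient on negative activations, so neurons
        # don't die the way they can with plain ReLU.
        x = F.leaky_relu(self.bn1(self.conv1(x, edge_index)))
        x = F.leaky_relu(self.bn2(self.conv2(x, edge_index)))
        x = F.leaky_relu(self.bn3(self.conv3(x, edge_index)))

        # Global mean pooling: average node features into one vector per graph.
        x = global_mean_pool(x, batch)

        # Concatenate the graph embedding with the one-hot protein vector.
        x = torch.cat([x, data.protein], dim=1)

        x = F.leaky_relu(self.fc1(x))
        x = self.dropout(x)
        x = F.leaky_relu(self.fc2(x))
        return self.out(x)  # raw logits
```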

Finally, we ran the model and tested it against a few samples!
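For completeness, here is a bare-bones training and spot-check loop. The epoch count, learning rate, and BCE-with-logits loss are my assumptions for the binary binds label:

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GCN(num_node_features=1).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()  # the model outputs raw logits

model.train()
for epoch in range(5):
    total_loss = 0.0
    for batch in train_loader:
        batch = batch.to(device)
        optimizer.zero_grad()
        loss = criterion(model(batch).squeeze(-1), batch.y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"epoch {epoch}: mean loss {total_loss / len(train_loader):.4f}")

# Spot-check a handful of held-out samples.
model.eval()
with torch.no_grad():
    batch = next(iter(test_loader)).to(device)
    probs = torch.sigmoid(model(batch).squeeze(-1))
    print(probs[:5], batch.y[:5])
```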

Unfortunately, we are currently facing some issues with the output - the model’s predictions cluster in a very narrow range. I verified that the dataset was well balanced and that the preprocessing was done correctly, so I think the next step is to tune the hyperparameters and reevaluate the data normalization to see if we get better results. I’ll keep working on this in the meantime - I’ll post an update when I solve this issue.
