Background Protein-Ligand Binding Affinity Prediction with...
Pose Generation Effects
Paul Francoeur, David Ryan Koes
Department of Computational and Systems Biology, University of Pittsburgh
http://github.com/gnina/
Abstract
Virtual Screening is essential in the drug discovery process, as it reduces all of
chemical space (~10^60) down to a reasonable number of testable compounds (~10^3).
Our previous work, gnina, utilized convolutional neural networks to score
protein-ligand binding poses in order to determine if a ligand would bind the
protein. Because binding affinity depends on the ligand's pose, we reason that
there could be benefit to joint training on scoring the protein-ligand pose and
predicting the binding affinity of that pose. We present here an extension to gnina,
which simultaneously predicts a score for the protein-ligand complex and the
affinity of said complex. Additionally, we show the importance of proper training
data: training on docked poses, and testing on clustered cross-validated splits of
the training data, in order to obtain a model whose predictions are pose sensitive
and generalizable to unseen data.
Datasets
Smina docked and minimized poses are used for training.
Refined
PDBbind 2016 refined set
4057 complexes
69,780 ligand poses
Complete affinity data
Redocked
Subset of Cross-Docked
2923 distinct pockets
790,954 ligand poses
Affinity data for ~40%
Cross-Docked
Structures from Pocketome
2923 distinct pockets
22,767,152 non-redundant ligand poses
Affinity data for ~40% of ligands
Pose Sensitivity
Predicting Affinity Performance
Models
Def2017
Def2018
Training Protocol
Data Representation
24x24x24Å grid at 0.5Å resolution
14 ligand and 14 receptor atom types
Continuous Gaussian density
CUDA optimized grid generation
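The grid representation above can be sketched in a few lines of plain Python. This is a minimal illustration, not the CUDA implementation used by gnina: it assumes cell-centred voxels, a plain Gaussian with a hypothetical width `sigma` and a hypothetical `cutoff`, and atoms supplied as `(x, y, z, channel)` tuples. The function name `voxelize` and the sparse dictionary layout are choices made for this example only.

```python
import math

GRID_SIZE = 24.0     # Å per side (from the poster)
RESOLUTION = 0.5     # Å per voxel (from the poster)
N_VOXELS = int(round(GRID_SIZE / RESOLUTION))  # 48 per axis, assuming cell-centred voxels
N_CHANNELS = 14 + 14  # ligand + receptor atom types (from the poster)


def voxelize(atoms, center=(0.0, 0.0, 0.0), sigma=1.0, cutoff=3.0):
    """Return a sparse density grid {(channel, i, j, k): density}.

    atoms  : iterable of (x, y, z, channel), with 0 <= channel < N_CHANNELS
    sigma  : Gaussian width in Å (illustrative assumption)
    cutoff : distance in Å beyond which density is treated as zero (assumption)
    """
    # Coordinates of the centre of voxel (0, 0, 0).
    origin = [c - GRID_SIZE / 2 + RESOLUTION / 2 for c in center]
    grid = {}
    for x, y, z, channel in atoms:
        pos = (x, y, z)
        # Only visit voxels inside the cutoff box around the atom, clipped to the grid.
        ranges = []
        for axis in range(3):
            lo = math.floor((pos[axis] - cutoff - origin[axis]) / RESOLUTION)
            hi = math.ceil((pos[axis] + cutoff - origin[axis]) / RESOLUTION)
            ranges.append(range(max(lo, 0), min(hi, N_VOXELS - 1) + 1))
        for i in ranges[0]:
            for j in ranges[1]:
                for k in ranges[2]:
                    d2 = sum((origin[a] + idx * RESOLUTION - pos[a]) ** 2
                             for a, idx in enumerate((i, j, k)))
                    if d2 <= cutoff ** 2:
                        key = (channel, i, j, k)
                        # Densities from atoms of the same type add up.
                        grid[key] = grid.get(key, 0.0) + math.exp(-d2 / (2 * sigma ** 2))
    return grid


# One atom of (hypothetical) type 3 placed exactly on a voxel centre.
grid = voxelize([(0.25, 0.25, 0.25, 3)])
```

Under these assumptions a dense version of the input would hold 48 x 48 x 48 x 28 = 3,096,576 values per complex, which is one reason fast grid generation matters.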
Background
Importance of Good Training Data
Protein-Ligand Binding Affinity Prediction with GNINA
Protein-ligand scoring provides a metric of binding strength
between small molecules and target proteins, a critical
subroutine of structure-based drug design. An ideal scoring
function would correctly predict the binding affinity and
correctly identify an accurate ligand pose for the protein.
Convolutional neural networks are state-of-the-art in image recognition.
Convolutional layers apply a small non-linear kernel function iteratively across the
input to produce a feature map. More convolutions are then applied to these feature
maps to recognize higher order features in the input.
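As a rough illustration of a single convolutional step, the sketch below slides a small cubic kernel over a cubic single-channel volume and applies a ReLU non-linearity to produce a feature map. The function name `conv3d_relu`, the choice of ReLU, and the absence of padding, stride, and multiple channels are all simplifying assumptions for this example; they do not describe the actual gnina architectures.

```python
def conv3d_relu(volume, kernel, bias=0.0):
    """Valid 3D cross-correlation of a cubic volume with a cubic kernel, then ReLU.

    volume : n x n x n nested lists of floats
    kernel : k x k x k nested lists of floats, with k <= n
    Returns an (n-k+1)^3 feature map as nested lists.
    """
    n, k = len(volume), len(kernel)
    m = n - k + 1
    out = [[[0.0] * m for _ in range(m)] for _ in range(m)]
    for x in range(m):
        for y in range(m):
            for z in range(m):
                total = bias
                # Weighted sum over the k x k x k window anchored at (x, y, z).
                for i in range(k):
                    for j in range(k):
                        for l in range(k):
                            total += volume[x + i][y + j][z + l] * kernel[i][j][l]
                out[x][y][z] = max(0.0, total)  # ReLU non-linearity
    return out


ones3 = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
ones2 = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]
feature_map = conv3d_relu(ones3, ones2)
```

Feeding `feature_map` into another call with a different kernel mirrors how stacked layers build higher-order features from lower-order ones.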
Data augmentation is performed by
applying random rotations and
translations (±6Å) to protein-ligand
complex structures. This reduces
overfitting and compensates for the
coordinate-frame dependency of a
3D grid representation.
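The augmentation step can be sketched as a random rigid transform applied to all atoms of the complex at once. The sketch assumes the rotation is taken about the centroid of the supplied coordinates and that the translation is drawn uniformly and independently per axis within ±6 Å; the poster states only the ±6 Å range, so both details, along with the names `random_rotation_matrix` and `augment`, are assumptions of this example.

```python
import math
import random


def random_rotation_matrix(rng):
    """Random 3D rotation matrix built from a normalised random quaternion."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(v * v for v in q))
    w, x, y, z = (v / norm for v in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def augment(coords, max_translation=6.0, rng=random):
    """Rotate all atoms about their centroid, then shift them by a random offset.

    coords : list of (x, y, z) for protein and ligand atoms together, so the
             relative pose of the ligand in the pocket is unchanged.
    """
    n = len(coords)
    centroid = [sum(p[a] for p in coords) / n for a in range(3)]
    rot = random_rotation_matrix(rng)
    shift = [rng.uniform(-max_translation, max_translation) for _ in range(3)]
    out = []
    for p in coords:
        rel = [p[a] - centroid[a] for a in range(3)]
        out.append(tuple(
            sum(rot[r][c] * rel[c] for c in range(3)) + centroid[r] + shift[r]
            for r in range(3)
        ))
    return out


atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
augmented = augment(atoms, rng=random.Random(0))
```

Because the transform is rigid, every interatomic distance is preserved; only the orientation and position of the complex relative to the grid change.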
To extend our previous models, we now
perform joint training on the pose of a
complex with a logistic loss (classification)
and on its affinity with a mean squared
error (L2) loss (regression). Notably, poor
poses are penalized only for overpredicting
the affinity of the complex.
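A per-example version of this joint objective might look like the sketch below. It is an interpretation of the description above, not the training code itself: the function names, the `affinity_weight` parameter, and the handling of complexes without affinity data (`true_affinity=None` contributes no affinity term) are assumptions introduced for illustration.

```python
import math


def pose_loss(pose_logit, is_good_pose):
    """Logistic loss log(1 + exp(-y * s)) with y = +1 for good poses, -1 for poor ones."""
    z = -pose_logit if is_good_pose else pose_logit
    # Numerically stable form of log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def affinity_loss(pred_affinity, true_affinity, is_good_pose):
    """Squared error; poor poses are penalised only when the prediction is too high."""
    diff = pred_affinity - true_affinity
    if not is_good_pose and diff <= 0.0:
        return 0.0  # underpredicting the affinity of a poor pose is not penalised
    return diff * diff


def joint_loss(pose_logit, is_good_pose, pred_affinity,
               true_affinity=None, affinity_weight=1.0):
    """Sum of the classification and regression terms for one example."""
    loss = pose_loss(pose_logit, is_good_pose)
    if true_affinity is not None:  # assumption: skip the term when no affinity is known
        loss += affinity_weight * affinity_loss(pred_affinity, true_affinity, is_good_pose)
    return loss
```

For example, a poor pose with a predicted affinity of 5.0 against a true value of 7.0 incurs no affinity penalty, while the same pose predicted at 8.0 incurs a squared error of 1.0.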
Training on PDBbind refined-core and testing on the core set, as in previous
attempts at this task, yields overly optimistic results. Cross-validated training
and test sets give a better measure of generalizability.
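One way to build clustered cross-validation splits is to assign whole clusters of related complexes to folds, so that no cluster contributes to both training and testing. The sketch below assumes cluster labels are already available (for instance from a similarity clustering of the targets, which is not specified here); the identifiers, the greedy balancing strategy, and the default of three folds are hypothetical choices for this example.

```python
from collections import defaultdict


def clustered_folds(cluster_of, n_folds=3):
    """Greedily place whole clusters into folds, largest cluster first.

    cluster_of : dict mapping complex id -> cluster label
    Returns a list of n_folds lists of complex ids.
    """
    clusters = defaultdict(list)
    for item, label in cluster_of.items():
        clusters[label].append(item)
    folds = [[] for _ in range(n_folds)]
    ordered = sorted(clusters.items(), key=lambda kv: (-len(kv[1]), str(kv[0])))
    for _label, members in ordered:
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members)  # a cluster is never split across folds
    return folds


def train_test_splits(folds):
    """Yield (train, test) pairs, holding out one fold at a time."""
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Hypothetical complex ids and cluster labels.
cluster_of = {"c1": "A", "c2": "A", "c3": "B", "c4": "C", "c5": "C", "c6": "C"}
folds = clustered_folds(cluster_of)
splits = list(train_test_splits(folds))
```

Keeping clusters intact is what separates this from a random split: similar complexes cannot appear on both sides, so test performance is a less optimistic estimate of behaviour on unseen targets.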
Acknowledgements
This research was supported by R01GM108340 from the National Institute of General Medical Sciences and
contributions from aigrant.org, Google Cloud, NVIDIA Corporation, the University of Pittsburgh Center for
Simulation and Modeling, and the University of Pittsburgh Center for Research Computing.
We observe that, in general,
there is a left shift (i.e., more
negative correlations) when
jointly training on pose and
affinity, as expected.
Observe the inconsistent
performance drop when
crystal poses are removed
from the test set. It is
unclear if this is due to the
model detecting differences
between crystal and docked
poses, or simply a lack of
positive examples when
training. The lack of a drop
in affinity prediction
suggests pose information
is not being utilized.