Machine learning models, and deep learning models in particular, require large datasets for training. Because such datasets are rare in the drug design landscape, especially those containing protein-ligand complex information, we assess the use of in silico structural docking data for machine learning.

To this end, we perform template docking with the OpenEye software on a large kinase activity dataset (kinodata), following the complex generation pipeline developed in kinoml.

To assess the performance gain from using generated structural data, we extensively compare affinity prediction models that have access to the 3D complexes against various baseline models that do not have access to this structural information. Overall, we observe increased performance for the model trained on the docked complexes.

The dataset is also intended as a basis for the KinfragML project.

Software and resources