Deep learning requires large amounts of data, which remain scarce for physicochemical and bioactivity measurements. Here, we exploit the fact that one compound can be represented by multiple SMILES strings as a means of data augmentation, and we explore several augmentation techniques. The best strategies lead to the Maxsmi models, the models that maximize performance under SMILES augmentation. These models are trained on four data sets, including experimental solubility, lipophilicity, and bioactivity measurements, and are available for prediction on novel compounds.
Moreover, the uncertainty of the models is assessed by applying augmentation to the test set. Our results show that data augmentation improves accuracy regardless of the deep learning model and of the size of the data set.
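As a minimal sketch of how test-set augmentation might yield an uncertainty estimate, the snippet below predicts a property for several SMILES strings of the same compound and summarizes the predictions by their mean and standard deviation. The aggregation rule (mean as prediction, standard deviation as uncertainty) is an assumption made for illustration, `predict_with_uncertainty` and `toy_model` are hypothetical names, and the toy model is a stand-in for a trained network, not a Maxsmi model. The three strings are hand-written valid SMILES for ethanol; in practice the variants would come from a cheminformatics toolkit.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence, Tuple


def predict_with_uncertainty(
    model: Callable[[str], float],
    smiles_variants: Sequence[str],
) -> Tuple[float, float]:
    """Aggregate per-SMILES predictions for one compound.

    Returns (mean, population standard deviation) of the predictions.
    Assumption: the mean serves as the compound-level prediction and
    the standard deviation as a rough uncertainty estimate.
    """
    if not smiles_variants:
        raise ValueError("at least one SMILES string is required")
    predictions = [model(smiles) for smiles in smiles_variants]
    return mean(predictions), pstdev(predictions)


def toy_model(smiles: str) -> float:
    """Hypothetical stand-in for a trained model.

    It deliberately depends on the string layout (position of the
    oxygen atom) so that different SMILES of the same compound give
    different outputs, as a sequence-based model could.
    """
    return float(smiles.index("O"))


# Three valid SMILES strings that all encode ethanol.
ETHANOL_VARIANTS = ["CCO", "OCC", "C(O)C"]

prediction, uncertainty = predict_with_uncertainty(toy_model, ETHANOL_VARIANTS)
print(f"prediction={prediction:.3f}, uncertainty={uncertainty:.3f}")
```

A small spread across SMILES variants suggests the model treats the different strings of a compound consistently, whereas a large spread flags a prediction that depends strongly on the chosen representation.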
- maxsmi · Data augmentation for molecular property prediction using deep learning
- Maxime Gagnebin
- The Einstein Foundation & Stiftung Charité · BIH Einstein Visiting Fellowship