Sample datasets

ChemPlot provides sample datasets that let you start exploring the library's features right away. These datasets can be loaded with the following function:

from chemplot import load_data

df = load_data("BBBP")

In this case we are loading the BBBP dataset, used in the previous sections of this manual. load_data() returns a pandas DataFrame built from the sample dataset named by its parameter. ChemPlot contains the following sample datasets:

ID                 Name                                        Type            Size
C_1478_CLINTOX_2   Clintox (Toxicity) [1][2][3][4]             Classification  1478
C_1513_BACE_2      BACE (Inhibitor) [5]                        Classification  1513
C_2039_BBBP_2      BBBP (Blood-brain barrier penetration) [6]  Classification  2039
C_41127_HIV_3      HIV [7]                                     Classification  41127
R_642_SAMPL        SAMPL (Hydration free energy) [8]           Regression      642
R_1513_BACE        BACE (Binding affinity) [5]                 Regression      1513
R_4200_LOGP        LOGP (Lipophilicity) [9]                    Regression      4200
R_1291_LOGS        LOGS (Aqueous Solubility) [10]              Regression      1291
R_9982_AQSOLDB     AQSOLDB (Aqueous Solubility) [11]           Regression      9982

The dataset IDs are constructed in the following way:

Name Formatting: type_size_name_num_of_classes.csv

  • type: R for regression (numerical target), C for classification (categorical target)

  • size: Number of instances in the dataset

  • name: Name of the dataset

  • num_of_classes: Number of classes (classification datasets only)

You can retrieve the datasets by passing their ID to load_data().
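The naming convention above can be unpacked programmatically, for example to select datasets by task or size before loading them. The following is a minimal sketch in plain Python, not part of ChemPlot: DatasetID and parse_dataset_id are hypothetical names introduced here for illustration, and the sketch assumes that dataset names contain no underscores, which holds for every ID in the table.

```python
from typing import NamedTuple, Optional


class DatasetID(NamedTuple):
    """Fields encoded in a sample dataset ID (hypothetical helper type)."""
    task: str                   # "classification" or "regression"
    size: int                   # number of instances in the dataset
    name: str                   # dataset name, e.g. "BBBP"
    num_classes: Optional[int]  # None for regression datasets


def parse_dataset_id(dataset_id: str) -> DatasetID:
    """Split an ID such as 'C_2039_BBBP_2' into its documented fields."""
    # Accept the ID with or without the ".csv" suffix from the naming scheme.
    stem = dataset_id[:-4] if dataset_id.lower().endswith(".csv") else dataset_id
    parts = stem.split("_")
    # Classification IDs carry four fields: C, size, name, num_of_classes.
    if parts[0] == "C" and len(parts) == 4:
        return DatasetID("classification", int(parts[1]), parts[2], int(parts[3]))
    # Regression IDs carry three fields: R, size, name.
    if parts[0] == "R" and len(parts) == 3:
        return DatasetID("regression", int(parts[1]), parts[2], None)
    raise ValueError(f"unrecognised dataset ID: {dataset_id!r}")


print(parse_dataset_id("C_2039_BBBP_2"))
print(parse_dataset_id("R_642_SAMPL"))
```

The first call reports a classification dataset with 2039 instances and 2 classes; the second reports a regression dataset with 642 instances and no class count.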

Note

The first 8 datasets in the table are edited versions of datasets from the MoleculeNet repository [12].

To print the available sample datasets to the console, use the following function:

from chemplot import info_data

df = info_data()

References:

[1] Gayvert, Kaitlyn M., Neel S. Madhukar, and Olivier Elemento. (2016) A data-driven approach to predicting successes and failures of clinical trials. Cell chemical biology 23.10 1294-1301.

[2] Artemov, Artem V., et al. (2016) Integrated deep learned transcriptomic and structure-based predictor of clinical trials outcomes. bioRxiv 095653.

[3] Novick, Paul A., et al. (2013) SWEETLEAD: an in silico database of approved drugs, regulated chemicals, and herbal isolates for computer-aided drug discovery. PloS one 8.11 e79568.

[4] Aggregate Analysis of ClinicalTrials.gov (AACT) Database.

[5] Subramanian, Govindan, et al. (2016) Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches. Journal of chemical information and modeling 56.10 1936-1949.

[6] Martins, Ines Filipa, et al. (2012) A Bayesian approach to in silico blood-brain barrier penetration modeling. Journal of chemical information and modeling 52.6 1686-1697.

[7] AIDS Antiviral Screen Data.

[8] Mobley, David L., and J. Peter Guthrie. (2014) FreeSolv: a database of experimental and calculated hydration free energies, with input files. Journal of computer-aided molecular design 28.7 711-720.

[9] Hersey, A. (2015) ChEMBL Deposited Data Set - AZ dataset.

[10] Huuskonen, J. (2000) Estimation of aqueous solubility for a diverse set of organic compounds based on molecular topology. Journal of chemical information and computer sciences 40.3 773-777.

[11] Sorkun, M. C., Khetan, A., & Er, S. (2019) AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds. Scientific data 6.1 1-8.

[12] Wu, Zhenqin, et al. (2018) MoleculeNet: a benchmark for molecular machine learning. Chemical science 9.2 513-530.