Sample datasets

ChemPlot provides several sample datasets that let you start exploring the library's features right away. These datasets can be loaded with the following function:

from chemplot import load_data

df = load_data("BBBP")

In this case we are loading the BBBP dataset, used in the previous sections of this manual. load_data() returns a pandas DataFrame built from the sample dataset passed as a parameter. ChemPlot contains the following sample datasets:

ID                Name                                        Type            Size
C_1478_CLINTOX_2  Clintox (Toxicity) [1] [2] [3] [4]          Classification  1478
C_1513_BACE_2     BACE (Inhibitor) [5]                        Classification  1513
C_2039_BBBP_2     BBBP (Blood-brain barrier penetration) [6]  Classification  2039
C_41127_HIV_3     HIV [7]                                     Classification  41127
R_642_SAMPL       SAMPL (Hydration free energy) [8]           Regression      642
R_1513_BACE       BACE (Binding affinity) [5]                 Regression      1513
R_4200_LOGP       LOGP (Lipophilicity) [9]                    Regression      4200
R_1291_LOGS       LOGS (Aqueous Solubility) [10]              Regression      1291
R_9982_AQSOLDB    AQSOLDB (Aqueous Solubility) [11]           Regression      9982

The dataset IDs are constructed in the following way:

Name Formatting: type_size_name_num_of_classes.csv

  • type: R for a numerical target (regression), C for a categorical target (classification)

  • size: Number of instances in the dataset

  • name: Name of dataset

  • num_of_classes: Number of classes (Categorical only)

You can retrieve the datasets by passing their ID to load_data().
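The naming scheme above can be illustrated with a short sketch that splits an ID back into its fields. Note that parse_dataset_id and DatasetID are hypothetical names introduced here for illustration; they are not part of ChemPlot, and the sketch assumes that every ID follows the type_size_name_num_of_classes pattern exactly as described.

```python
from typing import NamedTuple, Optional


class DatasetID(NamedTuple):
    """Fields encoded in a sample dataset ID (hypothetical helper type)."""
    kind: str                    # "Classification" or "Regression"
    size: int                    # number of instances in the dataset
    name: str                    # dataset name, e.g. "BBBP"
    num_classes: Optional[int]   # only present for classification datasets


def parse_dataset_id(dataset_id: str) -> DatasetID:
    """Split an ID such as 'C_2039_BBBP_2' into its documented parts.

    Hypothetical helper, not a ChemPlot function.
    """
    # Accept the ID with or without the trailing ".csv" extension.
    stem = dataset_id[:-4] if dataset_id.lower().endswith(".csv") else dataset_id
    parts = stem.split("_")
    if len(parts) < 3:
        raise ValueError(f"Not a valid dataset ID: {dataset_id!r}")

    prefix, size = parts[0], int(parts[1])
    if prefix == "C":
        # Classification IDs carry the number of classes as the last field.
        if len(parts) < 4:
            raise ValueError(f"Missing number of classes: {dataset_id!r}")
        return DatasetID("Classification", size, "_".join(parts[2:-1]), int(parts[-1]))
    if prefix == "R":
        # Regression IDs have no class count; everything after the size is the name.
        return DatasetID("Regression", size, "_".join(parts[2:]), None)
    raise ValueError(f"Unknown type prefix {prefix!r} in {dataset_id!r}")


if __name__ == "__main__":
    print(parse_dataset_id("C_2039_BBBP_2"))
    print(parse_dataset_id("R_642_SAMPL"))
```

Reading an ID this way makes it easy to check that the size field agrees with the Size column of the table, for example 2039 for C_2039_BBBP_2.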

Note

The first 8 datasets in the table are edited versions of datasets from the MoleculeNet repository [12].

To print the list of available sample datasets to the console, use the following function:

from chemplot import info_data

df = info_data()

References: