Sample datasets
ChemPlot provides some sample datasets that can be used to start exploring the library's features right away. These datasets can be loaded with the following function:
from chemplot import load_data
df = load_data("BBBP")
In this case we are loading the BBBP dataset, used in the previous sections of this manual. load_data() returns a pandas DataFrame built from the sample dataset passed as a parameter.
ChemPlot contains the following sample datasets:
ID | Name | Type | Size |
---|---|---|---|
C_1478_CLINTOX_2 | | Classification | 1478 |
C_1513_BACE_2 | BACE (Inhibitor) [5] | Classification | 1513 |
C_2039_BBBP_2 | BBBP (Blood-brain barrier penetration) [6] | Classification | 2039 |
C_41127_HIV_3 | HIV [7] | Classification | 41127 |
R_642_SAMPL | SAMPL (Hydration free energy) [8] | Regression | 642 |
R_1513_BACE | BACE (Binding affinity) [5] | Regression | 1513 |
R_4200_LOGP | LOGP (Lipophilicity) [9] | Regression | 4200 |
R_1291_LOGS | LOGS (Aqueous Solubility) [10] | Regression | 1291 |
R_9982_AQSOLDB | AQSOLDB (Aqueous Solubility) [11] | Regression | 9982 |
The dataset IDs are constructed in the following way:
Name Formatting: type_size_name_num_of_classes.csv
type: R -> Numerical (regression) and C -> Categorical (classification)
size: Number of instances in the dataset
name: Name of dataset
num_of_classes: Number of classes (Categorical only)
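The naming scheme above can be unpacked programmatically. The following is a minimal sketch, not part of ChemPlot: the `DatasetId` class and `parse_dataset_id` function are hypothetical helpers introduced here for illustration, and they assume that regression IDs simply omit the trailing class count, as the IDs in the table suggest.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetId:
    """Hypothetical container for the parts of a sample dataset ID."""
    task: str                   # "R" (numerical) or "C" (categorical)
    size: int                   # number of instances in the dataset
    name: str                   # dataset name, e.g. "BBBP"
    num_classes: Optional[int]  # only present for categorical datasets


def parse_dataset_id(dataset_id: str) -> DatasetId:
    """Split an ID such as 'C_2039_BBBP_2' into its documented parts."""
    # Drop the optional '.csv' suffix shown in the naming pattern.
    stem = dataset_id[:-4] if dataset_id.endswith(".csv") else dataset_id
    parts = stem.split("_")
    task = parts[0]
    if task == "C" and len(parts) >= 4:
        # Categorical: type_size_name_num_of_classes
        return DatasetId(task, int(parts[1]), "_".join(parts[2:-1]), int(parts[-1]))
    if task == "R" and len(parts) >= 3:
        # Numerical: type_size_name (no class count)
        return DatasetId(task, int(parts[1]), "_".join(parts[2:]), None)
    raise ValueError(f"Unrecognised dataset ID: {dataset_id!r}")


# IDs taken from the table of sample datasets.
for sample_id in ("C_2039_BBBP_2", "R_642_SAMPL", "R_9982_AQSOLDB"):
    print(parse_dataset_id(sample_id))
```

A helper like this can be handy for checking that the size encoded in an ID matches the number of rows in the DataFrame you loaded, or for grouping the available datasets by task type.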
You can retrieve the datasets by passing their ID to load_data().
Note
The first 8 datasets in the table are edited versions of datasets from the MoleculeNet repository [12].
You can print the available sample datasets to the console using the following function:
from chemplot import info_data
df = info_data()
References:
- 1
Gayvert, Kaitlyn M., Neel S. Madhukar, and Olivier Elemento. (2016) A data-driven approach to predicting successes and failures of clinical trials. Cell chemical biology 23.10 1294-1301.
- 2
Artemov, Artem V., et al. (2016) Integrated deep learned transcriptomic and structure-based predictor of clinical trials outcomes. bioRxiv 095653.
- 3
Novick, Paul A., et al. (2013) SWEETLEAD: an in silico database of approved drugs, regulated chemicals, and herbal isolates for computer-aided drug discovery. PloS one 8.11 e79568.
- 4
- 5
Subramanian, Govindan, et al. (2016) Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches. Journal of chemical information and modeling 56.10 1936-1949.
- 6
Martins, Ines Filipa, et al. (2012) A Bayesian approach to in silico blood-brain barrier penetration modeling. Journal of chemical information and modeling 52.6 1686-1697.
- 7
- 8
Mobley, David L., and J. Peter Guthrie. (2014) FreeSolv: a database of experimental and calculated hydration free energies, with input files. Journal of computer-aided molecular design 28.7 711-720.
- 9
Hersey, A. (2015) ChEMBL Deposited Data Set - AZ dataset
- 10
Huuskonen, J. (2000) Estimation of aqueous solubility for a diverse set of organic compounds based on molecular topology. Journal of Chemical Information and Computer Sciences, 40(3), 773-777.
- 11
Sorkun, M. C., Khetan, A., & Er, S. (2019) AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds. Scientific data, 6(1), 1-8.
- 12
Wu, Zhenqin, et al. (2018) MoleculeNet: a benchmark for molecular machine learning. Chemical science 9.2 513-530.