Sample datasets
===============

ChemPlot provides some sample datasets that can be used to get started right away with exploring the library's features. These datasets can be loaded with the following function:

.. code:: python3

    from chemplot import load_data
    df = load_data("BBBP")

In this case we are loading the BBBP dataset, used in the previous sections of this manual. ``load_data()`` returns a pandas DataFrame built from the sample dataset passed as a parameter.

ChemPlot contains the following sample datasets:

.. list-table::
   :header-rows: 1

   * - ID
     - Name
     - Type
     - Size
   * - C_1478_CLINTOX_2
     - Clintox (Toxicity) [1]_ [2]_ [3]_ [4]_
     - Classification
     - 1478
   * - C_1513_BACE_2
     - BACE (Inhibitor) [5]_
     - Classification
     - 1513
   * - C_2039_BBBP_2
     - BBBP (Blood-brain barrier penetration) [6]_
     - Classification
     - 2039
   * - C_41127_HIV_3
     - HIV [7]_
     - Classification
     - 41127
   * - R_642_SAMPL
     - SAMPL (Hydration free energy) [8]_
     - Regression
     - 642
   * - R_1513_BACE
     - BACE (Binding affinity) [5]_
     - Regression
     - 1513
   * - R_4200_LOGP
     - LOGP (Lipophilicity) [9]_
     - Regression
     - 4200
   * - R_1291_LOGS
     - LOGS (Aqueous Solubility) [10]_
     - Regression
     - 1291
   * - R_9982_AQSOLDB
     - AQSOLDB (Aqueous Solubility) [11]_
     - Regression
     - 9982

The dataset IDs are constructed in the following way:

**Name Formatting:** type_size_name_num_of_classes.csv

- **type:** R for numerical (regression) and C for categorical (classification)
- **size:** Number of instances in the dataset
- **name:** Name of the dataset
- **num_of_classes:** Number of classes (categorical only)

You can retrieve the datasets by passing their ID to ``load_data()``.

.. note::

    The first 8 datasets in the table are edited versions of datasets from the MoleculeNet repository [12]_.

You can print the available sample datasets to the console with ChemPlot using the following function:

.. code:: python3

    from chemplot import info_data
    df = info_data()

--------------
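The ID naming convention described above (type, size, name and, for categorical datasets, the number of classes) can be unpacked with a few lines of plain Python. The sketch below is only an illustration: ``parse_dataset_id`` is a hypothetical helper and not part of ChemPlot, and it assumes that IDs follow the documented pattern exactly, with an optional ``.csv`` suffix.

```python
def parse_dataset_id(dataset_id):
    """Split a sample dataset ID into its documented fields.

    Hypothetical helper for illustration only; it is not part of ChemPlot.
    """
    # The naming pattern ends in ".csv", while the table lists bare IDs,
    # so an optional suffix is dropped before splitting.
    stem = dataset_id[:-4] if dataset_id.lower().endswith(".csv") else dataset_id
    parts = stem.split("_")

    # At least type, size and name are required, and type must be R or C.
    if len(parts) < 3 or parts[0] not in ("R", "C"):
        raise ValueError(f"Unrecognized dataset ID: {dataset_id!r}")

    kind = parts[0]
    info = {
        "type": "categorical" if kind == "C" else "numerical",
        "size": int(parts[1]),
    }

    if kind == "C":
        # Categorical IDs carry the number of classes as the last field.
        if len(parts) < 4:
            raise ValueError(f"Missing number of classes: {dataset_id!r}")
        info["name"] = "_".join(parts[2:-1])
        info["num_of_classes"] = int(parts[-1])
    else:
        # Numerical IDs have no class count.
        info["name"] = "_".join(parts[2:])
        info["num_of_classes"] = None
    return info


print(parse_dataset_id("C_2039_BBBP_2"))
print(parse_dataset_id("R_642_SAMPL"))
```

A helper like this could be used, for example, to check that the size encoded in an ID matches the number of rows in the DataFrame returned by ``load_data()``.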

References:

.. [1] **Gayvert, Kaitlyn M., Neel S. Madhukar, and Olivier Elemento.** (2016) `A data-driven approach to predicting successes and failures of clinical trials.` Cell chemical biology 23.10 1294-1301.

.. [2] **Artemov, Artem V., et al.** (2016) `Integrated deep learned transcriptomic and structure-based predictor of clinical trials outcomes.` bioRxiv 095653.

.. [3] **Novick, Paul A., et al.** (2013) `SWEETLEAD: an in silico database of approved drugs, regulated chemicals, and herbal isolates for computer-aided drug discovery.` PloS one 8.11 e79568.

.. [4] `Aggregate Analysis of ClinicalTrials.gov (AACT) Database.`

.. [5] **Subramanian, Govindan, et al.** (2016) `Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches.` Journal of chemical information and modeling 56.10 1936-1949.

.. [6] **Martins, Ines Filipa, et al.** (2012) `A Bayesian approach to in silico blood-brain barrier penetration modeling.` Journal of chemical information and modeling 52.6 1686-1697.

.. [7] `AIDS Antiviral Screen Data.`

.. [8] **Mobley, David L., and J. Peter Guthrie.** (2014) `FreeSolv: a database of experimental and calculated hydration free energies, with input files.` Journal of computer-aided molecular design 28.7 711-720.

.. [9] **Hersey, A.** (2015) `ChEMBL Deposited Data Set - AZ dataset.`

.. [10] **Huuskonen, J.** (2000) `Estimation of aqueous solubility for a diverse set of organic compounds based on molecular topology.` Journal of Chemical Information and Computer Sciences, 40(3), 773-777.

.. [11] **Sorkun, M. C., Khetan, A., & Er, S.** (2019) `AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds.` Scientific data, 6(1), 1-8.

.. [12] **Wu, Zhenqin, et al.** (2018) `MoleculeNet: a benchmark for molecular machine learning.` Chemical science 9.2 513-530.