Similarity Analysis

In this example we use two molecular datasets: the BBBP (blood-brain barrier penetration) dataset [1], already used in the previous section, and the BACE (β-secretase inhibitors) dataset [2]. The target values in the BBBP dataset are binary, and therefore discrete, whereas the target values in the BACE dataset are continuous.

from chemplot import Plotter, load_data

data_BBBP = load_data("BBBP")
data_BACE = load_data("BACE")

To plot a subset of chemical space on a 2D graph, we need a metric that determines where each molecule is placed. ChemPlot uses the concept of “molecular similarity” for this: similar molecules are displayed closer together, while less similar molecules are displayed further apart.

ChemPlot distinguishes between two definitions of molecular similarity: structural and tailored [3].

Structural

Structural similarity is defined by the number and size of the “fragments” that different molecules share. Molecular fragments are the groups of atoms and bonds into which a molecule can be divided. The more fragments two molecules share, and the larger those fragments are, the more similar the molecules are according to structural similarity. ChemPlot uses Extended-Connectivity Fingerprints (ECFPs) [4] to determine which fragments are present in each molecule. To create a Plotter object that visualizes the desired molecules according to structural similarity, pass the keyword “structural” as the sim_type parameter when constructing the object.

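# BBBP has binary targets, so it is treated as a classification problem ("C")
cp_BBBP = Plotter.from_smiles(data_BBBP["smiles"], target=data_BBBP["target"], target_type="C", sim_type="structural")
# BACE has continuous targets, so it is treated as a regression problem ("R")
cp_BACE = Plotter.from_smiles(data_BACE["smiles"], target=data_BACE["target"], target_type="R", sim_type="structural")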
cp_BBBP.tsne()
cp_BBBP.visualize_plot()
[Figure: t-SNE plot of the BBBP dataset using structural similarity]
cp_BACE.tsne()
cp_BACE.visualize_plot()
[Figure: t-SNE plot of the BACE dataset using structural similarity]
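
To make the idea of shared fragments more concrete, the following minimal sketch compares molecules represented as sets of fragment labels using the Tanimoto (Jaccard) coefficient, a common way to score fingerprint overlap. This is only an illustration: the fragment labels are hypothetical stand-ins rather than real ECFP identifiers, the function name tanimoto is our own, and the code does not claim to reproduce what ChemPlot computes internally.

```python
def tanimoto(fragments_a, fragments_b):
    """Return the Tanimoto (Jaccard) coefficient of two fragment collections."""
    a, b = set(fragments_a), set(fragments_b)
    union = a | b
    if not union:
        # Two empty fragment sets carry no information; treat them as dissimilar.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical fragment labels, chosen for illustration only.
ethanol = {"C", "CC", "CO", "O"}
propanol = {"C", "CC", "CCC", "CO", "O"}
benzene = {"c", "cc", "ccc"}

sim_close = tanimoto(ethanol, propanol)  # many shared fragments
sim_far = tanimoto(ethanol, benzene)     # no shared fragments

print(f"ethanol vs propanol: {sim_close:.2f}")
print(f"ethanol vs benzene:  {sim_far:.2f}")
```

Molecules with a higher overlap score would be expected to land closer together in a structural-similarity plot, while molecules with no fragments in common would be expected to land further apart.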

Tailored

Tailored similarity is a similarity metric that takes the target property into account when determining whether two molecules are similar. First, a general set of descriptors is calculated for each molecule; then a subset of those descriptors is selected by optimizing for the target property. Finally, ChemPlot uses the values of the selected descriptors to decide which molecules are more similar than others. To create a Plotter object that visualizes the desired molecules according to tailored similarity, pass the keyword “tailored” as the sim_type parameter when constructing the object. Because “tailored” is the default value of sim_type when a list of target values is passed at construction, we could have omitted the last parameter in the following example and still obtained the same objects.

cp_BBBP = Plotter.from_smiles(data_BBBP["smiles"], target=data_BBBP["target"], target_type="C", sim_type="tailored")
cp_BACE = Plotter.from_smiles(data_BACE["smiles"], target=data_BACE["target"], target_type="R", sim_type="tailored")
cp_BBBP.tsne()
cp_BBBP.visualize_plot()
[Figure: t-SNE plot of the BBBP dataset using tailored similarity]
cp_BACE.tsne()
cp_BACE.visualize_plot()
[Figure: t-SNE plot of the BACE dataset using tailored similarity]
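
The descriptor-selection step behind tailored similarity can be pictured with a small sketch. The code below ranks descriptors by the absolute Pearson correlation of their values with the target and keeps the top k. This is a deliberately simplified stand-in: the text above does not specify which optimization ChemPlot uses, so correlation ranking, the function names pearson and select_descriptors, and the toy descriptor table are all assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_descriptors(descriptors, target, k):
    """Return the names of the k descriptors most correlated with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in descriptors.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy descriptor table (hypothetical values): one list per descriptor,
# one entry per molecule.
descriptors = {
    "desc_a": [1.0, 2.0, 3.0, 4.0],    # rises with the target
    "desc_b": [2.0, 2.0, 2.0, 2.0],    # constant, carries no information
    "desc_c": [4.0, 3.1, 1.9, 1.0],    # falls as the target rises
    "desc_d": [1.0, -1.0, -1.0, 1.0],  # unrelated to the target
}
target = [0.1, 0.2, 0.3, 0.4]

selected = select_descriptors(descriptors, target, k=2)
print(selected)
```

Only the selected descriptors would then be used to compare molecules, which is why two datasets with different target properties can yield different tailored layouts for the same molecules.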

References:

[1] Martins, Ines Filipa, et al. (2012). A Bayesian approach to in silico blood-brain barrier penetration modeling. Journal of Chemical Information and Modeling, 52(6), 1686-1697.

[2] Subramanian, Govindan, et al. (2016). Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches. Journal of Chemical Information and Modeling, 56(10), 1936-1949.

[3] Basak, S.C. and Grunwald, G.D. (1995). Predicting mutagenicity of chemicals using topological and quantum chemical parameters: a similarity based study. Chemosphere, 31, 2529-2546.

[4] Rogers, D. and Hahn, M. (2010). Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5), 742-754.