Dimensionality Reduction

ChemPlot uses machine learning techniques to reduce the number of dimensions, or features, describing each molecule to two, so that the dataset can be drawn as a 2D plot. The available algorithms are PCA [1], t-SNE [2] and UMAP [3].

The following examples use two molecular datasets introduced in the previous section: the BBBP (blood-brain barrier penetration) dataset [4] and the BACE (β-secretase inhibitors) dataset [5].

from chemplot import Plotter, load_data

# Load the two example datasets bundled with ChemPlot
data_BBBP = load_data("BBBP")
data_BACE = load_data("BACE")

# target_type="C": classification target (BBBP); target_type="R": regression target (BACE)
cp_BBBP = Plotter.from_smiles(data_BBBP["smiles"], target=data_BBBP["target"], target_type="C")
cp_BACE = Plotter.from_smiles(data_BACE["smiles"], target=data_BACE["target"], target_type="R")

PCA

ChemPlot uses PCA from the scikit-learn package to compute the first two principal components of the molecular dataset. PCA is fast to compute and produces a visualization that gives a global view of the data.

cp_BBBP.pca()
cp_BBBP.visualize_plot()

[Figure: PCA plot of the BBBP dataset]

cp_BACE.pca()
cp_BACE.visualize_plot()

[Figure: PCA plot of the BACE dataset]
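
As a rough illustration of what computing two principal components involves, the sketch below projects a feature matrix onto its first two principal components using plain NumPy. It is not ChemPlot's implementation, which relies on scikit-learn as stated above. The `pca_2d` helper and the random binary matrix standing in for molecular features are assumptions made for this example.

```python
import numpy as np

def pca_2d(features: np.ndarray) -> np.ndarray:
    """Project the rows of `features` onto their first two principal components."""
    # PCA operates on mean-centred data.
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Keep only the two leading directions: result has shape (n_molecules, 2).
    return centered @ vt[:2].T

# Hypothetical stand-in for molecular features: 100 molecules, 64 binary features.
rng = np.random.default_rng(0)
fingerprints = rng.integers(0, 2, size=(100, 64)).astype(float)

coords = pca_2d(fingerprints)
print(coords.shape)  # (100, 2)
```

Each row of `coords` is the 2D position of one molecule. Because the projection is linear, distances between far-apart molecules are not distorted by any neighbourhood weighting, which is why PCA tends to give a global view of the data.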

t-SNE

ChemPlot uses t-SNE from the scikit-learn package to reduce the number of features of the molecular dataset to two. t-SNE looks at the local neighbourhoods of molecules when reducing their dimensions. As a result, the local structure of the dataset is well preserved, while the global structure is mostly lost in the 2D plot. Because of this locality preservation, however, t-SNE makes it possible to visualize well-defined clusters of similar molecules that exhibit similar properties.

cp_BBBP.tsne()
cp_BBBP.visualize_plot()

[Figure: t-SNE plot of the BBBP dataset]

cp_BACE.tsne()
cp_BACE.visualize_plot()

[Figure: t-SNE plot of the BACE dataset]

Two important parameters of the tsne() method are perplexity and pca. The former is a positive integer which defines the size of the neighbourhoods the algorithm considers when analyzing the dataset: the higher the perplexity, the wider the neighbourhoods. Recommended values for perplexity range from 5 to 50. The pca parameter is a Boolean which indicates whether the data should be preprocessed with PCA before t-SNE is applied. It is taken into account when plotting by structural similarity, where each molecule is encoded with a large number of features. Since t-SNE is computationally expensive, preprocessing the data can save a substantial amount of time when generating plots, at the cost of losing some of the molecular structural information.
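
To give some intuition for why a higher perplexity corresponds to wider neighbourhoods, the sketch below computes the perplexity 2^H of the Gaussian neighbour distribution used in the t-SNE paper [2] around a single point, for several bandwidths sigma. This is an illustration only, not code from ChemPlot or scikit-learn. The `perplexity` helper and the synthetic distances are assumptions made for this example.

```python
import numpy as np

def perplexity(sq_dists: np.ndarray, sigma: float) -> float:
    """Perplexity 2**H of the Gaussian neighbour distribution around one point.

    `sq_dists` holds the squared distances from that point to every other point.
    """
    logits = -sq_dists / (2.0 * sigma ** 2)
    logits -= logits.max()                 # stabilise the exponentials
    p = np.exp(logits)
    p /= p.sum()                           # conditional probabilities p_{j|i}
    nz = p[p > 0]                          # skip zeros so log2 stays finite
    entropy = -np.sum(nz * np.log2(nz))    # Shannon entropy in bits
    return float(2.0 ** entropy)

# Hypothetical squared distances from one molecule to 50 others.
sq_dists = np.linspace(0.1, 10.0, 50) ** 2

for sigma in (0.5, 2.0, 8.0):
    # A wider Gaussian spreads probability over more neighbours,
    # so the perplexity (an effective neighbour count) grows.
    print(sigma, perplexity(sq_dists, sigma))
```

t-SNE works in the opposite direction: given the requested perplexity, it searches for the bandwidth around each point that produces it. In ChemPlot the parameters are passed to the method directly, for example cp_BBBP.tsne(perplexity=30, pca=True), where 30 is only an illustrative value inside the recommended range.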

UMAP

ChemPlot uses UMAP from the umap-learn package to reduce the number of features of the molecular dataset to two. Like t-SNE, UMAP looks at the local neighbourhoods of molecules when reducing their dimensions. This also results in 2D clusters of locally similar molecules, but UMAP retains more of the global structure of the dataset than t-SNE does. UMAP is also considerably faster to compute than t-SNE.

cp_BBBP.umap()
cp_BBBP.visualize_plot()

[Figure: UMAP plot of the BBBP dataset]

cp_BACE.umap()
cp_BACE.visualize_plot()

[Figure: UMAP plot of the BACE dataset]

Three important parameters of the umap() method are n_neighbors, min_dist and pca. The first is a positive integer which constrains the size of the local neighbourhood the algorithm considers when analyzing the dataset. Low values of n_neighbors make ChemPlot visualize very local structures. The min_dist parameter takes a value from 0.0 to 0.99 and sets the minimum distance allowed between points in the 2D plot. The pca parameter is a Boolean which indicates whether the data should be preprocessed with PCA.
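
The effect of n_neighbors can be pictured with a small nearest-neighbour computation: the more neighbours each point is asked to consider, the larger the region around it that counts as local. The sketch below uses plain NumPy on synthetic data and is not how umap-learn performs its neighbour search. The `knn_radius` helper, the Euclidean metric and the random points are assumptions made for this example.

```python
import numpy as np

def knn_radius(points: np.ndarray, n_neighbors: int) -> np.ndarray:
    """Distance from each point to its n_neighbors-th nearest neighbour (itself excluded)."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))  # full pairwise Euclidean distances
    dists.sort(axis=1)                           # column 0 is the zero self-distance
    return dists[:, n_neighbors]

# Hypothetical feature vectors: 200 molecules, 16 features each.
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 16))

for k in (5, 15, 50):
    # Mean radius of the neighbourhood that contains k neighbours.
    print(k, knn_radius(points, k).mean())
```

The mean radius grows with k, so small values of n_neighbors emphasise fine local structure and larger values bring in more of the global picture. In ChemPlot the parameters are passed to the method directly, for example cp_BBBP.umap(n_neighbors=15, min_dist=0.1, pca=True), where the values shown are illustrative choices only.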


References:

[1] Wold, S., Esbensen, K., Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3), 37-52.

[2] van der Maaten, L., Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605.

[3] McInnes, L., Healy, J., Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.

[4] Martins, I. F., et al. (2012). A Bayesian approach to in silico blood-brain barrier penetration modeling. Journal of Chemical Information and Modeling, 52(6), 1686-1697.

[5] Subramanian, G., et al. (2016). Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches. Journal of Chemical Information and Modeling, 56(10), 1936-1949.