ChemPlot: A Beginner’s Guide to Visualizing Chemical Data

ChemPlot Tutorial: From SMILES to Interactive Plots

ChemPlot is an open-source Python library designed to make chemical data visualization simple, flexible, and informative. This tutorial walks through a practical workflow: starting with SMILES strings, converting them into molecular representations, computing descriptors or fingerprints, reducing dimensionality, and producing both static and interactive plots for exploratory data analysis, model debugging, and presentation-ready figures.


What you’ll learn

  • How to prepare molecular data from SMILES
  • How to generate fingerprints and descriptors compatible with ChemPlot
  • Which dimensionality reduction methods are commonly used with chemical data
  • How to create static and interactive plots with ChemPlot
  • Best practices for interpreting plots and avoiding common pitfalls

1. Installation and setup

Install ChemPlot (and common dependencies) using pip:

pip install chemplot rdkit matplotlib plotly scikit-learn pandas

Note: RDKit must be installed; on some systems you may prefer conda:

conda install -c conda-forge rdkit
pip install chemplot plotly

Then import libraries in Python:

import pandas as pd
from chemplot import Plotter
from rdkit import Chem
from rdkit.Chem import AllChem

2. Preparing data from SMILES

Start with a CSV or list of SMILES strings and optional labels (activity, property, cluster ids).

Example CSV structure:

  • smiles
  • id (optional)
  • activity (optional, numeric or categorical)

Load the data:

df = pd.read_csv('molecules.csv')  # contains a 'smiles' column
df.head()

Validate and sanitize SMILES; convert to RDKit molecules and remove invalid entries:

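def sanitize_smiles(smiles):
    # Missing values (NaN) are not strings and would make RDKit raise
    if not isinstance(smiles, str):
        return None
    # MolFromSmiles sanitizes by default and returns None if parsing fails
    return Chem.MolFromSmiles(smiles)

df['rdkit_mol'] = df['smiles'].apply(sanitize_smiles)
df = df[df['rdkit_mol'].notnull()].reset_index(drop=True)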
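Before parsing, it can help to tidy the raw column so that missing values, stray whitespace, and repeated rows never reach RDKit. The following is a minimal pandas-only sketch; the helper name `clean_smiles_column` and the toy data are hypothetical, and de-duplicating on the raw string is an assumption, since two different strings can still encode the same molecule.

```python
import pandas as pd

def clean_smiles_column(df, column="smiles"):
    """Drop missing/blank SMILES, strip whitespace, and remove exact duplicates.

    Hypothetical helper for illustration; it does not check chemical validity.
    """
    out = df.dropna(subset=[column]).copy()
    out[column] = out[column].astype(str).str.strip()
    out = out[out[column] != ""]
    out = out.drop_duplicates(subset=column)
    return out.reset_index(drop=True)

# Toy input: one padded duplicate, one missing value, one empty string
raw = pd.DataFrame({"smiles": ["CCO", " CCO ", None, "", "c1ccccc1"]})
cleaned = clean_smiles_column(raw)
print(cleaned["smiles"].tolist())
```

Running this kind of cleanup first keeps the RDKit validation step focused on chemistry rather than on formatting problems.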

3. Generating fingerprints and descriptors

ChemPlot supports fingerprints (e.g., Morgan) and descriptors. Fingerprints are often used for similarity and visualization.

Create Morgan fingerprints:

from rdkit.Chem import AllChem

def mol_to_morgan_fp(mol, radius=2, n_bits=2048):
    # Bit-vector Morgan fingerprint, converted to a plain list of 0/1 values
    bitvect = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return list(bitvect)

df['morgan_fp'] = df['rdkit_mol'].apply(mol_to_morgan_fp)
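To make "similarity" concrete: for bit fingerprints it is commonly measured with the Tanimoto (Jaccard) coefficient, the number of bits set in both fingerprints divided by the number set in either. Below is a small pure-Python sketch that works on 0/1 lists like those returned by `mol_to_morgan_fp`. The function name `tanimoto` and the toy 8-bit vectors are illustrative assumptions, and returning 0.0 for two all-zero fingerprints is a choice made here, not a standard.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two equal-length 0/1 bit lists."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Assumption: two empty fingerprints are treated as having similarity 0.0
    return both / either if either else 0.0

# Toy 8-bit fingerprints (made up for illustration, not real Morgan bits)
fp1 = [1, 1, 0, 0, 1, 0, 0, 0]
fp2 = [1, 0, 0, 0, 1, 1, 0, 0]
sim = tanimoto(fp1, fp2)
print(sim)  # 2 shared bits / 4 bits set in either = 0.5
```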

If you want descriptors (physicochemical properties), compute them with RDKit or other libraries and attach as numeric columns.


4. Building a ChemPlot object

ChemPlot works with SMILES directly or with precomputed features. For a SMILES-to-plot workflow, you can pass SMILES and let ChemPlot compute fingerprints internally, or pass your fingerprint matrix.

Option A — pass SMILES directly:

from chemplot import Plotter

cp = Plotter.from_smiles(df['smiles'].tolist(), method='similarity', fingerprint='morgan', n_jobs=4)

Option B — pass features (fingerprint matrix):

fp_matrix = list(df['morgan_fp'])
cp = Plotter.from_descriptors(fp_matrix, method='similarity')

Key parameters:

  • fingerprint: ‘morgan’, ‘maccs’, etc.
  • method: ‘similarity’ (preserves distances based on similarity) or ‘dimensionality’ (reduces descriptors)
  • n_jobs: parallelism for faster fingerprint calculation

5. Dimensionality reduction techniques

ChemPlot supports several dimensionality reduction algorithms to produce 2D coordinates for plotting:

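  • PCA: fast and linear; a good first look at overall variance
  • t-SNE: emphasizes local neighborhoods, which makes clusters easy to see; distances between clusters are not reliable
  • UMAP: usually faster than t-SNE on large datasets and tends to retain more global structure
  • MDS: classical multidimensional scaling, which tries to preserve pairwise distances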

Choose based on dataset size and the pattern you want to emphasize. Example with UMAP:

coords = cp.reduce_dims(reducer='umap', random_state=42, n_neighbors=15, min_dist=0.1) 

For t-SNE (larger datasets may be slow):

coords = cp.reduce_dims(reducer='tsne', perplexity=30, n_iter=1000) 
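To see what these reducers do independently of any plotting wrapper, the sketch below applies scikit-learn's `PCA` and `TSNE` directly to a stand-in fingerprint matrix. The random binary matrix and the variable names are assumptions for illustration only; with real data you would substitute your own feature matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for a fingerprint matrix: 60 molecules x 128 bits (random, not real chemistry)
X = rng.integers(0, 2, size=(60, 128)).astype(float)

# Linear projection onto the two directions of largest variance
pca_coords = PCA(n_components=2, random_state=0).fit_transform(X)

# Nonlinear embedding; perplexity must be smaller than the number of samples
tsne_coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)

print(pca_coords.shape, tsne_coords.shape)
```

Both calls return one 2D point per molecule. Fixing `random_state` makes the t-SNE layout repeatable, which matters when comparing runs with different perplexities.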

6. Creating static plots (Matplotlib)

To create a basic scatter plot colored by an activity column:

cp.plot_scatter(df['activity'].tolist(), plot_backend='matplotlib', title='Activity Landscape') 

For categorical coloring (e.g., class labels):

cp.plot_scatter(df['class_label'].tolist(), plot_backend='matplotlib', palette='Set1') 

You can save figures with Matplotlib’s savefig:

import matplotlib.pyplot as plt

plt.savefig('chemplot_activity.png', dpi=300, bbox_inches='tight')
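Note that `plt.savefig` writes whichever figure is current, which can be the wrong one, or an empty one if it is called after `plt.show()`. When you build a plot yourself, holding an explicit figure handle avoids that ambiguity. The sketch below uses plain Matplotlib with made-up coordinates and activity values; it writes to an in-memory buffer so that it leaves no file behind, and passing a filename to `fig.savefig` works the same way.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
xy = rng.normal(size=(50, 2))            # placeholder 2D embedding
activity = rng.uniform(0, 10, size=50)   # placeholder activity values

fig, ax = plt.subplots(figsize=(5, 4))
points = ax.scatter(xy[:, 0], xy[:, 1], c=activity, cmap="viridis", s=20, alpha=0.8)
fig.colorbar(points, ax=ax, label="activity")
ax.set_title("Activity Landscape")

# Save through the figure handle rather than the implicit "current figure"
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=300, bbox_inches="tight")
plt.close(fig)
png_bytes = buffer.getvalue()
```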

7. Creating interactive plots (Plotly)

Interactive plots allow zoom, hover tooltips, and click events. Use the Plotly backend:

fig = cp.plot_interactive(df['activity'].tolist(), plot_backend='plotly', title='Interactive Activity Plot')
fig.show()

Customize hover info to show SMILES, ID, or property:

hover_data = df[['id', 'smiles', 'activity']].to_dict('records')
fig = cp.plot_interactive(df['activity'].tolist(), plot_backend='plotly', hover_data=hover_data)
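Long SMILES strings and unrounded numbers can make tooltips hard to read. One option is to pre-format each record into a short string before handing it to the plotting layer. The helper below is a hypothetical sketch in plain Python: the name `format_hover`, the 40-character limit, and the sample records are assumptions. It uses `<br>` as the line separator, which Plotly renders as a line break in hover text.

```python
def format_hover(record, max_smiles_len=40):
    """Turn one {'id', 'smiles', 'activity'} record into a multi-line tooltip string."""
    smiles = record["smiles"]
    if len(smiles) > max_smiles_len:
        # Truncate long SMILES so the tooltip stays compact
        smiles = smiles[: max_smiles_len - 3] + "..."
    return f"ID: {record['id']}<br>SMILES: {smiles}<br>Activity: {record['activity']:.2f}"

# Made-up records in the same shape as df[...].to_dict('records')
records = [
    {"id": "mol-1", "smiles": "CCO", "activity": 5.123},
    {"id": "mol-2", "smiles": "c1ccccc1" * 8, "activity": 7.0},
]
hover_text = [format_hover(r) for r in records]
print(hover_text[0])
```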

Export interactive HTML:

fig.write_html('interactive_chemplot.html') 

8. Clustering and annotations

Combine clustering with ChemPlot to highlight groups (e.g., KMeans on the embedding or on fingerprints):

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=0).fit(coords)
df['cluster'] = kmeans.labels_
cp.plot_interactive(df['cluster'].tolist(), plot_backend='plotly')
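The number of clusters above is fixed by hand. One common way to choose it is to compare silhouette scores across several candidate values and keep the best. The sketch below does this with scikit-learn on synthetic 2D points that stand in for embedding coordinates; the blob centers and the range of `k` are arbitrary assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2D points standing in for embedding coordinates (three well-separated blobs)
points, _ = make_blobs(
    n_samples=150,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.6,
    random_state=0,
)

# Score each candidate k; a higher silhouette means tighter, better-separated clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On real embeddings the silhouette curve is rarely this clear-cut, so treat the selected value as a starting point and inspect the clusters chemically.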

Annotate specific molecules (e.g., outliers or exemplars) by adding markers or using different symbols/colors.


9. Tips for interpretation and common pitfalls

  • Fingerprint choice matters: Morgan is general-purpose; MACCS may capture different substructure features.
  • Scaling: descriptor matrices often need scaling (StandardScaler) before PCA/UMAP.
  • t-SNE stochasticity: set random_state and try multiple perplexities.
  • Overplotting: for large datasets use alpha transparency, density maps, or subsampling.
  • Chemical relevance: clustering in embedding space suggests similarity in the chosen representation but may not map to biological activity directly—validate with domain knowledge or experiments.
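The scaling tip is easy to demonstrate. In the sketch below, three independent made-up descriptors live on very different scales. Without scaling, the first principal component is dominated by the largest-scale column; with `StandardScaler` in front of `PCA`, the variance is spread more evenly. The descriptor values are synthetic and the column meanings are only suggestive.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three independent made-up descriptors on very different scales
# (think molecular weight, logP, and a 0-1 fraction)
X = np.column_stack([
    rng.normal(350, 80, size=200),
    rng.normal(2.5, 1.0, size=200),
    rng.uniform(0, 1, size=200),
])

# PCA on the raw matrix versus PCA after standardizing each column
raw_pca = PCA(n_components=2).fit(X)
scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)
scaled_pca = scaled.named_steps["pca"]

raw_ratio = raw_pca.explained_variance_ratio_[0]
scaled_ratio = scaled_pca.explained_variance_ratio_[0]
print(round(raw_ratio, 3), round(scaled_ratio, 3))
```

Binary fingerprints are already on a common 0/1 scale, so this step mainly matters for physicochemical descriptor matrices.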

10. Example end-to-end script

import pandas as pd
from chemplot import Plotter
from rdkit import Chem
from sklearn.cluster import KMeans

# Load the data; the CSV must include a 'smiles' column
df = pd.read_csv('molecules.csv')

# Drop rows whose SMILES cannot be parsed by RDKit
df['rdkit_mol'] = df['smiles'].apply(Chem.MolFromSmiles)
df = df[df['rdkit_mol'].notnull()].reset_index(drop=True)

# Build the plotter and compute a 2D embedding
cp = Plotter.from_smiles(df['smiles'].tolist(), fingerprint='morgan', method='similarity', n_jobs=4)
coords = cp.reduce_dims(reducer='umap', random_state=42)

# Cluster the embedding and color the interactive plot by cluster
kmeans = KMeans(n_clusters=4, random_state=42).fit(coords)
df['cluster'] = kmeans.labels_
fig = cp.plot_interactive(df['cluster'].tolist(), plot_backend='plotly', hover_data=df[['smiles']].to_dict('records'))
fig.write_html('chemplot_example.html')

11. Advanced topics & resources

  • Custom fingerprints or learned embeddings from graph neural networks can be passed as descriptors for richer visualizations.
  • Integrate with cheminformatics dashboards (Dash, Streamlit) for interactive exploration.
  • Use substructure highlighting in hover/tooltips for SAR analysis.

ChemPlot streamlines the path from SMILES strings to insightful visualizations. With careful choice of representation, dimensionality reduction, and visualization settings, it becomes a powerful tool for exploratory chemical data analysis and communicating molecular relationships.
