ChemPlot Tutorial: From SMILES to Interactive Plots
ChemPlot is an open-source Python library designed to make chemical data visualization simple, flexible, and informative. This tutorial walks through a practical workflow: starting with SMILES strings, converting them into molecular representations, computing descriptors and fingerprints, reducing dimensionality, and producing both static and interactive plots for exploratory data analysis, model debugging, and presentation-ready figures.
What you’ll learn
- How to prepare molecular data from SMILES
- How to generate fingerprints and descriptors compatible with ChemPlot
- Which dimensionality reduction methods are commonly used with chemical data
- How to create static and interactive plots with ChemPlot
- Best practices for interpreting plots and avoiding common pitfalls
1. Installation and setup
Install ChemPlot and the common dependencies with pip:
    pip install chemplot rdkit matplotlib plotly scikit-learn pandas
Note: RDKit must be installed (the PyPI package is now named rdkit; rdkit-pypi is its older name). On some systems you may prefer conda:
    conda install -c conda-forge rdkit
    pip install chemplot plotly
Then import the libraries in Python:
    import pandas as pd
    from chemplot import Plotter
    from rdkit import Chem
    from rdkit.Chem import AllChem
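Before going further, it can help to confirm that the packages are actually importable. The sketch below uses only the standard library; the helper name check_packages is made up for this tutorial, and the module names in the list are assumptions about what you installed.

```python
import importlib.util


def check_packages(names):
    """Return {module_name: True/False} for whether each top-level module is found."""
    status = {}
    for name in names:
        # find_spec returns None when a top-level module is not installed
        status[name] = importlib.util.find_spec(name) is not None
    return status


# Assumed module names for this tutorial's dependencies
required = ["pandas", "rdkit", "chemplot", "sklearn", "matplotlib"]
for module, ok in check_packages(required).items():
    print(f"{module:12s} {'ok' if ok else 'MISSING'}")
```

Any module reported as MISSING needs to be installed in the active environment before the later sections will run.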
2. Preparing data from SMILES
Start with a CSV or list of SMILES strings and optional labels (activity, property, cluster ids).
Example CSV structure:
- smiles
- id (optional)
- activity (optional, numeric or categorical)
Load the data:
    df = pd.read_csv('molecules.csv')  # contains a 'smiles' column
    df.head()
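A quick structural check on the input file can catch problems before RDKit is involved. This sketch uses the standard csv module on an in-memory example; the helper name load_smiles_rows and the sample rows are illustrative and not part of ChemPlot.

```python
import csv
import io


def load_smiles_rows(handle, smiles_column="smiles"):
    """Read CSV rows, require a SMILES column, and skip rows where it is blank."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or smiles_column not in reader.fieldnames:
        raise ValueError(f"missing required column: {smiles_column!r}")
    rows = []
    for row in reader:
        smiles = (row[smiles_column] or "").strip()
        if smiles:  # drop rows with an empty SMILES field
            row[smiles_column] = smiles
            rows.append(row)
    return rows


# Made-up sample data: ethanol, a blank entry, and benzene
sample = io.StringIO("id,smiles,activity\n1,CCO,0.5\n2,,1.2\n3,c1ccccc1,2.0\n")
rows = load_smiles_rows(sample)
print([r["smiles"] for r in rows])  # ['CCO', 'c1ccccc1']
```

This only checks that a SMILES string is present; whether it is chemically valid is a separate question that needs a parser such as RDKit.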
Validate and sanitize SMILES; convert to RDKit molecules and remove invalid entries:
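    def sanitize_smiles(smiles):
        # Missing values load as NaN (a float), which MolFromSmiles cannot parse
        if not isinstance(smiles, str):
            return None
        # MolFromSmiles sanitizes by default and returns None for invalid SMILES
        return Chem.MolFromSmiles(smiles)
    df['rdkit_mol'] = df['smiles'].apply(sanitize_smiles)
    df = df[df['rdkit_mol'].notnull()].reset_index(drop=True)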
3. Generating fingerprints and descriptors
ChemPlot supports fingerprints (e.g., Morgan) and descriptors. Fingerprints are often used for similarity and visualization.
Create Morgan fingerprints:
    from rdkit.Chem import AllChem
    def mol_to_morgan_fp(mol, radius=2, n_bits=2048):
        # Bit-vector Morgan fingerprint, returned as a list of 0/1 integers
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        return list(fp)
    df['morgan_fp'] = df['rdkit_mol'].apply(mol_to_morgan_fp)
If you want descriptors (physicochemical properties), compute them with RDKit or other libraries and attach as numeric columns.
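Since fingerprints are mostly compared by similarity, here is a minimal sketch of the Tanimoto coefficient on 0/1 bit lists like the ones produced above. It is plain Python; the function name tanimoto and the toy 8-bit vectors are illustrative. For RDKit bit-vector objects, RDKit's own DataStructs.TanimotoSimilarity does the same job.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient for two equal-length 0/1 sequences."""
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(bits_a, bits_b) if a and b)
    either = sum(1 for a, b in zip(bits_a, bits_b) if a or b)
    # Two all-zero fingerprints share no features; treat that case as 0.0 here
    return both / either if either else 0.0


fp1 = [1, 1, 0, 0, 1, 0, 0, 0]
fp2 = [1, 0, 0, 0, 1, 1, 0, 0]
print(tanimoto(fp1, fp2))  # 2 bits set in both / 4 bits set in either = 0.5
```

A value of 1.0 means identical bit patterns and 0.0 means no shared bits; points that sit close together in a fingerprint-based plot should generally have high values.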
4. Building a ChemPlot object
ChemPlot builds its Plotter from molecular structures (SMILES or InChI strings) and computes the features itself, so a SMILES-to-plot workflow only needs the SMILES column. Two kinds of similarity are available.
Option A: structural similarity, based on Morgan (ECFP) fingerprints:
    from chemplot import Plotter
    cp = Plotter.from_smiles(df['smiles'].tolist(), sim_type='structural')
Option B: tailored similarity, based on descriptors selected against a target property:
    cp = Plotter.from_smiles(df['smiles'].tolist(), target=df['activity'].tolist(), target_type='R', sim_type='tailored')
Key parameters:
- sim_type: 'structural' (fingerprint-based) or 'tailored' (descriptor-based, requires a target)
- target: optional list of property values or class labels, used to color the plot
- target_type: 'R' for a numeric (regression) target, 'C' for a categorical (classification) target
5. Dimensionality reduction techniques
ChemPlot provides three dimensionality reduction algorithms to produce 2D coordinates for plotting:
- PCA: fast, linear
- t-SNE: preserves local structure, good for revealing clusters
- UMAP: usually faster than t-SNE, and tends to retain more of the global structure
MDS (classical multidimensional scaling) is a fourth option you may see elsewhere; it is not built into ChemPlot, but scikit-learn offers it if you compute features yourself.
Choose based on dataset size and the pattern you want to emphasize. Example with UMAP:
    coords = cp.umap(n_neighbors=15, min_dist=0.1, random_state=42)
For t-SNE (which can be slow on larger datasets):
    coords = cp.tsne(perplexity=30, random_state=42)
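One rough way to judge an embedding is to ask how many of each point's nearest neighbors in the original feature space are still its neighbors in 2D. The sketch below is plain Python with made-up toy data; knn_overlap is a name invented here, and Euclidean distance stands in for whatever metric suits your features.

```python
import math


def nearest(points, i, k):
    """Indices of the k points closest to points[i] (Euclidean), excluding i."""
    order = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )
    return set(order[:k])


def knn_overlap(high_dim, low_dim, k=2):
    """Mean fraction of each point's k nearest neighbors kept after reduction."""
    if len(high_dim) != len(low_dim):
        raise ValueError("both inputs must describe the same points")
    total = 0.0
    for i in range(len(high_dim)):
        kept = nearest(high_dim, i, k) & nearest(low_dim, i, k)
        total += len(kept) / k
    return total / len(high_dim)


# Toy data: two groups in 3D, projected to 2D by dropping the last coordinate
high = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (10, 10, 0), (10, 11, 0), (11, 10, 0)]
low = [(x, y) for x, y, _ in high]
print(knn_overlap(high, low, k=2))  # 1.0: every neighborhood is preserved
```

Scores well below 1.0 suggest that apparent clusters in the plot should be read with caution, since the projection has rearranged local neighborhoods.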
6. Creating static plots (Matplotlib)
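To create a basic scatter plot colored by an activity column, pass the column as the target when building the Plotter, then reduce and plot:
    cp = Plotter.from_smiles(df['smiles'].tolist(), target=df['activity'].tolist(), target_type='R', sim_type='structural')
    cp.umap(random_state=42)
    cp.visualize_plot(kind='scatter', title='Activity Landscape')
For categorical coloring (e.g., class labels), set target_type='C':
    cp = Plotter.from_smiles(df['smiles'].tolist(), target=df['class_label'].tolist(), target_type='C', sim_type='structural')
    cp.umap(random_state=42)
    cp.visualize_plot(kind='scatter')
You can save a figure by passing a filename; for finer control over resolution or bounding box, call Matplotlib's savefig yourself:
    cp.visualize_plot(filename='chemplot_activity.png')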
7. Creating interactive plots (Bokeh)
Interactive plots allow zooming, panning, and hover tooltips. ChemPlot renders them with Bokeh, colored by the target given when the Plotter was built:
    fig = cp.interactive_plot(show_plot=True, title='Interactive Activity Plot')
Hover tooltips are generated by ChemPlot and show each molecule's structure. To display extra fields such as IDs, build your own plot from the coordinates returned by the reduction step.
Export interactive HTML by passing a filename:
    cp.interactive_plot(filename='interactive_chemplot.html')
8. Clustering and annotations
Combine clustering with ChemPlot to highlight groups. ChemPlot's cluster method runs KMeans on the 2D embedding, so call it after a reduction step:
    cp.umap(random_state=42)
    cp.cluster(n_clusters=5)
    cp.interactive_plot(clusters=True, show_plot=True)
To cluster on fingerprints instead, run scikit-learn's KMeans on your own fingerprint matrix.
Annotate specific molecules (e.g., outliers or exemplars) by adding markers or using different symbols/colors.
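To pick exemplars worth annotating, one simple option is the point nearest each cluster's centroid. The sketch below assumes you already have 2D coordinates and integer cluster labels as plain lists; cluster_exemplars is a hypothetical helper, not a ChemPlot function, and the data is made up.

```python
import math
from collections import defaultdict


def cluster_exemplars(coords, labels):
    """Map each cluster label to the index of the point closest to its centroid."""
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    exemplars = {}
    for label, idxs in members.items():
        # Centroid = mean x and mean y of the cluster's points
        cx = sum(coords[i][0] for i in idxs) / len(idxs)
        cy = sum(coords[i][1] for i in idxs) / len(idxs)
        exemplars[label] = min(idxs, key=lambda i: math.dist(coords[i], (cx, cy)))
    return exemplars


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
cluster_labels = [0, 0, 0, 1, 1]
print(cluster_exemplars(points, cluster_labels))  # {0: 1, 1: 3}
```

The returned indices can be used to look up the matching SMILES rows and label those molecules on the plot. Ties, as in the second cluster here, go to the earlier index.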
9. Tips for interpretation and common pitfalls
- Fingerprint choice matters: Morgan is general-purpose; MACCS may capture different substructure features.
- Scaling: descriptor matrices often need scaling (StandardScaler) before PCA/UMAP.
- t-SNE stochasticity: set random_state and try multiple perplexities.
- Overplotting: for large datasets use alpha transparency, density maps, or subsampling.
- Chemical relevance: clustering in embedding space suggests similarity in the chosen representation but may not map directly to biological activity. Validate with domain knowledge or experiments.
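Two of the tips above, scaling and subsampling, are easy to sketch in plain Python. In practice you would more likely reach for scikit-learn's StandardScaler and pandas' DataFrame.sample; the helper names and descriptor values here are illustrative.

```python
import random
import statistics


def standardize_columns(rows):
    """Z-score each column: subtract the mean, divide by the population std."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has std 0; fall back to 1.0 to avoid dividing by zero
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


def subsample(items, max_points, seed=0):
    """Return at most max_points items, chosen at random but reproducibly."""
    items = list(items)
    if len(items) <= max_points:
        return items
    return random.Random(seed).sample(items, max_points)


# Made-up descriptor rows: (molecular weight, logP)
descriptors = [[100.0, 1.0], [200.0, 2.0], [300.0, 3.0]]
scaled = standardize_columns(descriptors)
print(scaled[1])                         # middle row sits at the column means: [0.0, 0.0]
print(len(subsample(range(1000), 100)))  # 100
```

Without scaling, the molecular weight column would dominate any distance-based reduction simply because its values are about a hundred times larger than logP.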
10. Example end-to-end script
    import pandas as pd
    from chemplot import Plotter
    from rdkit import Chem
    df = pd.read_csv('molecules.csv')  # must include a 'smiles' column
    # Keep only rows whose SMILES is a string that RDKit can parse
    is_valid = df['smiles'].apply(lambda s: isinstance(s, str) and Chem.MolFromSmiles(s) is not None)
    df = df[is_valid].reset_index(drop=True)
    # Structural similarity, UMAP embedding, KMeans clusters, HTML export
    cp = Plotter.from_smiles(df['smiles'].tolist(), sim_type='structural')
    cp.umap(random_state=42)
    cp.cluster(n_clusters=4)
    cp.interactive_plot(clusters=True, filename='chemplot_example.html')
11. Advanced topics & resources
- Custom fingerprints or learned embeddings from graph neural networks can be reduced with scikit-learn or umap-learn and plotted in the same way for richer visualizations.
- Integrate with cheminformatics dashboards (Dash, Streamlit) for interactive exploration.
- Use substructure highlighting in hover/tooltips for SAR analysis.
ChemPlot streamlines the path from SMILES strings to insightful visualizations. With careful choice of representation, dimensionality reduction, and visualization settings, it becomes a powerful tool for exploratory chemical data analysis and communicating molecular relationships.