AI + Chemistry: Building Drug Discovery Pipelines with Free Tools
The computational side of an early-stage drug discovery pipeline can now be built entirely with free, open-source tools. From bioactivity data retrieval to ML model training and 3D visualization, the ecosystem covers each step. To make this concrete, this post walks through a pipeline targeting the ABL kinase (c-Abl tyrosine kinase, the target of imatinib), with example code at each stage.
From Data to Discovery: The Open-Source Pipeline (Case Study: ABL Kinase)
The goal: identify new inhibitors of ABL kinase, a key oncology target (Bcr-Abl drives chronic myeloid leukemia). The pipeline covers:
- Gather known bioactivity data for ABL
- Featurize and analyze molecules
- Train an AI model to predict new inhibitors
- Convert and prepare compounds for simulation
- Visualize how they bind
All steps use open resources:
Open Data (ChEMBL)
ABL bioactivity data can be retrieved from ChEMBL, an open database of drug-like molecules and their biological activities with millions of measured compound-target data points. ChEMBL’s web services or bulk downloads provide IC50/Ki values for all compounds tested against ABL, which Pandas can then filter and tabulate.
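As a rough sketch of the web-service route, the snippet below builds a query against ChEMBL's REST activity endpoint and flattens the returned records into plain rows. The target ID CHEMBL1862 (taken here to be human ABL1) and the helper names are assumptions for illustration; confirm the ID on the ChEMBL site before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CHEMBL_API = "https://www.ebi.ac.uk/chembl/api/data/activity.json"


def build_activity_url(target_chembl_id, standard_type="IC50", limit=1000, offset=0):
    """Return a ChEMBL REST URL for one page of activities against a target."""
    query = {
        "target_chembl_id": target_chembl_id,
        "standard_type": standard_type,
        "limit": limit,
        "offset": offset,
    }
    return f"{CHEMBL_API}?{urlencode(query)}"


def flatten_activities(records):
    """Keep records with a SMILES and an exact nM value; return plain dict rows."""
    rows = []
    for rec in records:
        smiles = rec.get("canonical_smiles")
        value = rec.get("standard_value")
        if not smiles or value is None:
            continue
        # Skip censored values (">", "<") and anything not reported in nM
        if rec.get("standard_units") != "nM" or rec.get("standard_relation") != "=":
            continue
        rows.append({
            "molecule_chembl_id": rec.get("molecule_chembl_id"),
            "smiles": smiles,
            "ic50_nM": float(value),
        })
    return rows


def fetch_activities(target_chembl_id, standard_type="IC50", limit=1000):
    """Download every page for a target (requires network access)."""
    rows, offset = [], 0
    while True:
        url = build_activity_url(target_chembl_id, standard_type, limit, offset)
        with urlopen(url) as resp:
            page = json.load(resp)
        rows.extend(flatten_activities(page["activities"]))
        # page_meta["next"] is null on the last page
        if page["page_meta"]["next"] is None:
            break
        offset += limit
    return rows
```

Calling fetch_activities("CHEMBL1862") and passing the result to pd.DataFrame(rows).to_csv("ABL_bioactivity.csv", index=False) would produce a file like the one loaded in the next section.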
Data Handling (Pandas & NumPy)
With our ABL dataset in hand (e.g. as a CSV of SMILES and activity labels), we use Pandas to clean and manipulate it and NumPy for the numerical work. These “classics” form the backbone of any custom pipeline: grouping data, normalizing values, splitting into train/test sets. They are not specific to drug discovery, but their flexibility is indispensable. A first step might be:
import pandas as pd
df = pd.read_csv("ABL_bioactivity.csv")
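Beyond loading the file, a typical cleaning pass might look like the sketch below: convert IC50 values to pIC50, collapse repeat measurements per compound, and assign binary labels. The column names (smiles, ic50_nM), the inline table of made-up placeholder values, and the 1 µM activity cutoff are all assumptions for illustration, not properties of a real ABL export.

```python
import numpy as np
import pandas as pd


def prepare_activity_table(df, active_pic50=6.0):
    """Clean raw IC50 rows into one labelled row per compound."""
    # Drop rows with no structure or no measurement, and non-positive values
    df = df.dropna(subset=["smiles", "ic50_nM"])
    df = df[df["ic50_nM"] > 0].copy()
    # pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM)
    df["pIC50"] = 9.0 - np.log10(df["ic50_nM"])
    # Collapse repeat measurements of the same compound to their median
    table = df.groupby("smiles", as_index=False)["pIC50"].median()
    # Binary label: pIC50 >= 6 corresponds to IC50 <= 1 µM (assumed cutoff)
    table["active"] = (table["pIC50"] >= active_pic50).astype(int)
    return table


# Placeholder rows with made-up values, only to exercise the function
raw = pd.DataFrame({
    "smiles": ["CCO", "CCO", "c1ccccc1", None],
    "ic50_nM": [10.0, 1000.0, 50000.0, 5.0],
})
table = prepare_activity_table(raw)
print(table)
```

Taking the median in pIC50 space rather than averaging raw IC50 values keeps a single outlying measurement from dominating the label.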
RDKit – The Unsung Hero of Cheminformatics
If you’re doing anything with molecules in Python, chances are RDKit is working behind the scenes. RDKit is an open-source cheminformatics toolkit widely used for tasks like generating molecular fingerprints, performing substructure searches, computing descriptors, and manipulating chemical structures.
In our ABL example, we use RDKit to:
- Generate Morgan fingerprints
- Perform substructure searches
- Compute molecular descriptors
Code Example:
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
# Two approved ABL inhibitors that share an anilino-pyrimidine core
imatinib_smiles = "Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C"
nilotinib_smiles = "Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)C(=O)Nc4cc(cc(c4)n5cc(nc5)C)C(F)(F)F"
mol1 = Chem.MolFromSmiles(imatinib_smiles)
mol2 = Chem.MolFromSmiles(nilotinib_smiles)
# Morgan fingerprints (radius 2, 2048 bits), comparable to ECFP4
fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1, radius=2, nBits=2048)
fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2, radius=2, nBits=2048)
# Tanimoto similarity between the two bit vectors (0 = no shared bits, 1 = identical)
sim = DataStructs.TanimotoSimilarity(fp1, fp2)
print(f"Tanimoto similarity: {sim:.2f}")
# Substructure search: the N-phenylpyrimidin-2-amine core written as SMARTS
query = Chem.MolFromSmarts("c1ccc(cc1)Nc2nccc(n2)")
match = mol1.HasSubstructMatch(query)
print("Substructure match:", match)
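The third item in the list, descriptor calculation, is not covered by the example above. A minimal sketch using RDKit's Descriptors module follows; the helper name and the choice of these five descriptors are assumptions made for illustration.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def basic_descriptors(smiles):
    """Return a few common physicochemical descriptors for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None rather than raising on an unparsable SMILES
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return {
        "MolWt": Descriptors.MolWt(mol),         # molecular weight
        "LogP": Descriptors.MolLogP(mol),        # Crippen logP estimate
        "HBD": Descriptors.NumHDonors(mol),      # hydrogen-bond donors
        "HBA": Descriptors.NumHAcceptors(mol),   # hydrogen-bond acceptors
        "TPSA": Descriptors.TPSA(mol),           # topological polar surface area
    }


imatinib_smiles = "Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C"
desc = basic_descriptors(imatinib_smiles)
for name, value in desc.items():
    print(f"{name}: {value:.2f}")
```

Applied across a whole dataset, a function like this gives a small descriptor table that can sit alongside the fingerprints as model input or be used for simple property filters.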
DeepChem – AI Made Beautifully Simple
DeepChem is an open-source library that puts models such as graph convolutional networks (GCNs) and multitask networks within easy reach. Its models are built on TensorFlow or PyTorch, but the library hides most of that complexity behind a common fit/predict interface.
Code Example:
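import deepchem as dc
import numpy as np
# Toy data: two molecules only, enough to show the API but far too few to learn from
smiles_list = [
    "Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C",  # active (imatinib)
    "CCOC(=O)c1ccc(cc1)N",  # inactive
]
labels = np.array([1, 0])
# GraphConvModel expects ConvMol features, so pair it with ConvMolFeaturizer
# (MolGraphConvFeaturizer is intended for the PyTorch models such as GCNModel)
featurizer = dc.feat.ConvMolFeaturizer()
X = featurizer.featurize(smiles_list)
y = labels.reshape(-1, 1)  # shape (n_samples, n_tasks)
dataset = dc.data.NumpyDataset(X, y)
model = dc.models.GraphConvModel(n_tasks=1, mode='classification')
model.fit(dataset, nb_epoch=20)
# Metrics are passed to evaluate(), not to the model constructor
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print(model.evaluate(dataset, [metric]))  # training-set score only
# Class probabilities, shape (n_samples, n_tasks, n_classes)
pred_probs = model.predict(dataset)
print(pred_probs)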
Open Babel – Convert Like a Pro
Open Babel converts between chemical file formats such as SMILES, SDF, and PDB, and can generate 3D coordinates along the way.
Code Example:
from openbabel import pybel
smiles = "Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C"
mol = pybel.readstring("smi", smiles)
mol.addh()
mol.make3D()
mol.write("sdf", "imatinib_3D.sdf", overwrite=True)
PyMOL – Visualize What AI Discovers
PyMOL is great for inspecting protein–ligand complexes and generating publication-quality figures.
Code Example:
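import pymol2
# Assumes ABL_kinase.pdb is a local structure file and that the ligand
# coordinates share its frame (e.g. a docked pose); a freshly generated
# conformer will not sit in the binding site
with pymol2.PyMOL() as pymol:
    cmd = pymol.cmd
    cmd.load("ABL_kinase.pdb", "protein")
    cmd.load("imatinib_3D.sdf", "ligand")
    cmd.hide("everything")
    cmd.show("cartoon", "protein")
    cmd.show("sticks", "ligand")
    cmd.zoom("ligand", 5)  # zoom on the ligand with a 5 Å buffer
    cmd.png("abl_imatinib.png", width=800, height=600, ray=1)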
Chemprop – Graph Neural Networks Made Easy
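Chemprop trains message passing neural networks (MPNNs) for tasks like QSAR and virtual screening, from either the command line or Python. The examples here follow the Chemprop 1.x interface; the 2.x releases changed both the CLI and the Python API.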
CLI Example:
chemprop_train --data_path abl_activity.csv --smiles_columns smiles --target_columns active \
    --dataset_type classification --save_dir abl_model
Python Example:
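import chemprop
# Chemprop 1.x: pass the same arguments as the CLI, parse them, then train
arguments = [
    "--data_path", "abl_activity.csv",
    "--smiles_columns", "smiles",
    "--target_columns", "active",
    "--dataset_type", "classification",
    "--save_dir", "abl_model",
    "--epochs", "30",
]
args = chemprop.args.TrainArgs().parse_args(arguments)
# cross_validate runs run_training once per fold and returns the mean and std of the metric
mean_score, std_score = chemprop.train.cross_validate(args=args, train_func=chemprop.train.run_training)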
The Classics: Pandas, NumPy, Scikit-Learn – Data Science Backbone
These libraries handle everything from preprocessing to baseline models.
Example Random Forest:
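from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Fingerprint the molecules the labels refer to (smiles_list and labels come from
# the DeepChem example, Chem and AllChem from the RDKit example)
mols = [Chem.MolFromSmiles(s) for s in smiles_list]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=2048) for m in mols]
X = [list(map(int, fp.ToBitString())) for fp in fps]
y = labels
# With two toy molecules this leaves one for training and one for validation;
# the score only becomes meaningful on the full ABL dataset
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("Validation accuracy:", rf.score(X_val, y_val))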
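One caveat with a random split is that close analogues can land on both sides, which tends to flatter the validation score. A common alternative, sketched below, groups molecules by Bemis-Murcko scaffold so that each scaffold appears in only one set. The helper name and the short example list are assumptions for illustration, and GroupShuffleSplit is used here as one convenient way to split by group, not the only one.

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold
from sklearn.model_selection import GroupShuffleSplit


def scaffold_split(smiles_list, test_size=0.25, seed=0):
    """Split indices so that no Murcko scaffold is shared between train and test."""
    scaffolds = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            raise ValueError(f"Could not parse SMILES: {smi}")
        # Scaffold SMILES = ring systems plus linkers, side chains removed
        scaffolds.append(MurckoScaffold.MurckoScaffoldSmiles(mol=mol))
    # test_size applies to the number of scaffold groups, not molecules
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(smiles_list, groups=scaffolds))
    return train_idx, test_idx, scaffolds


# Small illustrative list: two kinase inhibitors plus simple benzene and indole analogues
example_smiles = [
    "Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C",
    "Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)C(=O)Nc4cc(cc(c4)n5cc(nc5)C)C(F)(F)F",
    "CCOC(=O)c1ccc(cc1)N",
    "Cc1ccc(cc1)N",
    "c1ccc2[nH]ccc2c1",
    "OC(=O)c1ccc2[nH]ccc2c1",
]
train_idx, test_idx, scaffolds = scaffold_split(example_smiles)
print("train:", list(train_idx))
print("test:", list(test_idx))
```

The returned index arrays can be used to slice the fingerprint matrix and labels in place of train_test_split.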
The Open-Source Revolution in Action
An end-to-end drug discovery pipeline can be built with open-source tools at every step, each covering a distinct role:
- ChEMBL and PubChem for bioactivity data
- RDKit for cheminformatics and structure manipulation
- DeepChem and Chemprop for ML-based activity prediction
- Open Babel for format conversion
- PyMOL for 3D visualization and pose inspection
- Pandas, NumPy, and scikit-learn for data handling and baseline models
All of these are free (for PyMOL, this means the open-source build; the prebuilt releases from Schrödinger are licensed separately), actively maintained, and typically installable with a single pip or conda command. Community contributions mean new algorithms and best practices are integrated quickly. This makes a competitive computational pipeline accessible to academic labs and startups without proprietary software costs.