AI + Chemistry: Building Drug Discovery Pipelines with Free Tools

Jul 1, 2025 · Yassir Boulaamane · 4 min read

A complete drug discovery pipeline can now be built entirely with free, open-source tools. From bioactivity data retrieval to ML model training and 3D visualization, the ecosystem covers every step. To make this concrete, this post walks through a pipeline targeting the ABL kinase (c-Abl tyrosine kinase, the target of imatinib), with working code at each stage.

From Data to Discovery: The Open-Source Pipeline (Case Study: ABL Kinase)

The goal: identify new inhibitors of ABL kinase, a key oncology target (Bcr-Abl drives chronic myeloid leukemia). The pipeline covers:

Gather known bioactivity data for ABL
Featurize and analyze molecules
Train an AI model to predict new inhibitors
Convert and prepare compounds for simulation
Visualize how they bind

All steps use open resources:

Open Data (ChEMBL)

ABL bioactivity data can be retrieved from ChEMBL, an open database of drug-like molecules and their biological activities with millions of measured compound-target data points. ChEMBL’s web services or bulk downloads provide IC50/Ki values for all compounds tested against ABL, which Pandas can then filter and tabulate.
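ChEMBL reports potencies such as IC50 as concentrations, while the classifiers later in this post expect binary labels. The sketch below shows one common way to bridge the two: convert IC50 to pIC50 and apply a cutoff. It assumes IC50 values in nM and a cutoff of pIC50 ≥ 6 (1 µM); the helper names, the cutoff, and the three records are illustrative assumptions, not real ChEMBL measurements.

```python
import math


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def label_active(ic50_nm: float, cutoff: float = 6.0) -> int:
    """Return 1 if pIC50 meets the (assumed) cutoff, else 0."""
    return int(pic50_from_nm(ic50_nm) >= cutoff)


# Placeholder records (compound ID, IC50 in nM); values are made up for illustration.
records = [("cpd_A", 25.0), ("cpd_B", 500.0), ("cpd_C", 48000.0)]

for name, ic50 in records:
    print(name, round(pic50_from_nm(ic50), 2), label_active(ic50))
```

The right cutoff depends on the project; a stricter threshold yields fewer but more potent "actives".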

Data Handling (Pandas & NumPy)

With our ABL dataset in hand (e.g. as a CSV of SMILES and activity labels), we use Pandas to clean and manipulate it and NumPy for numerical computing. These “classics” form the backbone of any custom pipeline, handling tasks such as grouping data, normalizing values, and splitting into train/test sets. They might not be drug discovery-specific, but their flexibility is indispensable. We might do:

import pandas as pd
df = pd.read_csv("ABL_bioactivity.csv")
df = df.dropna().drop_duplicates()  # remove incomplete and repeated rows
print(df.head())

RDKit – The Unsung Hero of Cheminformatics

If you’re doing anything with molecules in Python, chances are RDKit is working behind the scenes. RDKit is an open-source cheminformatics toolkit widely used for tasks like generating molecular fingerprints, performing substructure searches, computing descriptors, and manipulating chemical structures.

In our ABL example, we use RDKit to:

Generate Morgan fingerprints
Perform substructure searches
Compute molecular descriptors

Code Example:

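from rdkit import Chem, DataStructs
from rdkit.Chem import Descriptors, rdFingerprintGenerator

# SMILES for two approved ABL inhibitors
imatinib_smiles = "Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C"
nilotinib_smiles = "Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)C(=O)Nc4cc(cc(c4)n5cc(nc5)C)C(F)(F)F"

mol1 = Chem.MolFromSmiles(imatinib_smiles)
mol2 = Chem.MolFromSmiles(nilotinib_smiles)

# Morgan fingerprints: radius 2, folded to 2048 bits.
# GetMorganGenerator replaces the deprecated AllChem.GetMorganFingerprintAsBitVect.
morgan_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
fp1 = morgan_gen.GetFingerprint(mol1)
fp2 = morgan_gen.GetFingerprint(mol2)

# Tanimoto similarity between the two bit vectors (0 = no shared bits, 1 = identical)
sim = DataStructs.TanimotoSimilarity(fp1, fp2)
print(f"Tanimoto similarity: {sim:.2f}")

# Substructure search for the phenylaminopyrimidine core, written as SMARTS
query = Chem.MolFromSmarts("c1ccc(cc1)Nc2nccc(n2)")
match = mol1.HasSubstructMatch(query)
print("Substructure match:", match)

# Two simple descriptors: molecular weight and Crippen logP
print("MolWt:", Descriptors.MolWt(mol1), "LogP:", Descriptors.MolLogP(mol1))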
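To make the similarity score less of a black box, here is a minimal sketch of the Tanimoto coefficient on plain Python sets. Each fingerprint is represented by the indices of its set bits; the function name and the two toy fingerprints are assumptions for illustration and are not derived from real molecules.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(union)


# Toy fingerprints given as indices of set bits.
fp_a = {1, 5, 9, 42}
fp_b = {5, 9, 42, 77, 100}

# 3 shared bits out of 6 distinct bits
print(f"Tanimoto similarity: {tanimoto(fp_a, fp_b):.2f}")
```

With RDKit bit vectors, `set(fp.GetOnBits())` yields the same kind of index set, so the result should agree with `DataStructs.TanimotoSimilarity`.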

DeepChem – AI Made Beautifully Simple

DeepChem is an open-source library that brings advanced models like GCNs and multitask networks to your fingertips. It’s built on TensorFlow/PyTorch but hides the complexity.

Code Example:

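import deepchem as dc
import numpy as np

# Toy data for illustration only; a real model needs the full ABL dataset.
smiles_list = [
    "Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C",  # active
    "CCOC(=O)c1ccc(cc1)N",  # inactive
]
labels = np.array([1, 0])

# GraphConvModel expects ConvMol features, so pair it with ConvMolFeaturizer
# (MolGraphConvFeaturizer targets the PyTorch graph models such as GCNModel).
featurizer = dc.feat.ConvMolFeaturizer()
X = featurizer.featurize(smiles_list)
y = labels.reshape(-1, 1)  # shape (n_samples, n_tasks)

dataset = dc.data.NumpyDataset(X, y)

model = dc.models.GraphConvModel(n_tasks=1, mode='classification')
model.fit(dataset, nb_epoch=20)

# Predicted class probabilities, shape (n_samples, n_tasks, n_classes)
pred_probs = model.predict(dataset)
print(pred_probs)

# Metrics are passed to evaluate(), not to the model constructor.
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print(model.evaluate(dataset, [metric]))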
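The ROC AUC metric named in the example above has a simple interpretation: it is the probability that a randomly chosen active is scored higher than a randomly chosen inactive. The sketch below computes it by direct pair counting. The function name and the five predicted probabilities are hypothetical, and library implementations use a faster ranking-based method.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (active, inactive) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities for five compounds.
y_true = [1, 1, 0, 0, 0]
y_score = [0.9, 0.4, 0.5, 0.2, 0.1]

# 5 of the 6 active/inactive pairs are ordered correctly
print(f"ROC AUC: {roc_auc(y_true, y_score):.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of actives from inactives.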

Open Babel – Convert Like a Pro

Open Babel helps switch between formats like SMILES, SDF, PDB, etc.

Code Example:

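from openbabel import pybel

# Parse the imatinib SMILES, build 3D coordinates, and save as SDF.
smiles = "Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C"
mol = pybel.readstring("smi", smiles)
mol.make3D()  # adds hydrogens, generates coordinates, runs a short force-field cleanup
mol.write("sdf", "imatinib_3D.sdf", overwrite=True)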

PyMOL – Visualize What AI Discovers

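PyMOL is great for inspecting protein–ligand complexes and generating publication-quality figures. Note that the SDF written by Open Babel above holds a free conformer with arbitrary coordinates, so it will not sit in the binding pocket; to inspect binding, load a docked pose or a co-crystal structure instead.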

Code Example:

import pymol2

with pymol2.PyMOL() as pymol:
    cmd = pymol.cmd
    cmd.load("ABL_kinase.pdb", "protein")
    cmd.load("imatinib_3D.sdf", "ligand")
    cmd.hide("everything")
    cmd.show("cartoon", "protein")
    cmd.show("sticks", "ligand")
    cmd.zoom("ligand", 5)
    cmd.png("abl_imatinib.png", width=800, height=600, ray=1)

Chemprop – Graph Neural Networks Made Easy

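Chemprop offers fast training of message passing neural networks (MPNNs) for tasks like QSAR and virtual screening. The examples here use the Chemprop v1 interface (chemprop_train); Chemprop v2 moved to a `chemprop train` subcommand with different flag names.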

CLI Example:

chemprop_train --data_path abl_activity.csv --smiles_columns smiles --target_columns active \
           --dataset_type classification --save_dir abl_model

Python Example:

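import chemprop

# Chemprop v1 takes the same arguments as the CLI, parsed into a TrainArgs object.
arguments = [
    "--data_path", "abl_activity.csv",
    "--smiles_columns", "smiles",
    "--target_columns", "active",
    "--dataset_type", "classification",
    "--save_dir", "abl_model",
    "--epochs", "30",
]
args = chemprop.args.TrainArgs().parse_args(arguments)
mean_score, std_score = chemprop.train.cross_validate(
    args=args, train_func=chemprop.train.run_training
)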
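Both Chemprop examples assume that abl_activity.csv has a header row with a smiles column and a binary active column. A minimal sketch of writing such a file with the standard library follows; the helper name is hypothetical, and the two rows reuse the toy molecules and labels from the DeepChem example, which is far too little data for real training.

```python
import csv


def write_activity_csv(path, rows):
    """Write (smiles, active) pairs to a CSV file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["smiles", "active"])
        writer.writerows(rows)


# Toy rows: one active (imatinib) and one inactive molecule.
rows = [
    ("Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C", 1),
    ("CCOC(=O)c1ccc(cc1)N", 0),
]
write_activity_csv("abl_activity.csv", rows)
```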

The Classics: Pandas, NumPy, Scikit-Learn – Data Science Backbone

These libraries handle everything from preprocessing to baseline models.

Example Random Forest:

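import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Featurize the molecules that the labels actually describe (smiles_list and labels
# come from the DeepChem example). With two toy molecules the score is meaningless.
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
X = np.array([gen.GetFingerprintAsNumPy(Chem.MolFromSmiles(s)) for s in smiles_list])
y = labels

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("Validation accuracy:", model.score(X_val, y_val))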

The Open-Source Revolution in Action

An end-to-end drug discovery pipeline can be built with open-source tools at every step, each covering a distinct role:

ChEMBL and PubChem for bioactivity data
RDKit for cheminformatics and structure manipulation
DeepChem and Chemprop for ML-based activity prediction
Open Babel for format conversion
PyMOL for 3D visualization and pose inspection
Pandas, NumPy, and scikit-learn for data handling and baseline models

All of these are free to use and actively maintained, and most install with a single pip or conda command (for PyMOL, this means the open-source build rather than the commercial Schrödinger distribution). Community contributions mean that new algorithms and best practices are integrated quickly, which puts a competitive computational pipeline within reach of academic labs and startups without proprietary software costs.

Last updated on Jul 1, 2025