Beyond 2D Fingerprints: Encoding Protein-Ligand Interactions for Machine Learning

If you are still feeding two-dimensional fingerprints into your machine learning models, you are likely discarding the most important signal your docking calculations ever produce.

Virtual screening and binding affinity prediction have matured enormously in the last decade, yet a quiet bottleneck persists at the hand-off between docking software and machine learning pipelines: featurization. The docking engine labors to produce rich three-dimensional poses—atomic contacts, pi-stacking geometries, hydrogen bond vectors—and then many workflows collapse all of that information into a flat, topology-only fingerprint before passing it to a classifier or regressor.

This is an understandable compromise. Two-dimensional fingerprints are fast, well-understood, and compatible with virtually every ML framework. But for tasks where the three-dimensional binding geometry is the phenomenon you are trying to predict, that compression is lossy in ways that matter.

Three advanced fingerprinting methods offer a better path. Each is suited to a different modeling context.

PLEC: Protein-Ligand Extended Connectivity

PLEC is an interaction fingerprint inspired by the familiar ECFP circular fingerprint family, but designed specifically for the protein-ligand interface. Rather than describing the ligand or receptor in isolation, it encodes the chemical environments of contacting atom pairs on both sides simultaneously.

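The result is a fingerprint rich in contact chemistry: which ligand atom environments sit next to which protein atom environments. Contacts are selected with a distance cutoff, but exact distances and angles are not stored explicitly, so the fingerprint records what is touching rather than the precise geometry of each contact. Your model receives direct information about what is actually in contact at the binding site.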

Best for: Reactivity and affinity modeling where specific contact chemistry drives the outcome—cases where knowing which chemical groups are touching is more predictive than knowing overall molecular shape.
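
To make the pairing idea concrete, here is a minimal sketch of a PLEC-like encoding in plain Python. It is not ODDT's implementation: the `toy_plec` name, the environment labels, the coordinates, the 4.5 angstrom cutoff, and the CRC32 hash are all assumptions chosen for illustration. The published method hashes ECFP-style environments at several depths per atom rather than one label per atom.

```python
import math
import zlib


def toy_plec(ligand_atoms, protein_atoms, cutoff=4.5, size=1024):
    """Count hashed (ligand env, protein env) pairs for atoms within `cutoff` angstroms."""
    counts = [0] * size
    for lig_env, lig_xyz in ligand_atoms:
        for prot_env, prot_xyz in protein_atoms:
            if math.dist(lig_xyz, prot_xyz) <= cutoff:
                # Hash the pair of environment labels, then fold onto `size` bins
                key = f"{lig_env}|{prot_env}".encode()
                counts[zlib.crc32(key) % size] += 1
    return counts


# Hypothetical atoms: (environment label, xyz in angstroms)
ligand_atoms = [("aromatic_C", (0.0, 0.0, 0.0)), ("carbonyl_O", (5.0, 0.0, 0.0))]
protein_atoms = [
    ("Phe_ring_C", (0.0, 3.8, 0.0)),
    ("Ser_OH", (5.0, 2.9, 0.0)),
    ("Leu_CH3", (20.0, 0.0, 0.0)),
]

toy_fp = toy_plec(ligand_atoms, protein_atoms)
n_contacts = sum(toy_fp)  # number of ligand-protein pairs inside the cutoff
```

Only pairs inside the cutoff contribute, and each contribution depends on both partners, which is why the same ligand group lands in different bins when it contacts different residues.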

Computing PLEC with ODDT

import oddt
import oddt.fingerprints as fps
from oddt.toolkits.rdk import readfile

# Load receptor and ligand
receptor = next(readfile('pdb', 'receptor.pdb'))
receptor.protein = True

ligand = next(readfile('sdf', 'docked_pose.sdf'))

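# Calculate PLEC fingerprint
# depth_ligand and depth_protein control circular environment depth
# sparse=False returns a dense vector of length `size`; ODDT's default
# (sparse=True) returns only the indices of the on-bits
plec = fps.PLEC(ligand, receptor, depth_ligand=2, depth_protein=4,
                size=16384, sparse=False)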

The size parameter controls the bit vector length. Larger values reduce collision probability at the cost of memory. For ensemble docking workflows, a size of 16384 is a reasonable starting point.
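
The collision trade-off can be illustrated with a small stdlib-only sketch. The feature hashes here are random 32-bit integers standing in for hashed interaction features, and the `fold` helper is a hypothetical name, not an ODDT function. The folding rule (modulo the vector size) is an assumption that mirrors how hashed fingerprints are commonly folded.

```python
import random


def fold(feature_ids, size):
    """Map raw feature hashes onto `size` bins and return the set of on-bits."""
    return {fid % size for fid in feature_ids}


rng = random.Random(0)
# Hypothetical pool of about 2000 distinct interaction hashes
feature_ids = {rng.getrandbits(32) for _ in range(2000)}

collisions_by_size = {}
for size in (1024, 4096, 16384, 65536):
    on_bits = fold(feature_ids, size)
    # Every feature that lands on an already-occupied bit is a collision
    collisions_by_size[size] = len(feature_ids) - len(on_bits)

for size, collisions in collisions_by_size.items():
    print(f"size={size:>6}  collisions={collisions}")
```

Running a check like this on the on-bit counts you actually observe is a quick way to judge whether a smaller vector would lose information for your dataset.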

If your task involves predicting reactivity at a specific catalytic site, appending a hand-crafted distance feature to the PLEC vector is often worth the effort:

import numpy as np

# Example: compute distance from ligand centroid to a metal/active site atom
# Replace with your actual active site coordinate
active_site_coord = np.array([12.3, 45.6, 7.8])
ligand_centroid = ligand.atom_dict['coords'].mean(axis=0)
dist_to_site = np.linalg.norm(ligand_centroid - active_site_coord)

# Append to the dense PLEC vector. This assumes plec was computed with
# sparse=False; ODDT's default sparse output lists on-bit indices instead.
plec_extended = np.append(plec, dist_to_site)

SPLIF: Structural Protein-Ligand Interaction Fingerprint

SPLIF focuses on the spatial geometry of interacting fragments rather than cataloguing which atoms touch. Complex non-covalent interactions such as pi-pi stacking and T-shaped aromatic contacts are encoded implicitly through geometric relationships, without relying on hard-coded interaction rules.

This makes SPLIF more robust to the quirks of different docking software and a natural choice when the goal is geometric fidelity to a reference binding mode.

Best for: Structure-based virtual screening where the goal is to find compounds that reproduce the exact three-dimensional binding geometry of a known reference ligand.

Computing SPLIF with ODDT

# SPLIF is also available through ODDT
# Unlike PLEC, it takes a single depth argument for both ligand and protein
splif = fps.SPLIF(ligand, receptor, depth=1, size=4096)

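SPLIF fingerprints tend to be sparser than PLEC, reflecting the focus on discrete interaction sites rather than extended chemical environments. Note that ODDT's SPLIF does not return a plain bit vector: it returns a structured array that stores each hashed fragment pair together with the coordinates of the ligand and protein fragments. When comparing poses against a reference, use the dedicated similarity_SPLIF function, which compares matching fragment pairs by coordinate RMSD and therefore gives a meaningful geometric similarity score.

from oddt.fingerprints import similarity_SPLIF

# reference_pose.sdf is a placeholder for your known binding mode
reference_ligand = next(readfile('sdf', 'reference_pose.sdf'))
reference_splif = fps.SPLIF(reference_ligand, receptor)

# Matching fragment pairs are compared by RMSD, up to rmsd_cutoff
similarity = similarity_SPLIF(reference_splif, splif, rmsd_cutoff=1.0)
print(f"Geometric similarity to reference: {similarity:.3f}")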
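
What makes a comparison geometric rather than purely chemical can be shown with a toy example. The sketch below is a simplified stand-in, not ODDT's scoring function: the `toy_splif_similarity` name, the data layout (a hash pair plus one coordinate per contact), and the hard distance cutoff are all assumptions. A contact counts as shared only if the same fragment pair appears in both poses and sits in roughly the same place.

```python
import math


def toy_splif_similarity(reference, query, cutoff=1.0):
    """Tanimoto-like score where contacts must match in identity AND position."""
    matched = 0
    used = set()  # query contacts already paired with a reference contact
    for ref_key, ref_xyz in reference:
        for i, (q_key, q_xyz) in enumerate(query):
            if i in used or q_key != ref_key:
                continue
            if math.dist(ref_xyz, q_xyz) <= cutoff:
                matched += 1
                used.add(i)
                break
    union = len(reference) + len(query) - matched
    return matched / union if union else 0.0


# Hypothetical contacts: ((ligand fragment hash, protein fragment hash), xyz)
reference_contacts = [((1, 10), (0.0, 0.0, 0.0)), ((2, 20), (3.0, 0.0, 0.0))]
query_contacts = [((1, 10), (0.5, 0.0, 0.0)), ((2, 20), (6.0, 0.0, 0.0))]

score = toy_splif_similarity(reference_contacts, query_contacts)
```

Both poses make the same two kinds of contact, so a purely chemical comparison would call them identical. Here only the first contact is close enough in space to count, and the score drops accordingly.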

E3FP: Extended 3-Dimensional FingerPrint

E3FP adapts the circular fingerprint framework into three-dimensional space. Instead of radiating outward through bonds in a molecular graph, it radiates outward from atoms through Euclidean space, capturing the overall conformational shape and pharmacophore features of a specific molecular pose.

Two conformers of the same compound share a 2D structure but can produce distinct E3FP fingerprints. This makes E3FP sensitive to conformation in a way that ECFP-style 2D fingerprints are not.

Best for: Ligand-based virtual screening and scaffold-hopping—finding molecules with similar three-dimensional shapes and electronic properties even when their two-dimensional structures look entirely different.
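
The difference between walking bonds and radiating through space shows up even in a four-atom toy chain. The sketch below covers only the neighborhood definition and is an illustration under stated assumptions: the `shell_neighbors` helper and both sets of coordinates are hypothetical, and the real E3FP algorithm additionally hashes atom invariants and the relative orientation of neighbors.

```python
import math


def shell_neighbors(coords, center, radius):
    """Indices of atoms within `radius` of atom `center`, measured through space."""
    origin = coords[center]
    return sorted(
        i for i, xyz in enumerate(coords)
        if i != center and math.dist(origin, xyz) <= radius
    )


# Same bonded chain 0-1-2-3 in two hypothetical conformations (angstroms)
extended = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (4.5, 0.0, 0.0)]
folded = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (0.0, 1.5, 0.0)]

shell_extended = shell_neighbors(extended, center=0, radius=2.0)
shell_folded = shell_neighbors(folded, center=0, radius=2.0)
```

In the extended chain, atom 0 sees only its bonded neighbor. In the folded chain, atom 3 is three bonds away but close in space, so it enters the shell. A bond-based fingerprint sees no difference between the two, while a spatial one does.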

Computing E3FP

# Install e3fp: pip install e3fp
import numpy as np
from e3fp.pipeline import fprints_from_mol
from rdkit import Chem
from rdkit.Chem import AllChem

# Build a molecule with a 3D conformer (replace the placeholder SMILES)
mol = Chem.MolFromSmiles('your_smiles_here')
mol = Chem.AddHs(mol)
mol.SetProp('_Name', 'ligand')  # e3fp reads the molecule name from this property
AllChem.EmbedMolecule(mol, AllChem.ETKDGv3())
AllChem.MMFFOptimizeMolecule(mol)

# Generate one E3FP fingerprint per conformer (a single conformer here)
fprints = fprints_from_mol(mol, fprint_params={'bits': 4096, 'level': 5})

# Convert to dense array for ML
e3fp_array = np.array(fprints[0].to_vector(sparse=False), dtype=np.float32)

For ensemble docking pipelines where multiple poses are generated per ligand, you can aggregate E3FP fingerprints across conformers before feeding them to an ML model:

# Mean-pool across poses: bits shared by many poses get high weight,
# pose-specific bits get low weight
fprint_arrays = [np.array(fp.to_vector(sparse=False), dtype=np.float32) for fp in fprints]
e3fp_consensus = np.mean(fprint_arrays, axis=0)
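
Mean pooling is one choice among several, and the aggregation rule changes what the model sees. The stdlib-only sketch below contrasts mean pooling with max pooling (a bitwise union) on hypothetical 4-bit fingerprints. The `pool_fingerprints` helper is an assumed name for illustration, not part of e3fp or ODDT.

```python
def pool_fingerprints(fingerprints, mode="mean"):
    """Aggregate equal-length bit vectors column by column."""
    if not fingerprints:
        raise ValueError("need at least one fingerprint")
    columns = zip(*fingerprints)
    if mode == "mean":
        # Fraction of poses in which each bit is set
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        # Bit is set if any pose sets it (union of poses)
        return [max(col) for col in columns]
    raise ValueError(f"unknown mode: {mode}")


# Three hypothetical poses of one ligand
poses = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]

mean_pooled = pool_fingerprints(poses, mode="mean")
max_pooled = pool_fingerprints(poses, mode="max")
```

Mean pooling keeps a graded signal that down-weights features seen in a single pose. Max pooling keeps every feature any pose produced, which may suit cases where one good pose matters more than the consensus.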

Choosing the Right Representation

The choice between these methods is not purely a performance question—it is a question about what your model needs to learn.

	PLEC	SPLIF	E3FP
Perspective	Protein + Ligand	Protein + Ligand	Ligand only
Encodes	Contact chemistry	Spatial geometry	Shape + pharmacophore
Handles ensemble noise	Well	Moderate	Requires aggregation
Primary use case	Reactivity / affinity	Geometric screening	Ligand-based VS

Ensemble docking complicates the picture slightly. When poses are sampled across multiple receptor conformations, the fingerprint you choose needs to be robust enough to tolerate conformational noise while still preserving the signal relevant to your prediction task. Interaction fingerprints like PLEC tend to be more tolerant of this variability because the protein environment provides an additional frame of reference that stabilizes the encoding across conformers.

No fingerprint method captures everything. Strong pipelines often combine a primary fingerprint with domain-specific hand-crafted features. Before committing to a single representation, it is worth profiling which structural features your model is actually learning to use. The answer often reveals gaps that a targeted descriptor can fill more efficiently than switching fingerprint methods entirely.
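
One lightweight way to do that profiling is to tag each feature column with its origin and sum importances per group. The sketch below assumes a naming convention of `group:detail` and uses made-up importance values. In practice they would come from whatever importance measure your model exposes, and `importance_by_group` is a hypothetical helper.

```python
from collections import defaultdict


def importance_by_group(feature_names, importances):
    """Sum per-feature importances by the prefix before the first colon."""
    totals = defaultdict(float)
    for name, value in zip(feature_names, importances):
        group = name.split(":", 1)[0]
        totals[group] += value
    return dict(totals)


# Hypothetical columns: three fingerprint bits plus one hand-crafted distance
feature_names = ["plec:17", "plec:902", "plec:4051", "handcrafted:dist_to_site"]
importances = [0.10, 0.05, 0.15, 0.70]  # illustrative values only

group_totals = importance_by_group(feature_names, importances)
```

If a single hand-crafted column carries most of the weight, as in this invented case, that suggests the fingerprint is missing the geometric information the descriptor supplies.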

Conclusion

Two-dimensional fingerprints will remain useful for fast, ligand-only screening where speed and corpus size matter most. But for tasks where the binding geometry is the phenomenon under study, moving to 3D interaction representations is a prerequisite for building models that are learning the right thing.

PLEC, SPLIF, and E3FP each occupy a distinct niche: contact chemistry, geometric mimicry, and conformational shape, respectively. Matching the fingerprint to the modeling task, rather than defaulting to whatever is most convenient, is one of the higher-leverage decisions in building a QSAR pipeline.