Graphs in drug discovery have gone from a quiet background tool to one of the main ways we think about molecules, proteins, and their interactions. This post walks through that story: how the field moved from fingerprints and QSAR to today’s 3D, attention-based graph neural networks operating directly on protein-ligand complexes.
Building a QSAR model is only half the job. The harder question is: does it actually work? Overfitted models routinely pass internal checks while failing completely on new compounds. The OECD principles and decades of best-practice literature have converged on a three-tier validation framework that separates what a model has memorised from what it can genuinely predict.
This post summarizes the key ideas from Zhang et al. (2026), “Molecular Knowledge Representations in the Era of Artificial Intelligence,” a preprint published on ChemRxiv (DOI: 10.26434/chemrxiv.15002830/v1).

The Core Problem

Molecules are quantum-mechanical objects. Their exact description is computationally intractable, and any real sample is a messy mixture of impurities, conformers, and side products. This means every representation of a molecule is, by necessity, an approximation, shaped by the interactions and length scales we care about.
Your docking pose is only as trustworthy as your starting coordinates. Here is a systematic guide to navigating the PDB, avoiding common pitfalls, and future-proofing your workflow for the coming mmCIF era.
A comprehensive walkthrough of cheminformatics, machine learning, molecular docking, ADMET prediction, and molecular dynamics simulations as the modern toolbox for computer-aided drug discovery.
A beginner-friendly walkthrough of PyTorch Geometric's point cloud tutorial — covering the Data object, transforms, dynamic graph construction, PointNet++ message passing, and graph-level classification.
Simon Willison recently appeared on Lenny’s Podcast to discuss what he calls the November inflection point: the moment in late 2025 when frontier models crossed a threshold where agentic coding went from “mostly works if you watch carefully” to “almost always does what you asked.” His highlights post is worth reading in full, and when read through the lens of computational drug discovery, several of its themes land with unusual force.
A practical, beginner-friendly introduction to the Deep Graph Library (DGL) and how to use it to featurize protein–ligand complexes for machine learning in drug discovery.
A practical guide to three advanced 3D fingerprinting methods (PLEC, SPLIF, and E3FP) and how to choose between them when featurizing docking poses for ML-based drug discovery models.
A conceptual overview of the key energetic contributions governing protein–ligand binding in molecular docking, including desolvation, entropy, water displacement, electrostatics, and scoring function behavior.