Beyond SMILES: The Evolving Landscape of Molecular Representations
This post summarizes the key ideas from Zhang et al. (2026), “Molecular Knowledge Representations in the Era of Artificial Intelligence,” a preprint published on ChemRxiv (DOI: 10.26434/chemrxiv.15002830/v1).
The Core Problem
Molecules are quantum-mechanical objects. Their exact description is computationally intractable, and any real sample is a messy mixture of impurities, conformers, and side products. This means every representation of a molecule is, by necessity, an approximation — shaped by the interactions and length scales we care about.
The central challenge in modern chemical AI is therefore not just encoding molecular structure, but doing so in ways that are simultaneously useful to machines and interpretable by humans.
Zhang et al. organize the landscape of molecular representations through the lens of topology, distinguishing three families: discrete, continuous, and hybrid.
The Three Families of Representation
Discrete Representations
Discrete representations encode molecules as countable symbolic units. They are human-readable and traceable, making them foundational to both classical cheminformatics and rule-based AI.
Molecular graphs represent atoms as nodes and bonds as edges. File formats like Molfile and tools like RDKit operate in this paradigm. Graph Neural Networks (GNNs) were built to process exactly this structure, and they respect symmetries such as invariance to atom ordering (permutation invariance) that string representations struggle to capture.
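To make the atom-ordering point concrete, here is a minimal sketch in plain Python. It is not a GNN: the neighbor-label refinement is a simplified, Weisfeiler-Lehman-style stand-in for message passing, and the function names are illustrative assumptions. It shows that sorting neighbor information and counting the final labels gives the same summary no matter how the atoms are indexed.

```python
from collections import Counter

# Ethanol heavy atoms (CH3-CH2-OH), hydrogens implicit.
atoms_a = ["C", "C", "O"]
bonds_a = [(0, 1), (1, 2)]

# The same molecule with the atoms listed in the opposite order.
atoms_b = ["O", "C", "C"]
bonds_b = [(0, 1), (1, 2)]

# Dimethyl ether (CH3-O-CH3): same atoms, different connectivity.
atoms_c = ["C", "O", "C"]
bonds_c = [(0, 1), (1, 2)]


def neighbor_lists(n_atoms, bonds):
    """Build an adjacency list from a list of (i, j) bonds."""
    nbrs = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        nbrs[i].append(j)
        nbrs[j].append(i)
    return nbrs


def graph_fingerprint(atoms, bonds, rounds=2):
    """Order-independent summary: refine atom labels from neighbors, then count them."""
    nbrs = neighbor_lists(len(atoms), bonds)
    labels = list(atoms)
    for _ in range(rounds):
        # Sorting the neighbor labels removes any dependence on atom indices.
        labels = [
            labels[i] + "(" + ",".join(sorted(labels[j] for j in nbrs[i])) + ")"
            for i in range(len(atoms))
        ]
    # Counting labels (a multiset) is itself independent of atom order.
    return Counter(labels)


same = graph_fingerprint(atoms_a, bonds_a) == graph_fingerprint(atoms_b, bonds_b)
different = graph_fingerprint(atoms_a, bonds_a) != graph_fingerprint(atoms_c, bonds_c)
print(same, different)  # True True
```

The two orderings of ethanol produce identical fingerprints, while dimethyl ether, which has the same atoms but different bonds, does not.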
SMILES strings are the dominant text encoding: a depth-first traversal of the molecular graph serialized into a linear sequence. Alternatives like SELFIES guarantee that every string decodes to a valid molecule, making them popular for generative models. SMARTS and SMIRKS extend the SMILES grammar to encode substructure queries and reaction transformations, respectively.
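The depth-first idea can be sketched in a few lines of plain Python. This is a deliberately limited illustration, not a real SMILES writer: it assumes an acyclic graph with single bonds only, implicit hydrogens, and atoms from the organic subset. It has no ring closures, bond orders, aromaticity, or stereochemistry, and the function name `to_smiles` is an assumption made for this example.

```python
def to_smiles(atoms, bonds, start=0):
    """Write a SMILES-like string for an acyclic, single-bonded heavy-atom graph."""
    nbrs = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def visit(atom, parent):
        children = [n for n in nbrs[atom] if n != parent]
        out = atoms[atom]
        # Every child except the last is written as a parenthesized branch.
        for child in children[:-1]:
            out += "(" + visit(child, atom) + ")"
        # The last child continues the main chain.
        if children:
            out += visit(children[-1], atom)
        return out

    return visit(start, None)


# Isopropanol: a central carbon bonded to two methyl carbons and an oxygen.
atoms = ["C", "C", "C", "O"]
bonds = [(0, 1), (1, 2), (1, 3)]

print(to_smiles(atoms, bonds, start=0))  # CC(C)O
print(to_smiles(atoms, bonds, start=3))  # OC(C)C
```

Starting the traversal from a different atom yields a different but equally valid string for the same molecule. This is why canonicalization matters for SMILES and why graph-based models avoid the ordering problem altogether.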
Expert systems like DENDRAL (1960s) and LHASA were among the first to formalize chemical reasoning as explicit rule sets. Modern descendants like Chematica (now SYNTHIA) combine reaction templates with heuristics and cost functions for retrosynthesis planning. Knowledge graphs (e.g., OntoSpecies, ChEBI) take this further, linking entities via formal ontologies that support automated inference.
Continuous Representations
Continuous representations express molecular information as real-valued vectors or functions, enabling gradient-based optimization and learning.
Spatial representations — Cartesian coordinates, Coulomb matrices, atomic symmetry functions — directly encode 3D geometry. They are essential for molecular dynamics, energy prediction, and conformer generation.
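As one concrete example, the Coulomb matrix can be written down directly from nuclear charges and coordinates. The sketch below uses the commonly cited definition (0.5 · Z_i^2.4 on the diagonal, Z_i · Z_j / |R_i − R_j| off the diagonal). The water geometry is approximate and purely illustrative, and the published definition works in atomic units, which this sketch does not enforce.

```python
import math


def coulomb_matrix(charges, coords):
    """Coulomb matrix: 0.5 * Z_i**2.4 on the diagonal, Z_i * Z_j / |R_i - R_j| elsewhere."""
    n = len(charges)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                m[i][j] = 0.5 * charges[i] ** 2.4
            else:
                m[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    return m


# Approximate, illustrative water geometry: O, H, H.
charges = [8, 1, 1]
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]

matrix = coulomb_matrix(charges, coords)

# Translating the whole molecule leaves every entry unchanged,
# because only interatomic distances enter the matrix.
shifted = [(x + 1.0, y + 2.0, z + 3.0) for x, y, z in coords]
shifted_matrix = coulomb_matrix(charges, shifted)
unchanged = all(
    math.isclose(matrix[i][j], shifted_matrix[i][j], rel_tol=1e-9)
    for i in range(3)
    for j in range(3)
)
print(unchanged)  # True
```

The matrix still depends on atom ordering. In practice this is handled by sorting rows and columns or by using the eigenvalue spectrum instead of the raw matrix.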
Neural network embeddings are learned continuous representations. The key architectural choices follow the input modality: Transformers for strings, GNNs for graphs, equivariant networks for 3D coordinates. Equivariant GNNs like NequIP and MACE are particularly powerful because they build in physical symmetries (rotational, translational, permutational) by design.
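The difference between invariant and equivariant features can be checked numerically. The sketch below is not NequIP or MACE. It uses two hand-written features, with function names chosen for this example, to show the property those architectures guarantee by construction: invariant outputs do not change under rotation, and equivariant outputs rotate along with the input.

```python
import math


def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def centroid_displacements(coords):
    """Vector from the centroid to each atom: a simple rotation-equivariant feature."""
    n = len(coords)
    center = tuple(sum(p[k] for p in coords) / n for k in range(3))
    return [tuple(p[k] - center[k] for k in range(3)) for p in coords]


def pairwise_distances(coords):
    """Interatomic distances: a rotation-invariant feature."""
    return [math.dist(a, b) for i, a in enumerate(coords) for b in coords[i + 1:]]


# Approximate, illustrative three-atom geometry.
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
angle = 0.7
rotated = [rotate_z(p, angle) for p in coords]

# Equivariance: featurize-then-rotate equals rotate-then-featurize.
feat_then_rot = [rotate_z(v, angle) for v in centroid_displacements(coords)]
rot_then_feat = centroid_displacements(rotated)
equivariant = all(
    math.isclose(a, b, abs_tol=1e-12)
    for u, v in zip(feat_then_rot, rot_then_feat)
    for a, b in zip(u, v)
)

# Invariance: distances are identical before and after rotation.
invariant = all(
    math.isclose(a, b, rel_tol=1e-9)
    for a, b in zip(pairwise_distances(coords), pairwise_distances(rotated))
)
print(equivariant, invariant)  # True True
```

Scalar targets such as energy call for invariant outputs, while vector targets such as forces call for equivariant ones.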
Self-supervised learning (SSL) addresses the chronic shortage of labelled chemical data. Models like ChemBERTa pre-train on millions of unlabelled SMILES using masked-token objectives, then fine-tune for downstream tasks. Contrastive learning adds another axis: pulling together representations of the same molecule expressed differently, and pushing apart distinct molecules.
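The data-preparation side of a masked-token objective can be sketched in plain Python. The tokenizer is a simplified regular expression rather than ChemBERTa's actual tokenizer, and the 15% masking rate is the conventional BERT-style default, assumed here for illustration. The model's job during pre-training would be to predict the tokens stored in `targets`.

```python
import random
import re

# Simplified SMILES tokenizer: bracket atoms, two-letter halogens, then single characters.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into tokens (simplified; not a full SMILES grammar)."""
    return TOKEN_PATTERN.findall(smiles)


def mask_tokens(tokens, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """Return (masked_tokens, targets), where targets maps position -> original token."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, targets


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
tokens = tokenize(aspirin)
masked, targets = mask_tokens(tokens)

print(len(tokens), len(targets))  # 21 3
print("".join(masked))
```

No labels are needed: the training signal comes entirely from the molecule's own string, which is what makes the approach attractive when annotated data is scarce.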
Generative models close the loop: rather than encoding known molecules, they sample novel structures from learned distributions. VAEs, diffusion models, and autoregressive language models have all been applied here, with 3D equivariant diffusion models (e.g., EDM) generating molecules directly in three dimensions, a capability that matters for structure-based drug design.
Hybrid Representations: The LLM Turn
LLMs like GPT-4, Gemini, Claude, and DeepSeek exemplify the hybrid paradigm: they convert discrete tokens into continuous embeddings, reason in that latent space, then decode back to discrete text. This discrete→continuous→discrete pipeline lets a single model operate over SMILES strings, natural language, and executable code simultaneously.
Key capabilities in chemistry:
- Reaction prediction and retrosynthesis planning
- Molecule captioning and property prediction
- Translating natural language into executable lab protocols (e.g., XDL format)
- Text-guided inverse molecular design
Retrieval-Augmented Generation (RAG) extends LLMs beyond their training cutoff by equipping them with queryable vector databases of chemical literature, reactions, and properties — yielding measurable gains on tasks like reaction condition prediction.
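The retrieval step can be sketched as follows, under loud assumptions: the "embedding" is a toy bag-of-words count rather than a learned encoder, the three documents are hypothetical placeholder snippets rather than entries from a real database, and the prompt format is invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts. A real system would use a learned encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Prepend the retrieved context to the question before it is sent to an LLM."""
    context = "\n".join("- " + d for d in retrieve(query, documents, k))
    return "Context:\n" + context + "\n\nQuestion: " + query


# Hypothetical placeholder snippets standing in for a chemical knowledge base.
DOCUMENTS = [
    "Suzuki coupling typically uses a palladium catalyst and a base.",
    "Fischer esterification uses an acid catalyst with an alcohol and a carboxylic acid.",
    "Amide coupling often uses a coupling reagent such as HATU with an amine base.",
]

query = "What catalyst is used for a Suzuki coupling?"
print(build_prompt(query, DOCUMENTS))
```

The language model never needs to have memorized the retrieved fact. It only needs to read the context it is handed, which is how retrieval extends a model beyond its training cutoff.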
LLM-based agents go further still, using tool calls (RDKit, AutoDock, quantum chemistry codes) to act on the physical and digital world. Systems like ChemCrow (integrated with 18 expert tools) and Coscientist have demonstrated autonomous synthesis planning and execution. The Model Context Protocol (MCP) is emerging as a standardized interface connecting agents to domain-specific tools.
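At its core, tool calling is a dispatch step: the model emits a structured request, and a runtime maps it to a function and returns the result. The sketch below is a hypothetical illustration of that pattern. The tool names, the JSON shape, and the `dispatch` function are all assumptions made for this example. It does not reproduce ChemCrow's or Coscientist's implementation, and it is not the MCP message format.

```python
import json

# Standard atomic masses in g/mol, rounded to three decimals.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molecular_weight(composition):
    """Hypothetical tool: molar mass in g/mol from an element -> count mapping."""
    return sum(ATOMIC_MASS[el] * n for el, n in composition.items())


def heavy_atom_count(composition):
    """Hypothetical tool: number of non-hydrogen atoms."""
    return sum(n for el, n in composition.items() if el != "H")


# Registry the agent runtime exposes to the model.
TOOLS = {"molecular_weight": molecular_weight, "heavy_atom_count": heavy_atom_count}


def dispatch(tool_call_json):
    """Execute one tool call of the form {"tool": name, "arguments": {...}}."""
    call = json.loads(tool_call_json)
    name = call["tool"]
    if name not in TOOLS:
        return {"error": "unknown tool: " + name}
    return {"tool": name, "result": TOOLS[name](**call["arguments"])}


# A request of the kind an LLM might emit for ethanol (C2H6O).
request = '{"tool": "molecular_weight", "arguments": {"composition": {"C": 2, "H": 6, "O": 1}}}'
response = dispatch(request)
print(response["tool"], round(response["result"], 3))  # molecular_weight 46.069
```

In a real agent, the result would be fed back into the model's context so that it can decide on the next call. A standardized protocol lets the same tools be reused across different agents.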
The Big Picture
The authors argue that representations exist on a spectrum between human interpretability and machine utility. No single representation dominates across all tasks:
- Discrete representations excel at transparency and rule-based reasoning.
- Continuous representations enable gradient-based learning and capture 3D geometry and physical symmetries more directly.
- Hybrid LLM systems bridge both, acting as universal connectors across representation types.
The frontier, they suggest, is co-designing representation ecosystems where humans and AI agents collaborate fluidly, moving from representations built for machines or for humans toward representations built with both.
Citation
Zhang, Z., Bai, J., Nakamura, Y., Wang, A., Leong, S. X., Zhang, S., Chen, P., Lo, A., Müller, M., Tom, G., Huang, M., Mantilla, L., Kang, Y., Bernales, V., & Aspuru-Guzik, A. (2026). Molecular Knowledge Representations in the Era of Artificial Intelligence. ChemRxiv. https://doi.org/10.26434/chemrxiv.15002830/v1