Choosing the Right PDB Structure: A Systematic Guide for Docking and MD Simulations

Every computational drug discovery pipeline rests on a deceptively simple foundation: a set of atomic coordinates. Before you run a single docking calculation or spin up a molecular dynamics simulation, you have to make a judgment call about which PDB structure to trust. Get that call wrong and no amount of downstream analysis will save you.

The PDB now hosts over 220,000 experimentally determined structures. That abundance is a gift, but it is also a minefield. This post walks through the decisions that actually matter when selecting a structure, the traps that catch even experienced practitioners, and what the looming format transition means for your workflow.

Why You Cannot Outsource This Decision to an LLM

This one is worth addressing head-on. It is tempting to ask a language model for the “best PDB structure for protein X.” Do not do this, or at least do not act on the answer without verification.

The problem is structural. A good structure selection depends on live, highly specific metrics: resolution (ideally below 2.5 Å), R-free values, Ramachandran outlier percentages, chain completeness in the region you care about, ligand occupancy, and whether surface residues were mutated to force crystallization. A language model on its own does not retrieve or compute these values at inference time. It surfaces whichever PDB IDs were most frequently cited in its training corpus, which skews toward older, heavily cited structures that may have been superseded by better data deposited last year or last month.

The other failure mode is context blindness. Structure selection is a physical choice that depends on your specific question: Are you studying apo or holo state? Wild-type or a disease-associated variant? A monomeric fragment or the full biological assembly? No general-purpose model can answer these questions without knowing your system in detail. Use LLMs to help parse PDB metadata or draft analysis scripts. Do not let them pick your starting coordinates.

Searching Systematically: Getting Past the Noise

Searching for a protein by common name on RCSB is a reliable way to produce an overwhelming, poorly filtered list. A few habits eliminate most of the noise immediately.

Use UniProt IDs, not protein names. Every protein of interest has a UniProt accession specific to both the protein and the organism you are working with. Searching by accession guarantees you are looking at the exact sequence of interest and filters out homologous structures from other species that may have crystallized more readily but differ in residues you care about.

Filter by your target organism. Many proteins are crystallized from other species because they are more stable in the lab or easier to express. Always confirm the structure matches the organism relevant to your study. A structure solved in a different species is not automatically wrong to use, but that decision needs to be made deliberately, not by accident.
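The two search habits above can be combined into a single programmatic query. The following is a minimal sketch that only builds the JSON payload and does not send it. The endpoint URL and the attribute names are assumptions based on the RCSB Search API and should be checked against its current documentation; the accession used in the usage note is a placeholder, and the 2.5 Å cutoff is simply the guideline used elsewhere in this post.

```python
import json

# Assumed endpoint for the RCSB Search API (v2); verify before use.
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"

# Assumed attribute prefix for reference sequence identifiers; verify before use.
REF_SEQ = "rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers"


def terminal(attribute, operator, value):
    """Build one attribute-based search node for the "text" service."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }


def build_query(uniprot_accession, organism, max_resolution=2.5):
    """Return a JSON-serialisable query: accession AND organism AND resolution."""
    nodes = [
        terminal(REF_SEQ + ".database_name", "exact_match", "UniProt"),
        terminal(REF_SEQ + ".database_accession", "exact_match", uniprot_accession),
        terminal("rcsb_entity_source_organism.ncbi_scientific_name",
                 "exact_match", organism),
        terminal("rcsb_entry_info.resolution_combined",
                 "less_or_equal", max_resolution),
    ]
    return {
        "query": {"type": "group", "logical_operator": "and", "nodes": nodes},
        "return_type": "entry",
    }


if __name__ == "__main__":
    # "P12345" is a placeholder accession; substitute the one for your target.
    payload = build_query("P12345", "Homo sapiens")
    print(json.dumps(payload, indent=2))
```

The printed payload could then be sent as the body of a POST request to the search endpoint with whatever HTTP client you already use. Keeping query construction separate from the network call makes the filter logic easy to review and to store alongside the project.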

Audit engineered mutations. Crystallographers routinely mutate surface-exposed residues to reduce conformational heterogeneity and encourage lattice contacts. These mutations are disclosed in the PDB entry, but you have to look for them. Always inspect the sequence viewer to confirm the active site and any allosteric regions of interest are wild-type relative to your target.
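The mutation audit can be scripted once you have the reference sequence and the sequence of the deposited construct. This is a minimal sketch, assuming both sequences are already aligned, equal in length, and written in one-letter code; the function names and the example sequences are hypothetical and not taken from any real entry.

```python
def find_mutations(reference, observed, start=1):
    """List (position, reference_residue, observed_residue) mismatches.

    Assumes the two sequences are pre-aligned and equal in length.
    A '-' in `observed` marks an unmodelled residue and is skipped here
    rather than reported as a mutation.
    """
    if len(reference) != len(observed):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (start + i, ref, obs)
        for i, (ref, obs) in enumerate(zip(reference, observed))
        if obs != "-" and ref != obs
    ]


def mutations_in_site(mutations, site_positions):
    """Keep only the mutations that fall inside the residues of interest."""
    site = set(site_positions)
    return [m for m in mutations if m[0] in site]


if __name__ == "__main__":
    # Hypothetical ten-residue example, numbered from 1.
    wild_type = "MKTAYIAKQR"
    construct = "MKTAAIAKQA"
    found = find_mutations(wild_type, construct)
    print(found)
    print(mutations_in_site(found, [4, 5, 6]))
```

A mismatch far from the binding site may be harmless, so the second function narrows the report to the positions you actually care about. Real constructs often need a proper sequence alignment first, which this sketch deliberately leaves out.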

Check loop completeness. A structure listed as covering your residue range may have gaps at exactly the flexible regions that matter most. When the electron density for a mobile loop is too weak to interpret, the loop is simply left out of the model. If your docking site includes one of those gaps, you will need to either model it in using loop refinement tools or find a different structure.
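A quick way to spot candidate gaps is to look for breaks in the residue numbering of the modelled chain. The sketch below assumes you have already extracted the residue numbers for one chain; the function names are hypothetical. A jump in numbering can also come from the authors' numbering scheme or from insertion codes, so treat each hit as something to confirm against the deposited sequence rather than as proof of a missing loop.

```python
def find_gaps(residue_numbers):
    """Return (first_missing, last_missing) ranges between modelled residues."""
    ordered = sorted(set(residue_numbers))
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt - prev > 1:
            gaps.append((prev + 1, nxt - 1))
    return gaps


def gaps_overlapping(gaps, site_start, site_end):
    """Keep only the gaps that overlap the residue range of interest."""
    return [(a, b) for a, b in gaps if a <= site_end and b >= site_start]


if __name__ == "__main__":
    # Hypothetical chain with residues 4-6 and 9-11 unmodelled.
    modelled = [1, 2, 3, 7, 8, 12]
    gaps = find_gaps(modelled)
    print(gaps)
    print(gaps_overlapping(gaps, 5, 8))
```

Only the gaps that overlap the docking site decide whether the structure is usable as it stands; gaps elsewhere may still matter for MD, where every missing segment has to be rebuilt or capped.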

Figure: A decision flowchart for systematically selecting the right PDB structure. Each step catches a distinct class of error.

X-ray vs. NMR: A Practical Decision Framework

For standard docking and MD setups, X-ray crystallography is the default choice, and the reasoning is structural rather than arbitrary.

Most docking algorithms treat the receptor as rigid, or nearly so. X-ray crystallography suits that assumption: it delivers a single model, averaged over time and over the molecules in the crystal, with well-defined coordinates for every resolved atom. At high resolution the electron density is clear enough that you can have genuine confidence in side-chain rotamers in the binding site, which is exactly what scoring functions depend on.

NMR, by contrast, gives you an ensemble. A typical NMR PDB file contains 20 to 30 models, each a slightly different solution-state conformation. Choosing one model for docking introduces an arbitrary variable that is hard to justify or control. You can get around this with ensemble docking protocols, but that adds substantial workflow complexity and is rarely worth it when a crystal structure is available.

There is one important exception: intrinsically disordered proteins. For targets like alpha-synuclein or tau, X-ray crystallography is not an option in the physiologically relevant monomeric form because these proteins do not adopt a stable fold and cannot form ordered crystal lattices. For IDPs, NMR captures the genuine dynamic reality of the system. Cryo-EM becomes relevant when you are studying fibril aggregates associated with disease, where the ordered repeating structure can be resolved to near-atomic resolution.

The practical rule: use X-ray by default, know when NMR is your only option, and treat the two cases as genuinely different experimental problems rather than interchangeable inputs.

Biological Assembly vs. Asymmetric Unit

This distinction trips up more users than almost any other step in structure preparation, and the consequences can be severe.

The asymmetric unit is the minimal mathematical unit from which the crystal lattice can be generated by symmetry operations. It is a crystallographic convenience, not a biological statement. Depending on the crystal packing, it might contain half of a functional dimer, four monomers wedged together in a non-physiological arrangement, or a single chain that happens to be the correct functional unit. You cannot know in advance without checking.

The biological assembly is the functional form of the protein as it is believed to exist under physiological conditions. The assemblies RCSB lists are derived from crystallographic symmetry, thermodynamic analysis of the interfaces, and author annotations. If your protein is a constitutive dimer and you run MD on a monomer extracted from the asymmetric unit, you are exposing large hydrophobic dimer interfaces to solvent. The simulation will not crash, but the exposed interface can drive an artificial rearrangement or collapse that consumes simulation time and has nothing to do with the biology you are trying to model.

Figure: Left: the asymmetric unit contains only one chain of a functional dimer, leaving a hydrophobic face exposed to solvent. Right: the biological assembly buries the interface correctly.

Always download and inspect the biological assembly first. Verify it matches the oligomeric state documented in the literature for your target. Only deviate from this when you have a specific scientific justification.

The Format Transition You Cannot Ignore

The .pdb file format was designed in the 1970s for punch cards. Its architecture reflects that origin: fixed 80-column records, atom serial numbers capped at 99,999, and single-character chain identifiers, which allow at most 62 chains (26 uppercase letters, 26 lowercase letters, and 10 digits). For decades these constraints were invisible because most solved structures were small enough to fit comfortably inside them. Cryo-EM changed that.
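The limits are easy to test for before you commit to a format. The following is a minimal sketch, assuming you already know the atom count and chain identifiers of your system; the function name and the example values are hypothetical.

```python
import string

# Limits of the legacy format as described above.
MAX_ATOMS = 99_999
CHAIN_ID_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits


def legacy_pdb_problems(n_atoms, chain_ids):
    """Return the reasons a structure cannot be written as a legacy .pdb file."""
    problems = []
    if n_atoms > MAX_ATOMS:
        problems.append(f"{n_atoms} atoms exceeds the limit of {MAX_ATOMS}")
    unique_chains = set(chain_ids)
    if len(unique_chains) > len(CHAIN_ID_ALPHABET):
        problems.append(
            f"{len(unique_chains)} chains exceeds the limit of {len(CHAIN_ID_ALPHABET)}"
        )
    multi_char = sorted(c for c in unique_chains if len(c) != 1)
    if multi_char:
        problems.append(f"multi-character chain IDs: {', '.join(multi_char)}")
    return problems


if __name__ == "__main__":
    print(len(CHAIN_ID_ALPHABET))                      # size of the chain ID alphabet
    print(legacy_pdb_problems(5_000, ["A", "B"]))      # small system
    print(legacy_pdb_problems(250_000, ["A", "AA"]))   # large hypothetical assembly
```

An empty list means the structure fits the legacy constraints. Any entry in the list is a reason to work from the mmCIF file instead.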

As cryo-EM enabled the determination of structures for entire viral capsids, ribosomes, and large macromolecular assemblies, the .pdb format simply broke. Structures with hundreds of thousands of atoms or more than 62 chains cannot be represented faithfully in the legacy format. The Worldwide Protein Data Bank responded by making PDBx/mmCIF (.cif) the standard archive format and phasing out legacy .pdb.

The mmCIF format abandons fixed column widths in favour of a dictionary-driven, key-value tokenized structure with no limits on atom count or chain number. It encodes data relationships explicitly, which makes it far easier for software to parse complex structural metadata without brittle column-offset logic.
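To make the contrast with column offsets concrete, here is a toy reader for a single loop_ block. It is a sketch for illustration only: it assumes one value row per line and simple quoting, which real mmCIF files do not guarantee, and the sample text is a hypothetical fragment rather than an excerpt from a real entry. For actual work, use an established parser such as gemmi or Biopython's MMCIFParser.

```python
import shlex

# Hypothetical fragment in the style of an mmCIF atom_site loop.
SAMPLE = """\
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.Cartn_x
ATOM 1 N MET A 12.345
ATOM 2 CA MET A 13.100
ATOM 3 "O5'" A B 9.870
#
"""


def parse_loop(text):
    """Parse the first simple loop_ block into a list of {key: value} rows."""
    keys, rows = [], []
    in_loop = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "loop_":
            in_loop = True
        elif not in_loop or not line:
            continue
        elif line.startswith("#"):
            break  # '#' on its own line ends the block in this toy format
        elif line.startswith("_") and not rows:
            keys.append(line)  # column names come before any value row
        else:
            # Values are whitespace-separated tokens; shlex honours the quotes.
            rows.append(dict(zip(keys, shlex.split(line))))
    return rows


if __name__ == "__main__":
    for row in parse_loop(SAMPLE):
        print(row["_atom_site.label_atom_id"], row["_atom_site.Cartn_x"])
```

Every value is looked up by its key, so adding a column or widening a field does not shift anything. That is the property that makes the format robust where fixed-width parsing is brittle.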

The ID system is also changing. The familiar 4-character alphanumeric IDs are running out of combinations. The new standard uses a 12-character format: the prefix pdb_ followed by 8 lowercase alphanumeric characters, so a legacy ID like 2V5Z becomes pdb_00002v5z. By mid-2027, new depositions will receive only the extended IDs, and .pdb files will no longer be generated for them.
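The mapping from legacy to extended IDs described above is mechanical: lowercase the ID, left-pad it with zeros to 8 characters, and add the prefix. A minimal sketch follows, with a hypothetical function name; it validates only the shape of the ID, not whether the entry exists.

```python
import re

LEGACY_ID = re.compile(r"[A-Za-z0-9]{4}")
EXTENDED_ID = re.compile(r"pdb_[0-9a-z]{8}")


def to_extended_id(pdb_id):
    """Convert a 4-character PDB ID to the 12-character extended form."""
    pdb_id = pdb_id.strip()
    if EXTENDED_ID.fullmatch(pdb_id.lower()):
        return pdb_id.lower()  # already in the extended form
    if not LEGACY_ID.fullmatch(pdb_id):
        raise ValueError(f"not a recognisable PDB ID: {pdb_id!r}")
    return "pdb_" + pdb_id.lower().zfill(8)


if __name__ == "__main__":
    print(to_extended_id("2V5Z"))  # pdb_00002v5z
```

Normalising IDs at the point where they enter a pipeline means downstream code only ever has to handle one form.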

Figure: The legacy .pdb format reaches end-of-life for new depositions by mid-2027. The transition to mmCIF and 12-character IDs is already underway.

What this means for your workflow right now:

Stop downloading .pdb files for new projects. Fetch .cif files and build your pipelines around them.
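Update PyMOL, ChimeraX, GROMACS, AutoDock Vina, and any other tools in your stack to current versions, and check how each one handles mmCIF. Current releases of the major viewers read .cif natively; other tools may need a conversion step during preparation (AutoDock Vina, for example, takes PDBQT input), so confirm the path for each tool before you depend on it.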
If you are maintaining legacy scripts that parse .pdb files with column offsets, plan their replacement. They will break on new depositions.

The transition is not disruptive if you address it proactively. It becomes a serious problem if you are still writing .pdb-dependent code in 2027.

A Checklist for Structure Selection

To condense the decision process into something repeatable:

Search by UniProt accession for your target protein and organism, not by protein name.
Confirm the structure matches the organism and sequence context relevant to your study.
Prefer X-ray structures; use NMR only when crystallography is unavailable or inappropriate.
Target resolution below 2.5 Å; inspect R-free and Ramachandran statistics.
Scan for engineered mutations, particularly in and around the binding site.
Verify loop completeness in the regions you intend to dock into.
Download the biological assembly, not the asymmetric unit.
Download in .cif format and confirm your downstream tools can parse it.
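Parts of this checklist can be encoded as a first-pass filter over candidates you have already inspected. The following is a sketch only. The 2.5 Å resolution cutoff comes from the checklist, while the R-free and Ramachandran thresholds are illustrative assumptions, not recommendations from this post. The class, function names, and candidate labels are hypothetical, and the two boolean fields stand for checks you performed by hand.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    label: str                 # hypothetical label, not a real PDB ID
    method: str                # e.g. "X-RAY DIFFRACTION" or "SOLUTION NMR"
    resolution: float          # in angstroms
    r_free: float
    rama_outliers_pct: float
    site_is_wild_type: bool    # no engineered mutations in or near the site
    site_is_complete: bool     # no unmodelled residues in the site


def checklist_failures(c, max_resolution=2.5, max_r_free=0.30, max_rama_pct=1.0):
    """Return the checklist items a candidate fails (empty list means it passes).

    max_r_free and max_rama_pct are assumed placeholder thresholds.
    """
    failures = []
    if c.method != "X-RAY DIFFRACTION":
        failures.append("not an X-ray structure")
    if c.resolution >= max_resolution:
        failures.append("resolution not below cutoff")
    if c.r_free > max_r_free:
        failures.append("R-free above threshold")
    if c.rama_outliers_pct > max_rama_pct:
        failures.append("too many Ramachandran outliers")
    if not c.site_is_wild_type:
        failures.append("engineered mutation near the site")
    if not c.site_is_complete:
        failures.append("missing residues in the site")
    return failures


def shortlist(candidates):
    """Keep candidates that pass every check, best resolution first."""
    passing = [c for c in candidates if not checklist_failures(c)]
    return sorted(passing, key=lambda c: c.resolution)


if __name__ == "__main__":
    pool = [
        Candidate("cand_a", "X-RAY DIFFRACTION", 2.1, 0.24, 0.3, True, True),
        Candidate("cand_b", "X-RAY DIFFRACTION", 1.5, 0.21, 0.1, False, True),
        Candidate("cand_c", "X-RAY DIFFRACTION", 3.1, 0.28, 0.8, True, True),
        Candidate("cand_d", "X-RAY DIFFRACTION", 1.8, 0.22, 0.2, True, True),
    ]
    for c in shortlist(pool):
        print(c.label, c.resolution)
```

A filter like this narrows the list and records why each candidate was rejected. It does not replace looking at the structure: NMR-only targets, for example, would need the method check relaxed deliberately.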

Structure selection is one of the few steps in a computational pipeline where careful human judgment consistently outperforms automation. Spending an extra hour at this stage is almost always cheaper than interpreting results generated from a flawed starting point.