How to build and validate 3D-QSAR models - Insights from Xu et al., 2020

Jan 13, 2024·
Yassir Boulaamane
Yassir Boulaamane
· 3 min read

Dataset Preparation

A high-quality dataset is the cornerstone of any QSAR study.

  1. Collect experimental bioactivity data — typically IC₅₀ values from reliable sources.
  2. Normalize the activity scale — convert IC₅₀ to pIC₅₀:
    $$ \text{pIC}_{50} = -\log_{10}(\text{IC}_{50} \, \text{in molar}) $$
  3. Data partitioning:
    • Training set: ~75–80% of compounds for model development.
    • Test set: ~20–25% of compounds for external validation.

A balanced chemical space between sets is critical to avoid extrapolation during prediction.


Molecular Docking & Alignment

Accurate alignment of molecules is pivotal in 3D-QSAR since descriptor calculation is spatially dependent.

  • Protein structure selection: Obtain the target’s crystallographic structure from the Protein Data Bank (PDB).
  • Docking:
    • Use software such as SYBYL, AutoDock, or other docking engines.
    • Perform redocking to verify reliability — RMSD ≤ 2.0 Å is generally considered acceptable.
  • Alignment strategy: Select the best-scoring pose as the reference for aligning all ligands.

Descriptor Calculation

Two classical methods dominate 3D-QSAR descriptor generation:

Comparative Molecular Field Analysis (CoMFA)

  • Fields: Steric + Electrostatic
  • Grid spacing: 2.0 Å
  • Probe: sp³ carbon atom with +1 charge
  • Energy cutoff: 30 kcal/mol

Comparative Molecular Similarity Indices Analysis (CoMSIA)

  • Fields: Steric, Electrostatic, Hydrophobic, H-bond Donor, H-bond Acceptor
  • Attenuation factor: 0.3 — controls the exponential distance dependence.

Model Building with Partial Least Squares (PLS)

PLS regression is the workhorse for 3D-QSAR, capable of handling collinear and noisy descriptors.

  1. Internal validation:
    • Perform Leave-One-Out Cross-Validation (LOO-CV).
    • Record:
      • : cross-validated \( R^2 \), internal predictive power.
      • ONC: Optimal Number of Components.
  2. Final model construction (using ONC):
    • : Goodness-of-fit to training set.
    • SEE: Standard Error of Estimate.
    • F-statistic: Statistical significance of the regression.

Model Validation

Validation ensures the model is predictive, robust, and not the result of chance correlations.

External Validation

  • Predict pIC₅₀ for the test set.
  • Calculate r²ₚᵣₑd for predictive performance.

Tropsha’s Criteria

A set of diagnostic metrics assessing external predictivity:

  • \( r^2_0 \), \( k \), \( k' \), \( r_m^2 \), \( \Delta r_m^2 \)
  • Good predictive models: \( k \approx 1 \), \( r_m^2 > 0.5 \)

Y-Randomization

  • Randomize biological activities and rebuild the model.
  • A valid QSAR will show low q² and r² for randomized trials.

Contour Map Interpretation

CoMFA and CoMSIA produce contour maps that visually indicate where modifications may enhance or reduce activity.

  • Steric maps: Green = bulk-favorable; Yellow = bulk-unfavorable.
  • Electrostatic maps: Blue = electropositive-favorable; Red = electronegative-favorable.
  • Hydrophobic & H-bond maps: Guide lipophilicity and hydrogen bonding optimization.

These maps serve as structure–activity roadmaps for rational ligand design.


Summary of Key Metrics

Metric Purpose
Internal predictivity (LOO-CV)
Fit to training data
r²ₚᵣₑd External predictive power
SEE Estimate of prediction error
F-statistic Significance of model
rₘ², Δrₘ² Robustness (Tropsha criteria)
RMSD Docking pose validation
Y-randomization Protection against chance correlation

Final Note:
A robust 3D-QSAR workflow is not solely about generating good statistical values — it is about building interpretable models that reliably guide the design of novel, potent ligands.


References

  • Xu, Y., He, Z., Liu, H., Chen, Y., Gao, Y., Zhang, S., … & Yang, M. (2020). 3D-QSAR, molecular docking, and molecular dynamics simulation study of thieno [3, 2-b] pyrrole-5-carboxamide derivatives as LSD1 inhibitors. RSC advances, 10(12), 6927-6943.