How to build and validate 3D-QSAR models - Insights from Xu et al., 2020

Jan 13, 2024 · Yassir Boulaamane

Dataset Preparation

A high-quality dataset is the cornerstone of any QSAR study.

Collect experimental bioactivity data — typically IC₅₀ values from reliable sources.
Normalize the activity scale — convert IC₅₀ to pIC₅₀:
$$ \text{pIC}_{50} = -\log_{10}(\text{IC}_{50} \, \text{in molar}) $$
Data partitioning:
- Training set: ~75–80% of compounds for model development.
- Test set: ~20–25% of compounds for external validation.

A balanced chemical space between sets is critical to avoid extrapolation during prediction.
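The conversion and the split can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by Xu et al.: the IC₅₀ values are hypothetical, the helper names `pic50_from_nm` and `activity_ranked_split` are made up for this example, and the split only balances the activity range (every fourth compound by rank goes to the test set). Balancing chemical space would additionally require structural clustering.

```python
import math

def pic50_from_nm(ic50_nm):
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nm * 1e-9)

def activity_ranked_split(pic50_values, every=4):
    """Rank compounds by pIC50 and send every `every`-th one to the test set.

    With every=4 this gives roughly a 75/25 split that spans the activity range.
    Returns (train_indices, test_indices) into the original list.
    """
    order = sorted(range(len(pic50_values)), key=lambda i: pic50_values[i])
    test = order[every // 2::every]
    held_out = set(test)
    train = [i for i in order if i not in held_out]
    return train, test

# Hypothetical IC50 values in nM, for illustration only.
ic50_nm = [12.0, 250.0, 3.5, 1000.0, 48.0, 7.9, 620.0, 95.0]
pic50 = [pic50_from_nm(v) for v in ic50_nm]
train_idx, test_idx = activity_ranked_split(pic50)
print("train:", train_idx, "test:", test_idx)
```

An IC₅₀ of 1000 nM (1 µM) corresponds to a pIC₅₀ of 6, which is a quick sanity check for the unit handling.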

Molecular Docking & Alignment

Accurate alignment of molecules is pivotal in 3D-QSAR since descriptor calculation is spatially dependent.

Protein structure selection: Obtain the target’s crystallographic structure from the Protein Data Bank (PDB).
Docking:
- Use software such as SYBYL, AutoDock, or other docking engines.
- Perform redocking to verify reliability — RMSD ≤ 2.0 Å is generally considered acceptable.
Alignment strategy: Select the best-scoring pose as the reference for aligning all ligands.
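The redocking check boils down to an RMSD between the crystallographic ligand and its redocked pose. The sketch below assumes both poses are already in the receptor frame (so no superposition is applied), that heavy atoms are listed in the same order, and it ignores symmetry-equivalent atoms. The function name `pose_rmsd` and the coordinates are hypothetical.

```python
import math

def pose_rmsd(reference, docked):
    """In-place RMSD (in Å) between two poses with identical atom ordering."""
    if len(reference) != len(docked):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum(
        (a - b) ** 2
        for ref_atom, dock_atom in zip(reference, docked)
        for a, b in zip(ref_atom, dock_atom)
    )
    return math.sqrt(squared / len(reference))

# Hypothetical heavy-atom coordinates (Å) of a crystal pose and its redocked pose.
crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.2, 0.0)]
redocked = [(0.3, 0.1, 0.0), (1.7, 0.2, 0.1), (2.0, 1.5, 0.2)]

rmsd = pose_rmsd(crystal, redocked)
print(f"RMSD = {rmsd:.2f} Å, acceptable: {rmsd <= 2.0}")
```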

Descriptor Calculation

Two classical methods dominate 3D-QSAR descriptor generation:

Comparative Molecular Field Analysis (CoMFA)

Fields: Steric + Electrostatic
Grid spacing: 2.0 Å
Probe: sp³ carbon atom with +1 charge
Energy cutoff: 30 kcal/mol
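To make these settings concrete, here is a rough sketch of the electrostatic half of a CoMFA field: a regular grid with 2.0 Å spacing, a +1 probe, and energies truncated at ±30 kcal/mol. Several things are assumptions for illustration: the 4 Å margin around the molecule, the constant dielectric of 1 (CoMFA implementations often use a distance-dependent one), and the function names. The steric (Lennard-Jones) field is left out because it needs force-field parameters.

```python
import itertools
import math

COULOMB_K = 332.06  # kcal·Å/(mol·e²); converts q1*q2/r into kcal/mol

def grid_axis(lo, hi, spacing, margin):
    """Grid coordinates along one axis, extended by `margin` on both sides."""
    start = lo - margin
    n_points = math.ceil((hi + margin - start) / spacing) + 1
    return [start + i * spacing for i in range(n_points)]

def electrostatic_field(coords, charges, spacing=2.0, margin=4.0,
                        cutoff=30.0, probe_charge=1.0):
    """Coulomb energy of the probe at every grid point, clipped to ±cutoff."""
    axes = [
        grid_axis(min(c[d] for c in coords), max(c[d] for c in coords),
                  spacing, margin)
        for d in range(3)
    ]
    field = {}
    for point in itertools.product(*axes):
        energy = 0.0
        for atom, q in zip(coords, charges):
            r = max(math.dist(point, atom), 1e-3)  # avoid division by zero
            energy += COULOMB_K * probe_charge * q / r
        field[point] = max(-cutoff, min(cutoff, energy))
    return field

# Hypothetical single atom carrying a partial charge of +0.1 e at the origin.
field = electrostatic_field([(0.0, 0.0, 0.0)], [0.1])
print(len(field), "grid points")
```

Each grid point becomes one descriptor column, which is why even small molecules yield hundreds of highly collinear variables.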

Comparative Molecular Similarity Indices Analysis (CoMSIA)

Fields: Steric, Electrostatic, Hydrophobic, H-bond Donor, H-bond Acceptor
Attenuation factor: 0.3, which controls how steeply the Gaussian-type distance dependence of the similarity indices decays.
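The role of the attenuation factor is easiest to see in the similarity index itself, which in the usual CoMSIA formulation is $ A(q) = -\sum_i w_{\text{probe}} \, w_i \, e^{-\alpha r_{iq}^2} $. The sketch below evaluates it for one property at one grid point. The function name, the probe value of 1, and the toy atom are assumptions for illustration.

```python
import math

def comsia_index(point, coords, atom_props, alpha=0.3, probe_value=1.0):
    """Similarity index at `point` for one property (e.g. hydrophobicity).

    Every atom contributes its property value weighted by a Gaussian of its
    distance to the grid point. The function stays finite at r = 0, so no
    energy cutoff is required.
    """
    total = 0.0
    for atom, w in zip(coords, atom_props):
        r_squared = sum((a - b) ** 2 for a, b in zip(point, atom))
        total += probe_value * w * math.exp(-alpha * r_squared)
    return -total

# Hypothetical atom at the origin with a property value of 1.
atoms, props = [(0.0, 0.0, 0.0)], [1.0]
for distance in (0.0, 1.0, 2.0, 4.0):
    value = comsia_index((distance, 0.0, 0.0), atoms, props)
    print(f"r = {distance:.0f} Å -> index = {value:.4f}")
```

With α = 0.3 the contribution of an atom has dropped to about 30% at 2 Å and is negligible by 4 Å. A larger α makes the field more local, a smaller one smooths it out.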

Model Building with Partial Least Squares (PLS)

PLS regression is the workhorse for 3D-QSAR, capable of handling collinear and noisy descriptors.

Internal validation:
- Perform Leave-One-Out Cross-Validation (LOO-CV).
- Record:
  - q²: cross-validated $ R^2 $, internal predictive power.
  - ONC: Optimal Number of Components, typically the component count that gives the highest q².
Final model construction (using ONC):
- r²: Goodness-of-fit to training set.
- SEE: Standard Error of Estimate.
- F-statistic: Statistical significance of the regression.
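The two stages above can be sketched with NumPy as follows. This is an illustrative single-response PLS (NIPALS) written from scratch, not the SYBYL implementation. The descriptor matrix is synthetic, the helper names are made up for this example, and picking the ONC as the component count with the highest q² is one common convention.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """Single-response PLS via NIPALS; returns (coef, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean            # residuals, deflated per component
    W, P, c = [], [], []
    for _ in range(n_comp):
        w = Xr.T @ yr
        w = w / np.linalg.norm(w)              # weight vector
        t = Xr @ w                             # scores
        tt = t @ t
        p = Xr.T @ t / tt                      # X loadings
        ck = yr @ t / tt                       # y loading
        Xr = Xr - np.outer(t, p)
        yr = yr - ck * t
        W.append(w); P.append(p); c.append(ck)
    W, P, c = np.column_stack(W), np.column_stack(P), np.array(c)
    coef = W @ np.linalg.solve(P.T @ W, c)     # regression vector in X space
    return coef, x_mean, y_mean

def pls1_predict(model, X):
    coef, x_mean, y_mean = model
    return (X - x_mean) @ coef + y_mean

def loo_q2(X, y, n_comp):
    """q² = 1 - PRESS / SS_total from leave-one-out cross-validation."""
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = pls1_fit(X[keep], y[keep], n_comp)
        press += (y[i] - pls1_predict(model, X[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def fit_statistics(X, y, n_comp):
    """r², SEE and F for the final model built on the full training set."""
    resid = y - pls1_predict(pls1_fit(X, y, n_comp), X)
    ss_res, ss_tot = resid @ resid, np.sum((y - y.mean()) ** 2)
    dof = len(y) - n_comp - 1
    r2 = 1.0 - ss_res / ss_tot
    see = np.sqrt(ss_res / dof)
    f_stat = (r2 / n_comp) / ((1.0 - r2) / dof)
    return r2, see, f_stat

# Synthetic stand-in for a field matrix: 30 compounds, 40 collinear columns.
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 3))
X = latent @ rng.normal(size=(3, 40)) + 0.05 * rng.normal(size=(30, 40))
y = latent @ np.array([1.0, -0.5, 0.8]) + 0.1 * rng.normal(size=30)

q2_by_comp = {a: loo_q2(X, y, a) for a in range(1, 7)}
onc = max(q2_by_comp, key=q2_by_comp.get)
r2, see, f_stat = fit_statistics(X, y, onc)
print(f"ONC={onc} q2={q2_by_comp[onc]:.3f} r2={r2:.3f} SEE={see:.3f} F={f_stat:.1f}")
```

Because each left-out compound is predicted by a model that never saw it, q² is always a stricter number than r² for the same component count.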

Model Validation

Validation ensures the model is predictive, robust, and not the result of chance correlations.

External Validation

Predict pIC₅₀ for the test set.
Calculate r²ₚᵣₑd for predictive performance.
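A minimal sketch of this calculation, using the definition common in the CoMFA literature, $ r^2_{pred} = 1 - \text{PRESS}/\text{SD} $, where PRESS is the sum of squared prediction errors on the test set and SD is the sum of squared deviations of the test activities from the training set mean. The activity values below are hypothetical.

```python
def r2_pred(y_test, y_test_pred, train_mean):
    """Predictive r² for an external test set."""
    press = sum((obs - pred) ** 2 for obs, pred in zip(y_test, y_test_pred))
    sd = sum((obs - train_mean) ** 2 for obs in y_test)
    return 1.0 - press / sd

# Hypothetical observed and predicted pIC50 values for three test compounds.
observed = [6.0, 7.0, 8.0]
predicted = [6.1, 6.9, 8.2]
score = r2_pred(observed, predicted, train_mean=7.0)
print(f"r2_pred = {score:.2f}")
```

Using the training set mean as the baseline is deliberate: the model has to beat the naive guess that was available before the test compounds were seen.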

Tropsha’s Criteria

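A set of diagnostic metrics assessing external predictivity:

$ r^2_0 $, $ k $, $ k' $, $ r_m^2 $, $ \Delta r_m^2 $
Good predictive models: $ 0.85 \le k \le 1.15 $ (or the same range for $ k' $), $ (r^2 - r^2_0)/r^2 < 0.1 $, $ r_m^2 > 0.5 $, $ \Delta r_m^2 < 0.2 $

Here $ r^2_0 $ is the coefficient of determination for the regression through the origin, and $ k $ and $ k' $ are the slopes of the two through-origin regressions (observed vs. predicted and predicted vs. observed). The $ r_m^2 $ and $ \Delta r_m^2 $ metrics were introduced by Roy and co-workers and are usually reported alongside the Golbraikh and Tropsha conditions.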
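The metrics listed above can be computed from the observed and predicted test set activities alone. The sketch below follows one widely used set of formulas. Published definitions differ in which variable goes on which axis, so treat the axis convention, the function names, and the example values as assumptions.

```python
import math

def _through_origin(y, x):
    """Slope and r0² of the regression y = k * x forced through the origin."""
    k = sum(a * b for a, b in zip(y, x)) / sum(b * b for b in x)
    mean_y = sum(y) / len(y)
    ss_res = sum((a - k * b) ** 2 for a, b in zip(y, x))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return k, 1.0 - ss_res / ss_tot

def external_metrics(observed, predicted):
    """r², r0², k, k', rm² and Δrm² for an external test set."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    sxy = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    sxx = sum((o - mo) ** 2 for o in observed)
    syy = sum((p - mp) ** 2 for p in predicted)
    r2 = sxy * sxy / (sxx * syy)                        # squared Pearson r
    k, r0 = _through_origin(observed, predicted)        # observed vs. predicted
    k_prime, r0_prime = _through_origin(predicted, observed)  # axes swapped
    rm2 = r2 * (1.0 - math.sqrt(abs(r2 - r0)))
    rm2_prime = r2 * (1.0 - math.sqrt(abs(r2 - r0_prime)))
    return {"r2": r2, "r0_2": r0, "k": k, "k_prime": k_prime,
            "rm2": rm2, "delta_rm2": abs(rm2 - rm2_prime)}

# Hypothetical observed and predicted pIC50 values for a test set.
metrics = external_metrics([6.0, 7.0, 8.0, 9.0], [6.1, 6.9, 8.2, 8.8])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```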

Y-Randomization

Randomize biological activities and rebuild the model.
A valid QSAR model will show q² and r² values for the randomized trials that are far lower than those of the original model.
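The procedure can be sketched as a simple permutation loop. To keep the example short, an ordinary least-squares fit stands in for the PLS model and only r² is tracked. In a real study the full PLS and LOO-CV pipeline would be rebuilt for every shuffle. The data and function names are hypothetical.

```python
import numpy as np

def ols_r2(X, y):
    """r² of a least-squares fit with intercept (stand-in for the PLS model)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def y_randomization(X, y, n_trials=100, seed=0):
    """Refit the model on shuffled activities and collect the r² values."""
    rng = np.random.default_rng(seed)
    return np.array([ols_r2(X, rng.permutation(y)) for _ in range(n_trials)])

# Synthetic data: 30 compounds, 3 descriptors that genuinely drive activity.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -0.5, 0.8]) + 0.1 * rng.normal(size=30)

real_r2 = ols_r2(X, y)
random_r2 = y_randomization(X, y)
print(f"real r2 = {real_r2:.3f}, "
      f"randomized mean = {random_r2.mean():.3f}, max = {random_r2.max():.3f}")
```

If the randomized models scored close to the real one, the original correlation would be suspect, since the descriptors could then fit arbitrary activity values.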

Contour Map Interpretation

CoMFA and CoMSIA produce contour maps that visually indicate where modifications may enhance or reduce activity.

Steric maps: Green = bulk-favorable; Yellow = bulk-unfavorable.
Electrostatic maps: Blue = electropositive-favorable; Red = electronegative-favorable.
Hydrophobic & H-bond maps: Guide lipophilicity and hydrogen bonding optimization.

These maps serve as structure–activity roadmaps for rational ligand design.

Summary of Key Metrics

| Metric | Purpose |
| --- | --- |
| q² | Internal predictivity (LOO-CV) |
| r² | Fit to training data |
| r²ₚᵣₑd | External predictive power |
| SEE | Standard error of the fitted training set activities |
| F-statistic | Significance of model |
| rₘ², Δrₘ² | External predictivity (reported with Tropsha's criteria) |
| RMSD | Docking pose validation |
| Y-randomization | Protection against chance correlation |

Final Note:
A robust 3D-QSAR workflow is not solely about generating good statistical values — it is about building interpretable models that reliably guide the design of novel, potent ligands.

References

Xu, Y., He, Z., Liu, H., Chen, Y., Gao, Y., Zhang, S., … & Yang, M. (2020). 3D-QSAR, molecular docking, and molecular dynamics simulation study of thieno[3,2-b]pyrrole-5-carboxamide derivatives as LSD1 inhibitors. RSC Advances, 10(12), 6927-6943.

Last updated on Jan 13, 2024