A Practical Guide to QSAR Model Validation: Internal, Cross, and External Checks

Jun 1, 2026

Building a QSAR model is only half the job. The harder question is: does it actually work? Overfitted models routinely pass internal checks while failing completely on new compounds. The OECD principles and decades of best-practice literature have converged on a three-tier validation framework that separates what a model has memorised from what it can genuinely predict.

This post walks through that framework, explains what each parameter actually measures, and gives you the threshold values you need to report in a manuscript or regulatory submission.

Bottom line up front: a valid QSAR model must pass all three tiers simultaneously. Good internal statistics with poor external performance is a red flag, not a trade-off.


The Three-Tier Validation Framework

The table below consolidates the standard parameters across internal, cross-validation, and external validation. Threshold values reflect widely accepted criteria in the literature (Golbraikh & Tropsha, Roy et al., Chirico & Gramatica).

| Internal parameter | Threshold | Cross-validation parameter | Threshold | External parameter | Threshold |
| --- | --- | --- | --- | --- | --- |
| R²tr | ≥ 0.600 | R²cv (Q²loo) | > 0.500 | RMSEex and MAEex | Lowest possible |
| R²adj | Close to R²tr | RMSEcv | Lowest possible | R²ex | > 0.600 |
| LOF | Lowest possible | MAEcv | Lowest possible | Q²-F1, Q²-F2, Q²-F3 | > 0.600 |
| CCCtr | > 0.800 | CCCcv | > 0.800 | CCCex | > 0.800 |
| RMSEtr and MAEtr | Lowest possible | Q²LMO | > 0.500 | R²-ExPy | 0.786 |
| ΔK | 0.05 | R²Yscr and Q²Yscr | Lowest possible | R'o² | 0.741 |
| s | Lowest possible | | | K' | 1 |
| F | Highest possible | | | | |

Key inequalities to remember: R²tr > Q²loo and RMSEtr < RMSEcv. Together with a low RMSEex and MAEex and a high R²ex, all of these must hold simultaneously for a reliable model.

Internal Validation: Necessary but Not Sufficient

Internal statistics describe how well your model fits the training data. R²tr must be at least 0.600, but the more diagnostic check is the gap between R²tr and R²adj. A large gap signals that some descriptors are contributing noise rather than signal and should be removed.
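To make the R²tr versus R²adj gap concrete, here is a minimal sketch in plain Python. The activity values and descriptor counts are hypothetical, chosen only to show how the same fit is penalised more heavily as descriptors are added.

```python
def r_squared(y_obs, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    mean_obs = sum(y_obs) / len(y_obs)
    rss = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    tss = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - rss / tss


def adjusted_r_squared(r2, n_compounds, n_descriptors):
    """Penalise R2 for the number of descriptors in the model."""
    return 1.0 - (1.0 - r2) * (n_compounds - 1) / (n_compounds - n_descriptors - 1)


# Hypothetical training-set activities and fitted values (illustration only).
y_obs = [5.1, 6.3, 4.8, 7.2, 6.0, 5.5, 6.8, 4.9]
y_fit = [5.3, 6.0, 5.0, 6.9, 6.2, 5.4, 6.6, 5.2]

r2_tr = r_squared(y_obs, y_fit)
# Same fit quality, but pretend it was reached with 2 versus 5 descriptors.
r2_adj_2 = adjusted_r_squared(r2_tr, n_compounds=8, n_descriptors=2)
r2_adj_5 = adjusted_r_squared(r2_tr, n_compounds=8, n_descriptors=5)

print(f"R2tr = {r2_tr:.3f}")
print(f"R2adj (2 descriptors) = {r2_adj_2:.3f}")
print(f"R2adj (5 descriptors) = {r2_adj_5:.3f}")
```

With only eight compounds, moving from two to five descriptors widens the gap noticeably even though R²tr is unchanged, which is the pattern the paragraph above warns about.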

The Lack-of-Fit (LOF) statistic penalises overfitting more aggressively than R² alone, because it grows with the number of descriptors as well as with the residual error. A concordance correlation coefficient (CCCtr) above 0.800 is particularly informative because it jointly measures precision and accuracy, punishing models that are systematically biased even if they show high correlation.
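The point about bias can be illustrated with a short sketch of Lin's concordance correlation coefficient. It uses population (1/n) variances, which is one common convention, and the data are hypothetical: the second prediction set is perfectly correlated with the observations but shifted by one log unit.

```python
def concordance_cc(y_obs, y_pred):
    """Lin's concordance correlation coefficient (population variances)."""
    n = len(y_obs)
    mean_o = sum(y_obs) / n
    mean_p = sum(y_pred) / n
    var_o = sum((o - mean_o) ** 2 for o in y_obs) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(y_obs, y_pred)) / n
    # The squared mean difference in the denominator is what punishes bias.
    return 2.0 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)


y_obs = [4.0, 5.0, 6.0, 7.0, 8.0]
y_shifted = [v + 1.0 for v in y_obs]  # Pearson r = 1, but biased by +1

ccc_exact = concordance_cc(y_obs, y_obs)
ccc_shifted = concordance_cc(y_obs, y_shifted)

print(f"CCC, perfect agreement: {ccc_exact:.3f}")   # 1.000
print(f"CCC, shifted by +1:     {ccc_shifted:.3f}")  # 0.800
```

The shifted predictions would score a perfect correlation, yet their CCC drops to 0.800 and so no longer clears the "above 0.800" bar.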

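The ΔK criterion (at least 0.05) is a lesser-known but important check. It comes from the QUIK rule: ΔK is the difference between the multivariate correlation index K computed on the descriptors plus the response (Kxy) and on the descriptors alone (Kx). A ΔK below the cut-off means that adding the response barely raises the total correlation of the variable set, which flags models built on heavily inter-correlated descriptors, a problem that R² alone obscures.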


Cross-Validation: Testing Generalisability on Training Data

Cross-validation estimates predictive ability without touching the external set. A Q²loo above 0.500 is the minimum bar; many reviewers now expect 0.600 or above for publication. Leave-many-out (Q²LMO) is more conservative and more reliable for small datasets, where single-compound removals can give optimistic results.
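As a rough illustration of how Q²loo is obtained, the sketch below refits a model with each compound left out in turn and accumulates the prediction errors (PRESS). To stay self-contained it assumes a single-descriptor linear model and hypothetical data; a real QSAR model would refit its full multi-descriptor regression in the same loop. The total sum of squares here uses the full training-set mean, which is one of the conventions in use.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r2_training(x, y):
    """R2 of the model fitted on all training compounds."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - rss / tss


def q2_loo(x, y):
    """Leave-one-out Q2 = 1 - PRESS / TSS."""
    press = 0.0
    for i in range(len(x)):
        # Refit without compound i, then predict compound i.
        intercept, slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (intercept + slope * x[i])) ** 2
    mean_y = sum(y) / len(y)
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / tss


# Hypothetical descriptor values and activities (illustration only).
x = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
y = [5.2, 5.6, 6.3, 6.5, 7.4, 7.6, 8.1, 8.9]

r2_tr = r2_training(x, y)
q2 = q2_loo(x, y)
print(f"R2tr = {r2_tr:.3f}, Q2loo = {q2:.3f}")
```

For a least-squares model, each left-out error is at least as large as the corresponding fitted residual, so Q²loo computed this way cannot exceed R²tr. The size of the drop is what matters.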

Y-scrambling is often overlooked but should be reported in every paper. If R²Yscr and Q²Yscr are not close to zero, your model is capturing artefacts in the data structure rather than a true structure-activity relationship.
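A minimal Y-scrambling sketch follows, again assuming a single-descriptor linear model and hypothetical data so that it runs with the standard library alone. It reports only the mean R²Yscr over repeated shuffles; Q²Yscr would be obtained by running a cross-validation loop on each shuffled response in the same way. The function names, the number of rounds, and the seed are illustrative choices.

```python
import random


def r2_of_line_fit(x, y):
    """R2 of a one-descriptor least-squares line (squared Pearson r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy * sxy / (sxx * syy)


def mean_scrambled_r2(x, y, n_rounds=200, seed=0):
    """Average R2 after randomly permuting the responses n_rounds times."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rounds):
        y_shuffled = list(y)
        rng.shuffle(y_shuffled)  # break the structure-activity pairing
        total += r2_of_line_fit(x, y_shuffled)
    return total / n_rounds


# Hypothetical descriptor values and activities (illustration only).
x = [1.0, 1.4, 1.9, 2.3, 2.8, 3.1, 3.6, 4.0, 4.4, 4.9, 5.3, 5.8]
y = [5.1, 5.6, 5.8, 6.4, 6.6, 7.1, 7.3, 7.9, 8.0, 8.6, 8.8, 9.4]

r2_real = r2_of_line_fit(x, y)
r2_yscr = mean_scrambled_r2(x, y)
print(f"R2 (real responses)      = {r2_real:.3f}")
print(f"R2Yscr (mean, scrambled) = {r2_yscr:.3f}")
```

With small datasets the scrambled R² does not reach exactly zero, so the useful comparison is how far it sits below the value for the real responses.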


External Validation: The Real Test

Compounds in the external set were never seen during training, which makes external metrics the gold standard. R²ex above 0.600 and CCCex above 0.800 are the headline numbers, but the Golbraikh-Tropsha criteria (R²-ExPy = 0.786 and R'o² = 0.741) add geometric checks on how closely the regression between observed and predicted values follows a line through the origin.
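One part of the through-origin check can be sketched directly: the slopes k and k' of the regression lines forced through the origin. The sketch covers only these slopes, not the R²-based quantities, whose exact definitions vary between papers. The 0.85 to 1.15 acceptance window is the range usually quoted from Golbraikh and Tropsha (2002); it is not stated in this post and is used here as an assumption. The data are hypothetical.

```python
def origin_slopes(y_obs, y_pred):
    """Slopes of the least-squares lines through the origin.

    k  : observed  ~ k  * predicted
    k' : predicted ~ k' * observed
    """
    sum_op = sum(o * p for o, p in zip(y_obs, y_pred))
    k = sum_op / sum(p * p for p in y_pred)
    k_prime = sum_op / sum(o * o for o in y_obs)
    return k, k_prime


def within_window(slope, low=0.85, high=1.15):
    """Assumed acceptance window for either slope."""
    return low <= slope <= high


# Hypothetical external-set values (illustration only).
y_obs = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.2, 5.9, 7.1, 7.8]
y_inflated = [1.3 * p for p in y_pred]  # systematic over-prediction

k, k_prime = origin_slopes(y_obs, y_pred)
k_bad, k_prime_bad = origin_slopes(y_obs, y_inflated)

print(f"k = {k:.3f}, k' = {k_prime:.3f}")
print(f"inflated predictions: k = {k_bad:.3f}, k' = {k_prime_bad:.3f}")
```

Slopes near 1 indicate that predictions are neither systematically inflated nor deflated; the inflated set falls outside the window even though its correlation with the observations is unchanged.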

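The Q²-F1 (Shi et al.), Q²-F2 (Schüürmann et al.) and Q²-F3 (Consonni et al.) metrics each use a different reference in the denominator: Q²-F1 measures external deviations from the training-set mean, Q²-F2 from the external-set mean, and Q²-F3 scales the external error against the training-set variance. Reporting all three gives reviewers and regulators a more complete picture than any single metric alone.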
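The three denominators are easier to compare side by side. The sketch below computes all three from the training responses, the external responses, and the external predictions; the function name and data are hypothetical.

```python
def q2_external(y_train, y_ext, y_ext_pred):
    """Return (Q2-F1, Q2-F2, Q2-F3) for an external prediction set."""
    n_tr, n_ext = len(y_train), len(y_ext)
    mean_tr = sum(y_train) / n_tr
    mean_ext = sum(y_ext) / n_ext
    press = sum((o - p) ** 2 for o, p in zip(y_ext, y_ext_pred))

    # F1: external deviations measured from the training-set mean.
    q2_f1 = 1.0 - press / sum((o - mean_tr) ** 2 for o in y_ext)
    # F2: external deviations measured from the external-set mean.
    q2_f2 = 1.0 - press / sum((o - mean_ext) ** 2 for o in y_ext)
    # F3: mean external error scaled by the training-set variance.
    tss_tr = sum((o - mean_tr) ** 2 for o in y_train)
    q2_f3 = 1.0 - (press / n_ext) / (tss_tr / n_tr)
    return q2_f1, q2_f2, q2_f3


# Hypothetical responses and predictions (illustration only).
y_train = [5.1, 6.3, 4.8, 7.2, 6.0, 5.5, 6.8, 4.9]
y_ext = [5.8, 6.9, 5.0, 7.5]
y_ext_pred = [5.6, 6.6, 5.3, 7.1]

q2_f1, q2_f2, q2_f3 = q2_external(y_train, y_ext, y_ext_pred)
print(f"Q2-F1 = {q2_f1:.3f}, Q2-F2 = {q2_f2:.3f}, Q2-F3 = {q2_f3:.3f}")
```

Because the external mean minimises the sum of squared deviations of the external responses, Q²-F1 is never smaller than Q²-F2. Q²-F3 does not depend on how the external responses are distributed around their own mean.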


Practical Checklist Before Submission

Before submitting a QSAR paper, run through each tier. A model that passes internal checks but fails cross-validation almost certainly overfits. A model that passes cross-validation but fails external validation likely has a biased or unrepresentative training set, or the applicability domain has not been properly defined. Only models that clear all three tiers simultaneously warrant regulatory or prospective use.
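The tier-by-tier check can be written down as a small helper. This is a sketch only: the dictionary keys are hypothetical names, the metric values are invented, and the cut-offs are the ones quoted in this post. The "lowest possible" parameters have no fixed threshold and are left out.

```python
def check_tiers(m):
    """Return pass/fail per tier for a dict of already-computed metrics."""
    report = {
        "internal": m["r2_tr"] >= 0.600 and m["ccc_tr"] > 0.800,
        "cross_validation": m["q2_loo"] > 0.500 and m["ccc_cv"] > 0.800,
        "external": (
            m["r2_ex"] > 0.600
            and m["ccc_ex"] > 0.800
            and all(q > 0.600 for q in m["q2_f"])
        ),
        # Key inequalities: training fit should beat cross-validation.
        "consistency": m["r2_tr"] > m["q2_loo"] and m["rmse_tr"] < m["rmse_cv"],
    }
    report["all_tiers"] = all(report.values())
    return report


# Invented metric values (illustration only).
balanced = {
    "r2_tr": 0.84, "ccc_tr": 0.91, "rmse_tr": 0.31,
    "q2_loo": 0.78, "ccc_cv": 0.87, "rmse_cv": 0.37,
    "r2_ex": 0.76, "ccc_ex": 0.85, "q2_f": [0.75, 0.74, 0.72],
}
overfitted = dict(balanced, r2_ex=0.41, ccc_ex=0.58, q2_f=[0.40, 0.38, 0.35])

balanced_report = check_tiers(balanced)
overfitted_report = check_tiers(overfitted)
print(balanced_report)
print(overfitted_report)
```

The second model clears the internal and cross-validation tiers yet fails the external one, which is the case the checklist treats as a sign of an unrepresentative training set or an undefined applicability domain.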

Finally, always report the full set of metrics rather than cherry-picking the ones that look best. Reviewers familiar with this framework will notice missing statistics, and the absence of Y-scrambling results in particular is a common reason for rejection at journals such as JCIM or JCTC.


References

  • Golbraikh, A.; Tropsha, A. J. Mol. Graph. Model. 2002, 20, 269-276.
  • Roy, K.; Kar, S.; Ambure, P. Chemom. Intell. Lab. Syst. 2015, 152, 18-33.
  • Chirico, N.; Gramatica, P. J. Chem. Inf. Model. 2011, 51, 2320-2335.
  • Consonni, V.; Ballabio, D.; Todeschini, R. J. Chem. Inf. Model. 2009, 49, 1669-1678.
  • OECD. Guidance Document on the Validation of (Q)SAR Models. ENV/JM/MONO(2007)2.