A Practical Guide to QSAR Model Validation: Internal, Cross, and External Checks

Jun 1, 2026

Building a QSAR model is only half the job. The harder question is: does it actually work? Overfitted models routinely pass internal checks while failing completely on new compounds. The OECD principles and decades of best-practice literature have converged on a three-tier validation framework that separates what a model has memorised from what it can genuinely predict.

This post walks through that framework, explains what each parameter actually measures, and gives you the threshold values you need to report in a manuscript or regulatory submission.

Bottom line up front: a valid QSAR model must pass all three tiers simultaneously. Good internal statistics with poor external performance is a red flag, not a trade-off.

The Three-Tier Validation Framework

The table below consolidates the standard parameters across internal, cross-validation, and external validation. Threshold values reflect widely accepted criteria in the literature (Golbraikh & Tropsha, Roy et al., Chirico & Gramatica).

Internal parameter	Threshold	Cross-validation parameter	Threshold	External parameter	Threshold
R²_tr	≥ 0.600	R²_cv (Q²_loo)	> 0.500	RMSE_ex and MAE_ex	Lowest possible
R²_adj.	Close to R²_tr	RMSE_cv	Lowest possible	R²_ex	> 0.600
LOF	Lowest possible	MAE_cv	Lowest possible	Q²-F1, Q²-F2, Q²-F3	> 0.600
CCC_tr	> 0.800	CCC_cv	> 0.800	CCC_ex	> 0.800
RMSE_tr and MAE_tr	Lowest possible	Q²_LMO	> 0.500	R²-ExPy	0.786
ΔK	0.05	R²_Yscr and Q²_Yscr	Lowest possible	R'_o²	0.741
s	Lowest possible			K'	1
F	Highest possible

Key relationships to remember: a reliable model shows all of the following at once: R²_tr > Q², RMSE_tr < RMSE_cv, low RMSE_ex and MAE_ex, and a high R²_ex.

Internal Validation: Necessary but Not Sufficient

Internal statistics describe how well your model fits the training data. R²_tr must be at least 0.600, but the more diagnostic check is the gap between R²_tr and R²_adj. Because R²_adj is corrected for the number of descriptors, a large gap signals that some descriptors are contributing noise rather than signal and should be removed.
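The gap described above can be computed directly. The following is a minimal sketch using the standard adjusted-R² formula; the function names and the example numbers (30 compounds, 3 versus 12 descriptors) are illustrative assumptions, not values from any real model.

```python
def r_squared(y_obs, y_fit):
    """Coefficient of determination for fitted values on the training set."""
    mean_obs = sum(y_obs) / len(y_obs)
    ss_res = sum((o - f) ** 2 for o, f in zip(y_obs, y_fit))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_compounds, n_descriptors):
    """R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1), with p descriptors."""
    return 1.0 - (1.0 - r2) * (n_compounds - 1) / (n_compounds - n_descriptors - 1)


# Hypothetical example: identical R2_tr, different numbers of descriptors.
r2_tr = 0.80
gap_small_model = r2_tr - adjusted_r_squared(r2_tr, n_compounds=30, n_descriptors=3)
gap_large_model = r2_tr - adjusted_r_squared(r2_tr, n_compounds=30, n_descriptors=12)
print(round(gap_small_model, 3), round(gap_large_model, 3))  # 0.023 0.141
```

With the same R²_tr, the 12-descriptor model shows a gap roughly six times larger, which is the pattern that should prompt descriptor pruning.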

The Lack-of-Fit (LOF) statistic adds a penalty for the number of descriptors, so unlike R²_tr it gets worse when extra terms do not improve the fit enough to justify themselves. The concordance correlation coefficient CCC_tr above 0.800 is particularly informative because it jointly measures precision and accuracy, punishing models that are systematically biased even if they show high correlation.
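To see why CCC catches bias that a correlation coefficient misses, here is a small sketch of Lin's concordance correlation coefficient. The function name and the toy data are assumptions made for illustration; the shifted predictions are perfectly correlated with the observations but offset by one unit.

```python
def ccc(y_obs, y_pred):
    """Concordance correlation coefficient between observed and predicted values."""
    n = len(y_obs)
    mean_obs = sum(y_obs) / n
    mean_pred = sum(y_pred) / n
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(y_obs, y_pred))
    ss_obs = sum((o - mean_obs) ** 2 for o in y_obs)
    ss_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    # The last term penalises any offset between the two means (systematic bias).
    return 2.0 * cov / (ss_obs + ss_pred + n * (mean_obs - mean_pred) ** 2)


y_obs = [4.0, 5.0, 6.0, 7.0, 8.0]
y_shifted = [v + 1.0 for v in y_obs]  # Pearson r = 1, but biased by +1

print(ccc(y_obs, y_obs))      # 1.0
print(ccc(y_obs, y_shifted))  # 0.8
```

The biased predictions would score a perfect correlation, yet their CCC of 0.8 does not clear the "> 0.800" threshold.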

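The ΔK criterion of 0.05 is a lesser-known but important check. In QSARINS-style reports ΔK is usually defined as the difference between the multivariate correlation index K of the descriptors plus the response (K_xy) and that of the descriptors alone (K_xx); a value of at least 0.05 indicates that the descriptors are more correlated with the response than with each other. Systematic over- or under-prediction, which R² obscures, is a separate issue handled by the slopes of regression lines through the origin (K' in the external column).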

Cross-Validation: Testing Generalisability on Training Data

Cross-validation estimates predictive ability without touching the external set. Q²_loo above 0.500 is the minimum bar; many reviewers now expect 0.600 or above for publication. Leave-many-out (Q²_LMO) is more conservative and more reliable for small datasets, where single-compound removals barely perturb the model and can give optimistic results.
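As a rough illustration of how Q²_loo is obtained, the sketch below refits a model with each compound left out and accumulates the prediction errors (PRESS). To stay dependency-free it assumes a one-descriptor least-squares line, which is a simplification; the data and function names are hypothetical, and some software uses the mean of each training fold rather than the overall mean in the denominator.

```python
def fit_line(x, y):
    """Ordinary least squares for a single descriptor; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def q2_loo(x, y):
    """Q2_loo = 1 - PRESS / SS_tot, refitting with each compound left out once."""
    n = len(y)
    mean_y = sum(y) / n
    press = 0.0
    for i in range(n):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_tot


def r2_train(x, y):
    """Training R2 of the line fitted on all compounds."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / sum((yi - mean_y) ** 2 for yi in y)


# Hypothetical descriptor values and activities.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
print(r2_train(x, y), q2_loo(x, y))
```

For least-squares models Q²_loo computed this way cannot exceed R²_tr, which is why R² > Q² is expected and only the size of the gap is informative.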

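Y-scrambling is often overlooked but should be reported in every paper. The response values are randomly permuted and the model is rebuilt on the scrambled data. If R²_Yscr and Q²_Yscr do not collapse (R²_Yscr close to zero, Q²_Yscr typically negative) and instead stay near the values of the real model, your model is fitting chance correlations rather than a true structure-activity relationship.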
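A Y-scrambling run can be sketched in a few lines. This example again assumes a one-descriptor linear model so that it needs only the standard library; the function names, the number of rounds, and the data are hypothetical. In a real study the full modelling procedure, including descriptor selection, would be repeated for every scrambled response.

```python
import random


def r2_one_descriptor(x, y):
    """Training R2 of a one-descriptor least-squares line (squared Pearson r)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


def mean_r2_yscr(x, y, n_rounds=200, seed=0):
    """Average R2 obtained after randomly permuting the response n_rounds times."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0.0
    for _ in range(n_rounds):
        y_perm = list(y)
        rng.shuffle(y_perm)
        total += r2_one_descriptor(x, y_perm)
    return total / n_rounds


# Hypothetical descriptor values and activities.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
print("real R2:", r2_one_descriptor(x, y))
print("mean scrambled R2:", mean_r2_yscr(x, y))
```

The scrambled average should sit far below the real R²; if the two are comparable, the apparent fit is likely a chance correlation.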

External Validation: The Real Test

Compounds in the external set were never seen during training, which makes external metrics the gold standard. R²_ex above 0.600 and CCC_ex above 0.800 are the headline numbers. The Golbraikh-Tropsha checks (listed in the table as R²-ExPy = 0.786 and R'_o² = 0.741, with a slope K' of 1) add a test of how closely the observed-versus-predicted regression follows a line through the origin with unit slope.
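The slope part of the through-origin check can be sketched as follows. The formulas are the least-squares slopes of a line forced through the origin, and the 0.85 to 1.15 acceptance window is the range commonly quoted from Golbraikh and Tropsha; the function names and data are hypothetical, and which slope carries the prime varies between papers and software.

```python
def origin_slopes(y_obs, y_pred):
    """Slopes of the two regressions through the origin.

    k:       observed  = k  * predicted
    k_prime: predicted = k' * observed
    """
    cross = sum(o * p for o, p in zip(y_obs, y_pred))
    k = cross / sum(p * p for p in y_pred)
    k_prime = cross / sum(o * o for o in y_obs)
    return k, k_prime


def slope_ok(slope, low=0.85, high=1.15):
    """Commonly quoted acceptance window for a through-origin slope."""
    return low <= slope <= high


# Hypothetical external-set activities and predictions.
y_obs = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.2, 5.9, 7.1, 7.8]
k, k_prime = origin_slopes(y_obs, y_pred)
print(k, k_prime, slope_ok(k) or slope_ok(k_prime))
```

A slope far from 1 means the model systematically stretches or compresses the activity scale, even when R²_ex looks acceptable.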

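The Q²-F1 and Q²-F2 metrics (the latter due to Schüürmann and co-workers) and the Consonni-Todeschini Q²-F3 each use a different reference term in the denominator: Q²-F1 measures the spread of the external responses around the training-set mean, Q²-F2 around the external-set mean, and Q²-F3 uses the training-set variance instead. Reporting all three gives reviewers and regulators a more complete picture than any single metric alone.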
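The difference between the three denominators is easiest to see side by side. The sketch below follows the definitions as summarised by Chirico and Gramatica; the function name, dictionary keys, and data are assumptions for illustration only.

```python
def q2_external(y_train, y_ext, y_ext_pred):
    """Return Q2-F1, Q2-F2 and Q2-F3 for an external prediction set."""
    n_tr, n_ext = len(y_train), len(y_ext)
    mean_tr = sum(y_train) / n_tr
    mean_ext = sum(y_ext) / n_ext
    press = sum((o - p) ** 2 for o, p in zip(y_ext, y_ext_pred))
    # F1: external responses measured against the training-set mean.
    ss_ext_vs_train_mean = sum((o - mean_tr) ** 2 for o in y_ext)
    # F2: external responses measured against their own mean.
    ss_ext_vs_ext_mean = sum((o - mean_ext) ** 2 for o in y_ext)
    # F3: training-set sum of squares, with both terms scaled by set size.
    ss_train = sum((o - mean_tr) ** 2 for o in y_train)
    return {
        "Q2_F1": 1.0 - press / ss_ext_vs_train_mean,
        "Q2_F2": 1.0 - press / ss_ext_vs_ext_mean,
        "Q2_F3": 1.0 - (press / n_ext) / (ss_train / n_tr),
    }


# Hypothetical activities: six training compounds, three external compounds.
y_train = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
y_ext = [5.0, 6.0, 7.0]
y_ext_pred = [5.5, 6.0, 6.5]
print(q2_external(y_train, y_ext, y_ext_pred))
```

In this toy case the same predictions score about 0.82, 0.75 and 0.94 on the three metrics, which shows how much the choice of reference can move the number when the external set is small or narrowly distributed.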

Practical Checklist Before Submission

Before submitting a QSAR paper, run through each tier. A model that passes internal checks but fails cross-validation almost certainly overfits. A model that passes cross-validation but fails external validation likely has a biased or unrepresentative training set, or the applicability domain has not been properly defined. Only models that clear all three tiers simultaneously warrant regulatory or prospective use.
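The tier-by-tier check above lends itself to a small script. This sketch encodes only the numeric thresholds from the table in this post; the metric key names, the tier labels, and the example values are hypothetical, and a missing metric is treated as a failure so that unreported statistics are flagged.

```python
import operator

# Thresholds taken from the table in this post; the key names are assumptions.
THRESHOLDS = {
    "internal": {
        "R2_tr": (operator.ge, 0.600),
        "CCC_tr": (operator.gt, 0.800),
    },
    "cross-validation": {
        "Q2_loo": (operator.gt, 0.500),
        "Q2_LMO": (operator.gt, 0.500),
        "CCC_cv": (operator.gt, 0.800),
    },
    "external": {
        "R2_ex": (operator.gt, 0.600),
        "CCC_ex": (operator.gt, 0.800),
        "Q2_F1": (operator.gt, 0.600),
        "Q2_F2": (operator.gt, 0.600),
        "Q2_F3": (operator.gt, 0.600),
    },
}


def failed_checks(metrics):
    """Map each tier to the metrics that are missing or below threshold."""
    return {
        tier: [
            name
            for name, (compare, limit) in checks.items()
            if name not in metrics or not compare(metrics[name], limit)
        ]
        for tier, checks in THRESHOLDS.items()
    }


def passes_all_tiers(metrics):
    """True only when no tier has a failed or missing metric."""
    return all(not names for names in failed_checks(metrics).values())


# Hypothetical overfitted model: strong internal fit, weak everywhere else.
overfitted = {
    "R2_tr": 0.95, "CCC_tr": 0.97,
    "Q2_loo": 0.41, "Q2_LMO": 0.38, "CCC_cv": 0.62,
    "R2_ex": 0.35, "CCC_ex": 0.50,
    "Q2_F1": 0.30, "Q2_F2": 0.28, "Q2_F3": 0.33,
}
print(failed_checks(overfitted))
print(passes_all_tiers(overfitted))  # False
```

The "lowest possible" and "highest possible" entries in the table have no fixed cut-off, so they are left out of the automated check and still need to be judged by comparing candidate models.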

Finally, always report the full set of metrics rather than cherry-picking the ones that look best. Reviewers familiar with this framework will notice missing statistics, and the absence of Y-scrambling results in particular is a common reason for rejection at journals such as JCIM or JCTC.

References

Golbraikh, A.; Tropsha, A. J. Mol. Graph. Model. 2002, 20, 269-276.
Roy, K.; Kar, S.; Ambure, P. Chemom. Intell. Lab. Syst. 2015, 152, 18-33.
Chirico, N.; Gramatica, P. J. Chem. Inf. Model. 2011, 51, 2320-2335.
Consonni, V.; Ballabio, D.; Todeschini, R. J. Chem. Inf. Model. 2009, 49, 1669-1678.
OECD. Guidance Document on the Validation of (Q)SAR Models. ENV/JM/MONO(2007)2.

Last updated on Jun 1, 2026