Interpretable Machine Learning as a Key to Understanding BBB Permeability
The blood-brain barrier (BBB) is a vital selective barrier in the central nervous system. Assessing whether a compound can cross the BBB is crucial when developing drugs that target the brain. Clinical experiments are accurate but time-consuming and costly, so computational methods offer a faster, cheaper alternative for predicting BBB permeability.
1. Downloading the dataset
The dataset used here was curated by Meng et al. (2021) and contains over 7000 compounds with 1613 chemical descriptors calculated with the Mordred descriptor calculator. Download it directly from the repository:
!wget https://github.com/theochem/B3DB/raw/87240af2b4e585d56f9681a6426af6b7f2940e96/B3DB/B3DB_classification_extended.tsv.gz
Decompress the file:
import gzip
import shutil
input_file_path = "B3DB_classification_extended.tsv.gz"
output_file_path = "B3DB_classification_extended.tsv"
with gzip.open(input_file_path, 'rb') as f_in:
    with open(output_file_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
print(f"File '{input_file_path}' has been successfully extracted to '{output_file_path}'.")
Load the dataset into a DataFrame:
import pandas as pd
df = pd.read_csv("B3DB_classification_extended.tsv", sep='\t')
df
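The separate decompression step is optional: pandas infers gzip compression from a .gz file extension, because read_csv defaults to compression="infer". A minimal sketch on a tiny stand-in file; the file name and its two rows are made up for illustration and are not part of B3DB:

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a tiny gzip-compressed, tab-separated file (toy data, not B3DB).
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy.tsv.gz")
with gzip.open(toy_path, "wt", encoding="utf-8") as f_out:
    f_out.write("compound\tBBB+/BBB-\nA\tBBB+\nB\tBBB-\n")

# compression="infer" is the default, so the .gz suffix is enough.
toy_df = pd.read_csv(toy_path, sep="\t")
print(toy_df.shape)
```

The same call should work on B3DB_classification_extended.tsv.gz directly, which avoids keeping two copies of the file on disk.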
2. Curating the dataset
Drop columns with missing values:
df = df.dropna(axis=1)
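Note that dropna(axis=1) removes a whole column as soon as it contains a single missing value, so it is worth checking how many descriptors are lost. A small sketch on a toy DataFrame (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "desc_a": [1.0, 2.0, 3.0],            # complete column
    "desc_b": [0.5, np.nan, 0.7],         # one missing value
    "desc_c": [np.nan, np.nan, np.nan],   # entirely missing
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()

# Drop every column that has at least one missing value.
cleaned = toy.dropna(axis=1)
dropped = [c for c in toy.columns if c not in cleaned.columns]
print("dropped:", dropped)
```

Here desc_b is discarded even though only one of its three values is missing. If too many descriptors disappear this way, the thresh argument of dropna or imputation may be worth considering.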
3. Labelling the dataset
Encode the BBB+/BBB- column as a binary label (1 = BBB+, 0 = BBB-):
df['labels'] = df['BBB+/BBB-'].apply(lambda x: 0 if x == 'BBB-' else 1)
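The lambda above maps every value other than 'BBB-' to 1, so a typo or an empty entry would silently become a positive label. One possible alternative, sketched on toy data with a deliberately misspelled entry, is an explicit mapping that exposes unexpected values:

```python
import pandas as pd

# Toy column; the last entry is a deliberate typo.
toy = pd.DataFrame({"BBB+/BBB-": ["BBB+", "BBB-", "BBB+", "bbb-"]})

label_map = {"BBB+": 1, "BBB-": 0}
encoded = toy["BBB+/BBB-"].map(label_map)  # values missing from the map become NaN

# Collect the raw values that could not be encoded.
unexpected = toy.loc[encoded.isna(), "BBB+/BBB-"].tolist()
print("unexpected labels:", unexpected)
```

If the list of unexpected labels is empty, the explicit mapping and the lambda give the same result.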
4. Selecting chemical descriptors
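Select the descriptor columns as features. The slice is zero-based and excludes its end, so 6:738 covers column positions 6 through 737 of the cleaned DataFrame: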
features = df.iloc[:, 6:738]
Extract the labels:
labels = df["labels"]
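Positional slicing depends on the column order that remains after cleaning. As a hedged alternative, the features can be chosen by data type and the target removed by name. The toy DataFrame below uses made-up columns and values; only TopoPSA is a descriptor name taken from this tutorial:

```python
import pandas as pd

toy = pd.DataFrame({
    "compound_name": ["A", "B", "C"],   # metadata (hypothetical column)
    "SMILES": ["CCO", "CCN", "CCC"],    # metadata (hypothetical column)
    "TopoPSA": [20.2, 26.0, 0.0],       # illustrative values only
    "nAtom": [9, 10, 11],               # illustrative values only
    "labels": [1, 1, 0],
})

# Keep numeric columns only, then remove the target so it cannot leak into X.
toy_features = toy.select_dtypes(include="number").drop(columns=["labels"])
toy_labels = toy["labels"]
print(list(toy_features.columns))
```

On the real dataset any numeric metadata column would also need to be dropped by name, so this is a starting point rather than a drop-in replacement.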
5. Building the model
Import the necessary scikit-learn modules:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
Split the data (70/30 train/test, stratified):
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=42, shuffle=True, stratify=labels
)
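Stratifying keeps the class ratio the same in both subsets, which matters when BBB+ and BBB- are not equally frequent. A quick check on synthetic labels (the 70/30 class balance here is an assumption for the demo, not the B3DB ratio):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 70 positives and 30 negatives.
y = np.array([1] * 70 + [0] * 30)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single-column feature matrix

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, shuffle=True, stratify=y
)

# With stratify=y, both fractions should match the overall positive rate.
print("positive fraction, train:", y_tr.mean())
print("positive fraction, test: ", y_te.mean())
```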
Train the Random Forest classifier:
rf = RandomForestClassifier(random_state=42)  # fixed seed so the fit is reproducible
rf.fit(X_train, y_train)
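A single train/test split yields only one performance estimate. The cross_val_score helper imported earlier can give a spread across folds. The sketch below runs on synthetic data from make_classification; the sample count, number of trees, and fold count are arbitrary choices for the demo:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the descriptor matrix and labels.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One ROC AUC value per fold; the model is refit on each training fold.
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
print("mean ROC AUC:", scores.mean(), "std:", scores.std())
```

Applied to this tutorial, X_train and y_train would take the place of X_demo and y_demo, leaving the test set untouched for the final evaluation.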
6. Evaluating the model’s performance
Generate predictions on the test set:
y_pred = rf.predict(X_test)
Print the classification report:
print(classification_report(y_test, y_pred))  # true labels first, then predictions
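Compute the ROC AUC score from the predicted probability of the positive class rather than from the hard 0/1 predictions:
y_proba = rf.predict_proba(X_test)[:, 1]  # probability of BBB+ (class 1)
roc_auc = roc_auc_score(y_test, y_proba)
print("ROC AUC Score:", roc_auc)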
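ROC AUC measures how well the model ranks positives above negatives, so it is sensitive to whether it receives continuous scores or thresholded labels. The sketch below uses the pairwise (Mann-Whitney) definition of AUC, which is the quantity roc_auc_score computes, on six made-up predictions:

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy predictions (invented): every positive is scored above every negative.
y_true = [0, 0, 0, 1, 1, 1]
proba = [0.10, 0.20, 0.30, 0.35, 0.40, 0.90]
hard = [1 if p >= 0.5 else 0 for p in proba]  # thresholding discards the ranking

auc_proba = pairwise_auc(y_true, proba)
auc_hard = pairwise_auc(y_true, hard)
print(auc_proba, auc_hard)
```

The ranking is perfect, yet the thresholded labels score much lower because two positives fall below 0.5 and tie with the negatives.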
7. Calculating the most important features
Retrieve the feature importances from the fitted model:
rf.feature_importances_
Build a ranked DataFrame of feature importances:
xfeatures = pd.DataFrame({"features": features.columns, "Imp_values": rf.feature_importances_})
Sort by descending importance:
xfeatures = xfeatures.sort_values("Imp_values", ascending=False)
xfeatures
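With several hundred descriptors, the ranked table is easier to read alongside a cumulative view, since the importances of a fitted random forest sum to 1. A sketch on a toy table; the importance values and the names other than TopoPSA are invented:

```python
import pandas as pd

# Toy importance table (values are illustrative and sum to 1).
ranked = pd.DataFrame({
    "features": ["desc_w", "TopoPSA", "desc_x", "desc_y"],
    "Imp_values": [0.05, 0.40, 0.25, 0.30],
})

ranked = ranked.sort_values("Imp_values", ascending=False).reset_index(drop=True)
ranked["cumulative"] = ranked["Imp_values"].cumsum()

# Number of top-ranked features needed to cover at least 90% of total importance.
n_needed = int((ranked["cumulative"] < 0.9).sum()) + 1
print(ranked.head(n_needed))
```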
8. Interpreting the model using SHAP explainer
Install the SHAP library:
!pip install shap
Compute Shapley values for the test set:
import shap
explain = shap.Explainer(rf)
shapvalues = explain.shap_values(X_test)
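To make concrete what a Shapley value is, the sketch below computes exact values for a two-feature toy game by averaging each feature's marginal contribution over all orderings. This brute-force approach is for illustration only (SHAP uses a far more efficient algorithm for tree models), and the payoff numbers are hypothetical:

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical model output for each subset of known features.
payoff = {
    frozenset(): 0.0,
    frozenset({"a"}): 10.0,
    frozenset({"b"}): 20.0,
    frozenset({"a", "b"}): 50.0,
}

phi = shapley_values(["a", "b"], lambda s: payoff[frozenset(s)])
print(phi)
```

The two values add up to the difference between the full-coalition payoff and the empty one. SHAP relies on the same property when it splits a prediction's deviation from the baseline across features.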
Generate a global SHAP summary plot:
shap.summary_plot(shapvalues, X_test)
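Plot the SHAP values for the first class (index 0, which is BBB- under the encoding above). The indexing assumes that shap_values returned a list with one array per class: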
shap.summary_plot(shapvalues[0], X_test)
Generate a dependence plot for the most important feature, TopoPSA:
shap.dependence_plot("TopoPSA", shapvalues[0], X_test)
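The return type of shap_values for a classifier differs between SHAP releases: older ones give a list with one (samples, features) array per class, while newer ones give a single array of shape (samples, features, classes). In the second case, indexing with [0] selects the first sample rather than the first class. A small helper that accepts either form, sketched with NumPy only; the function name class_shap_values is hypothetical and not part of the SHAP API:

```python
import numpy as np


def class_shap_values(shap_output, class_index):
    """Return a (n_samples, n_features) array for one class from either format."""
    if isinstance(shap_output, list):
        # Older format: list with one 2-D array per class.
        return np.asarray(shap_output[class_index])
    arr = np.asarray(shap_output)
    if arr.ndim == 3:
        # Newer format: (n_samples, n_features, n_classes).
        return arr[:, :, class_index]
    return arr  # already 2-D, e.g. a single-output model


# Fake SHAP output in both layouts to check that they agree.
n_samples, n_features, n_classes = 4, 3, 2
as_array = np.arange(n_samples * n_features * n_classes, dtype=float).reshape(
    n_samples, n_features, n_classes
)
as_list = [as_array[:, :, k] for k in range(n_classes)]

from_list = class_shap_values(as_list, 1)
from_array = class_shap_values(as_array, 1)
print(from_list.shape, from_array.shape)
```

With this in place, a call such as shap.summary_plot(class_shap_values(shapvalues, 1), X_test) would plot the contributions towards BBB+ under either layout.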
9. Bottom line
SHAP analysis identified TopoPSA (topological polar surface area) as the most influential descriptor for BBB permeability prediction, consistent with the established literature linking polar surface area to CNS penetration. Interpretable models like this provide not just predictions but the feature-level reasoning behind them, which is useful for guiding structural optimization in CNS drug design.