Interpretable Machine Learning as a Key to Understanding BBB Permeability

Jan 23, 2024·

Yassir Boulaamane

· 3 min read

The blood-brain barrier (BBB) is a vital selective barrier in the central nervous system. Assessing the permeability of compounds across the BBB is crucial for drug development targeting the brain. While clinical experiments are accurate, they are time-consuming and costly. Computational methods offer an alternative for predicting BBB permeability.

1. Downloading the dataset

The dataset used here is B3DB, curated by Meng et al. (2021). It contains over 7000 compounds, and the extended file includes 1613 chemical descriptors calculated with Mordred. Download it directly from the repository:

!wget https://github.com/theochem/B3DB/raw/87240af2b4e585d56f9681a6426af6b7f2940e96/B3DB/B3DB_classification_extended.tsv.gz

Decompress the file:

import gzip
import shutil

input_file_path = "B3DB_classification_extended.tsv.gz"
output_file_path = "B3DB_classification_extended.tsv"

with gzip.open(input_file_path, 'rb') as f_in:
	with open(output_file_path, 'wb') as f_out:
		shutil.copyfileobj(f_in, f_out)

print(f"File '{input_file_path}' has been successfully extracted to '{output_file_path}'.")

Load the dataset into a DataFrame:

import pandas as pd

df = pd.read_csv("B3DB_classification_extended.tsv", sep='\t')
df
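As an aside, pandas can usually read a gzip-compressed file directly, inferring the compression from the ".gz" extension, which would make the separate decompression step optional. The sketch below demonstrates this on a tiny made-up file written to a temporary directory; the file name and contents are illustrative and not part of B3DB.

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a small tab-separated file, gzip-compressed (hypothetical content).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy.tsv.gz")
with gzip.open(path, "wt", encoding="utf-8") as handle:
    handle.write("name\tvalue\nA\t1\nB\t2\n")

# read_csv infers gzip compression from the ".gz" suffix by default.
toy = pd.read_csv(path, sep="\t")
print(toy.shape)
```

The same call pattern should apply to the downloaded archive, assuming it keeps its ".tsv.gz" name.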

2. Curating the dataset

Drop columns with missing values:

df = df.dropna(axis=1)
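Because this call removes every column that contains at least one missing value, it is worth checking how many descriptors survive. A minimal sketch on a made-up frame (column names and values are illustrative, not taken from B3DB):

```python
import numpy as np
import pandas as pd

# Toy frame: only "desc_b" has a missing entry.
toy = pd.DataFrame({
    "desc_a": [1.0, 2.0, 3.0],
    "desc_b": [0.5, np.nan, 0.7],
    "desc_c": [10, 20, 30],
})

cleaned = toy.dropna(axis=1)  # drop every column containing a NaN
dropped = [c for c in toy.columns if c not in cleaned.columns]

print("kept:", list(cleaned.columns))
print("dropped:", dropped)
```

On the real dataset, comparing the shape of df before and after the call gives the same information.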

3. Labelling the dataset

Encode the BBB+/BBB- column as a binary label (1 = BBB+, 0 = BBB-):

df['labels'] = df['BBB+/BBB-'].apply(lambda x: 0 if x == 'BBB-' else 1)
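Note that the lambda above maps anything other than 'BBB-' to 1, including unexpected strings. A slightly stricter variant, sketched here on a toy column, uses an explicit mapping and then inspects the class balance; the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the "BBB+/BBB-" column.
toy = pd.DataFrame({"BBB+/BBB-": ["BBB+", "BBB-", "BBB+", "BBB+"]})

# An explicit dictionary leaves unexpected values as NaN instead of
# silently turning them into 1.
toy["labels"] = toy["BBB+/BBB-"].map({"BBB+": 1, "BBB-": 0})

unmapped = int(toy["labels"].isna().sum())
counts = toy["labels"].value_counts().to_dict()
print(counts, "unmapped:", unmapped)
```

Checking the class counts at this point also shows how imbalanced the two classes are, which matters when reading the metrics later.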

4. Selecting chemical descriptors

Select the descriptor columns as features. The slice 6:738 is end-exclusive, so it takes column positions 6 through 737:

features = df.iloc[:, 6:738]
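Positional slicing is easy to get wrong by one column, so a quick sanity check can help: confirm that the slice holds only numeric columns and that the label column did not slip into the features. The sketch below uses a made-up five-column frame; the column names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "compound_name": ["a", "b"],
    "SMILES": ["CCO", "CCN"],
    "desc_1": [0.1, 0.2],
    "desc_2": [3, 4],
    "labels": [1, 0],
})

# Positions 2 and 3 only; the end index (4) is excluded.
toy_features = toy.iloc[:, 2:4]

non_numeric = toy_features.select_dtypes(exclude="number").columns.tolist()
label_leak = "labels" in toy_features.columns
print(list(toy_features.columns), non_numeric, label_leak)
```

Running the same two checks on the real features frame would catch a text column or a leaked label before training.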

Extract the labels:

labels = df["labels"]

5. Building the model

Import the necessary scikit-learn modules:

from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report

Split the data (70/30 train/test, stratified):

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=42, shuffle=True, stratify=labels
)
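To illustrate what the stratify argument does, the following sketch splits a synthetic, imbalanced label vector (80 negatives, 20 positives, invented for this example) and compares the positive fraction in each part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 20% positive class.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, shuffle=True, stratify=y
)

# With stratification, both parts keep the original class ratio.
print(y.mean(), y_tr.mean(), y_te.mean())
```

Without stratify, the positive fraction in the test set would vary with the random seed, which makes metrics on a small minority class less stable.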

Train the Random Forest classifier:

rf = RandomForestClassifier(random_state=42)  # fixed seed for reproducible results
rf.fit(X_train, y_train)
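The imports above include cross_val_score, which the walkthrough does not use. One way it could complement the single train/test split is a stratified 5-fold cross-validation on the training data. This is a minimal sketch on synthetic data from make_classification, not on B3DB, and the fold count and forest size are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the descriptor matrix and labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One ROC AUC value per fold; the spread hints at how stable the model is.
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())
```

In the tutorial's setting, X and y would be replaced by X_train and y_train so that the test set stays untouched.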

6. Evaluating the model’s performance

Generate predictions on the test set:

y_pred = rf.predict(X_test)

Print the classification report:

# classification_report expects (y_true, y_pred) in that order
print(classification_report(y_test, y_pred))

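Compute the ROC AUC score from the predicted probability of the BBB+ class, so the score reflects how well compounds are ranked rather than a single decision threshold:

y_proba = rf.predict_proba(X_test)[:, 1]  # column 1 = class 1 (BBB+)
roc_auc = roc_auc_score(y_test, y_proba)
print("ROC AUC Score:", roc_auc)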
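Two details of the evaluation are easy to miss: confusion_matrix (imported earlier but unused) takes the true labels first, and roc_auc_score gives different numbers for hard labels and for continuous scores. The sketch below shows both on six hand-made predictions; all values are invented for illustration.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]
y_hard = [0, 1, 1, 1, 0, 0]                # thresholded predictions
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2]   # predicted probability of class 1

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_hard).ravel()

auc_from_labels = roc_auc_score(y_true, y_hard)
auc_from_scores = roc_auc_score(y_true, y_score)
print(tn, fp, fn, tp, auc_from_labels, auc_from_scores)
```

The score-based value uses the full ranking of samples, while the label-based value only sees one operating point.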

7. Calculating the most important features

Retrieve the feature importances from the fitted model:

rf.feature_importances_

Build a ranked DataFrame of feature importances:

xfeatures = pd.DataFrame({"features": features.columns, "Imp_values": rf.feature_importances_})

Sort by descending importance:

xfeatures = xfeatures.sort_values("Imp_values", ascending=False)
xfeatures
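The impurity-based importances of a scikit-learn random forest are normalized to sum to 1, so each value can be read as a share of the total. The sketch below builds the same kind of ranked table on synthetic data and keeps only the top three rows; the descriptor names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 8 features, 3 of them informative.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
cols = [f"desc_{i}" for i in range(X.shape[1])]  # placeholder names

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

ranking = (
    pd.DataFrame({"features": cols, "Imp_values": model.feature_importances_})
    .sort_values("Imp_values", ascending=False)
    .reset_index(drop=True)
)
top3 = ranking.head(3)
total = ranking["Imp_values"].sum()
print(top3)
print("sum of importances:", total)
```

With hundreds of descriptors, as in this dataset, individual shares tend to be small, so the ranking is usually more informative than the absolute values.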

8. Interpreting the model using SHAP explainer

Install the SHAP library:

!pip install shap

Compute Shapley values for the test set:

import shap
explain = shap.Explainer(rf)
shapvalues = explain.shap_values(X_test)

Generate a global SHAP summary plot:

shap.summary_plot(shapvalues, X_test)

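Plot the SHAP values for class 0 (BBB-) only. Indexing with [0] assumes shap_values returned a list with one array per class, which depends on the installed SHAP version: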

shap.summary_plot(shapvalues[0], X_test)

Generate a dependence plot for the most important feature, TopoPSA:

shap.dependence_plot("TopoPSA", shapvalues[0], X_test)
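Depending on the installed SHAP version, shap_values for a binary classifier may come back either as a list with one array per class or as a single three-dimensional array (samples × features × classes). In the second case, indexing with [0] selects the first sample rather than the first class. The helper below, select_class_shap, is a hypothetical name introduced here, not part of SHAP; it is a sketch that normalizes both layouts to a two-dimensional array for one class and is exercised on fake NumPy arrays.

```python
import numpy as np


def select_class_shap(values, class_index):
    """Return a (n_samples, n_features) array of SHAP values for one class."""
    if isinstance(values, list):
        # Layout A: list with one 2-D array per class.
        return np.asarray(values[class_index])
    values = np.asarray(values)
    if values.ndim == 3:
        # Layout B: (n_samples, n_features, n_classes).
        return values[:, :, class_index]
    return values  # already 2-D (single-output model)


# Fake SHAP outputs for 4 samples, 3 features, 2 classes.
fake_list = [np.zeros((4, 3)), np.ones((4, 3))]
fake_3d = np.stack(fake_list, axis=-1)

from_list = select_class_shap(fake_list, 1)
from_3d = select_class_shap(fake_3d, 1)
print(from_list.shape, from_3d.shape)
```

With such a helper, passing select_class_shap(shapvalues, 1) to the plotting calls would show contributions toward BBB+ (label 1), which is often the more natural direction to read.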

9. Bottom line

SHAP analysis identified TopoPSA (topological polar surface area) as the most influential descriptor for BBB permeability prediction, consistent with the established literature linking polar surface area to CNS penetration. Interpretable models like this provide not just predictions but also the feature-level reasoning behind them, which is useful for guiding structural optimization in CNS drug design.

Last updated on Jan 23, 2024