How to Use DataWarrior for Drug Discovery: Key Workflows From the Villoutreix Tutorials

Nov 15, 2025 · Yassir Boulaamane

Why DataWarrior Matters

If you work in drug discovery or cheminformatics, you frequently handle large chemical datasets that require cleaning, filtering, visualization, and export.
DataWarrior is designed exactly for this: an open-source, chemistry-aware data workbench that is free for academic and commercial use.

Key advantages include:

  • Chemistry-aware filtering and visualization
  • 2D/3D plots and interactive exploration
  • Substructure search and chemical intelligence
  • Property calculation, combinatorial libraries, PCA/t-SNE
  • Docking inside a protein pocket
  • Support for SMILES, SDF, CSV, tab-delimited text
  • Cross-platform (Windows, macOS, Linux)

Bruno Villoutreix’s tutorial series provides an excellent guided introduction. This post summarizes Part 1 and connects it to practical drug-discovery workflows.


Core Capabilities Covered in Part 1

A Versatile Workbench

DataWarrior is more than a viewer — it is a chemometrics and data-analysis environment with dynamic plots, smart filtering, and chemical intelligence.

Key Features

  • Real-time text, numeric, and chemical filtering
  • 2D / 3D visualization modules
  • Molecular property prediction
  • Combinatorial library enumeration
  • PCA and t-SNE
  • Docking and non-chemical data support

Data Import & Formats

DataWarrior handles common cheminformatics formats:

  • SMILES, SDF (V2000/V3000)
  • CSV, tab-delimited text
  • Native DWAR project files (.dwar)

This makes it ideal as a front-end before passing data to docking, QSAR, MD, or ML workflows.


Interface Overview

The interface is flexible and powerful:

  • Many options are accessible via right-click menus
  • Layouts can be customized with sub-windows
  • Some tools are hidden behind pop-ups
  • There is a learning curve, but productivity increases quickly

Think of it as a chemistry-aware spreadsheet built for cheminformatics.


External Data: Wikipedia Molecule Import

Part 1 demonstrates an impressive feature:

  • DataWarrior can download all chemical structures from Wikipedia (roughly 22,000 molecules; the exact count depends on the version).
  • You can filter, visualize, and explore the full set offline.

This dataset is perfect for practicing workflows before loading your own project data.


Practical Filtering Workflow (Name + Substructure)

Villoutreix demonstrates a realistic medicinal-chemistry filtering pipeline:

1. Name-Based Filtering

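Use regular expressions to identify drug classes by name suffix, for example:

  • .*ib (inhibitors, including kinase inhibitors such as the -tinib series)
  • .*sartan (angiotensin II receptor blockers)
  • .*azole (azoles, including antifungals such as the -conazole series)

These patterns are broad, so review the hits: .*azole, for instance, also matches proton-pump inhibitors such as omeprazole.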

Filtering Wikipedia’s 22k molecules yields a much smaller, focused set.
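
The same suffix rules can be tried outside DataWarrior. The following is a minimal Python sketch, not DataWarrior's own implementation: it assumes a pattern must match the whole name and that case is ignored, and the short list of names is purely illustrative. The helper name filter_by_suffix is hypothetical.

```python
import re

# Illustrative names only; a real run would use the name column of the dataset.
names = ["Imatinib", "Celecoxib", "Losartan", "Fluconazole", "Omeprazole", "Aspirin"]

# Suffix patterns from the text, matched against the whole name (assumption).
patterns = {
    "ib": r".*ib",
    "sartan": r".*sartan",
    "azole": r".*azole",
}

def filter_by_suffix(names, pattern):
    """Return the names that match the pattern in full, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [name for name in names if regex.fullmatch(name)]

hits = {label: filter_by_suffix(names, pattern) for label, pattern in patterns.items()}
for label, matched in hits.items():
    print(label, matched)
```

In this toy list the .*ib pattern returns both imatinib (a kinase inhibitor) and celecoxib (a COX-2 inhibitor), which is one reason to combine a name filter with a second, structure-based filter.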

2. Substructure Filtering

Draw a substructure (e.g., a piperazine or piperidine ring) in the structure filter window.
With both filters active, DataWarrior keeps only the molecules that contain the substructure and match the name rule.
Matching fragments appear highlighted in red.

3. Create a Row List & Export

  • Create a custom list (e.g., cancer).
  • Add the SMILES code of each molecule as a new column.
  • Remove unnecessary columns (formula, MW, structure).
  • Save as a tab-delimited text file (cancer.txt).

Repeat the workflow for other suffixes, such as “-azole”, to build an antifungal set.
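
A tab-delimited export such as cancer.txt is easy to read back in a downstream script. The sketch below uses Python's csv module on an in-memory stand-in for the file. The column headers "Name" and "Smiles" and the two placeholder rows are assumptions for illustration, not the real content of an export, so check the header line of your own file first.

```python
import csv
import io

# Stand-in for the contents of an exported file such as cancer.txt.
# Headers and rows are placeholders (assumption), not real export content.
exported_text = (
    "Name\tSmiles\n"
    "Aspirin\tCC(=O)Oc1ccccc1C(=O)O\n"
    "Caffeine\tCn1cnc2c1c(=O)n(C)c(=O)n2C\n"
    "Unnamed\t\n"
)

def read_export(handle, smiles_column="Smiles"):
    """Parse tab-delimited rows and drop those with an empty SMILES field."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if (row.get(smiles_column) or "").strip()]

rows = read_export(io.StringIO(exported_text))
for row in rows:
    print(row["Name"], row["Smiles"])
```

For a file on disk, the same function can be called with open("cancer.txt", newline="", encoding="utf-8") in place of the io.StringIO object.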


Integrating DataWarrior Into Your Workflow

1. Import Your Data

Load SDF or SMILES files and merge them with ADMET, assay, or physicochemical tables.
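
Conceptually, this merge is a join on a shared compound identifier. The sketch below shows the idea in plain Python and is not how DataWarrior implements it. The column names "ID" and "pIC50", the compound IDs, and the values are all made up for illustration, and left_join is a hypothetical helper.

```python
# Made-up structure and assay tables keyed by a shared "ID" column (assumption).
structures = [
    {"ID": "CPD-001", "Smiles": "CCO"},
    {"ID": "CPD-002", "Smiles": "c1ccccc1"},
    {"ID": "CPD-003", "Smiles": "CC(=O)O"},
]
assays = [
    {"ID": "CPD-001", "pIC50": 6.1},
    {"ID": "CPD-003", "pIC50": 4.8},
]

def left_join(left, right, key):
    """Keep every row of `left`; add matching columns from `right`, or None."""
    lookup = {row[key]: row for row in right}
    extra_columns = sorted({col for row in right for col in row if col != key})
    merged = []
    for row in left:
        match = lookup.get(row[key], {})
        combined = dict(row)
        for col in extra_columns:
            combined[col] = match.get(col)
        merged.append(combined)
    return merged

merged = left_join(structures, assays, key="ID")
for row in merged:
    print(row)
```

Keeping unmatched structures with an empty assay value, rather than dropping them, makes it easy to see which compounds still lack data.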

2. Smart Filtering

Use both:

  • Name patterns (series, suffixes, pharmacological classes)
  • Substructure filters (scaffolds, R-groups, functional motifs)

3. Visualization

Quickly inspect:

  • Activity distributions
  • Scatter plots
  • PCA/t-SNE chemical space
  • Outliers and series clusters

4. Export for Modelling

Save clean SMILES + metadata for:

  • QSAR / ML
  • Docking
  • MD setup
  • Library design

DataWarrior becomes the interactive triage layer before computational modeling.


Why Villoutreix’s Series Is Worth Watching

  • Short and dense: ~85 minutes across 9 videos
  • Realistic medicinal chemistry examples
  • Shows where to click and how features connect
  • Useful even for intermediate users — many hidden features become obvious after watching

Getting Started Today

You can reproduce Part 1 in minutes:

  1. Install DataWarrior from the official website.
  2. Import the Wikipedia dataset directly from within the software.
  3. Apply regular expression filtering (e.g., .*ib).
  4. Add a substructure filter.
  5. Create a row list → add SMILES → export your filtered set.
  6. Repeat with another suffix (e.g., .*azole).

This prepares you to bring in your own datasets and use DataWarrior as your daily workbench.


Final Thoughts

DataWarrior is a powerful, free, and chemistry-aware tool that helps you:

  • Explore large datasets interactively
  • Build focused libraries rapidly
  • Clean data before modeling
  • Visualize chemical space
  • Export ready-to-use hit lists

Paired with the Villoutreix tutorial series, it becomes a practical, everyday companion for drug-discovery and cheminformatics work.