Supervised vs. Unsupervised Methods in Machine Learning

Oct 16, 2022 · Yassir Boulaamane · 3 min read

The growth of biomedical datasets in chemistry and life sciences has made machine learning an increasingly practical tool for drug discovery. Models trained on bioactivity and pharmacokinetic data can guide early-stage screening and reduce experimental burden.

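Machine learning approaches are commonly grouped into three broad categories based on the nature of the feedback available to the learning algorithm: supervised, unsupervised, and reinforcement learning. This post covers the first two.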

1. Supervised learning

In supervised learning, a model is trained on a labeled dataset and learns to predict outcomes for new, unseen instances.


Figure 1: Example of a chemical dataset viewed with Pandas.

The column headers, such as molecule_chembl_id and smiles, are called attributes. The standard_value column holds a numerical measurement for each sample, whereas the class column holds a categorical value that is either 1 (active) or 0 (inactive). Columns are also referred to as features, and each row, which carries one value for every attribute, is referred to as an observation. On a plot, an observation would appear as a single data point.

Looking at the values themselves, there are two kinds. The first is numerical, which is the kind most commonly used in machine learning. The second is categorical: the value names a group rather than measuring a quantity. Categorical values are often stored as text, although here the two classes are encoded as 1 and 0. The target in this dataset is categorical because the dataset is built for classification.
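To make the vocabulary concrete, here is a minimal, library-free sketch of such a table. The column names come from Figure 1, but the identifiers and activity values are made-up placeholders rather than real ChEMBL entries, and the variable names (`records`, `numerical`, `text_columns`, `labels`) are illustrative assumptions.

```python
# Hypothetical records mirroring the columns in Figure 1.
# The IDs and standard_value numbers are made-up placeholders, not real data.
records = [
    {"molecule_chembl_id": "ID_001", "smiles": "CCO", "standard_value": 250.0, "class": 1},
    {"molecule_chembl_id": "ID_002", "smiles": "c1ccccc1", "standard_value": 12000.0, "class": 0},
    {"molecule_chembl_id": "ID_003", "smiles": "CC(=O)O", "standard_value": 800.0, "class": 1},
]

# Each dictionary is one observation (row); each key is one attribute (column).
columns = list(records[0].keys())

# Group the columns by the kind of value they hold.
numerical = [c for c in columns if all(isinstance(r[c], float) for r in records)]
text_columns = [c for c in columns if all(isinstance(r[c], str) for r in records)]

# The class column is the categorical label a classifier would learn to predict.
labels = [r["class"] for r in records]

print("numerical columns:", numerical)
print("text columns:", text_columns)
print("labels:", labels)
```

In practice the same table would usually be loaded into a Pandas DataFrame, as in Figure 1; plain dictionaries are used here only to keep the sketch dependency-free.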


Figure 2: Supervised learning.

Supervised learning covers two main task types. Classification predicts a discrete class label. Regression predicts a continuous value.
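The difference between the two task types shows up in what the model returns. The sketch below uses made-up numbers and two deliberately simple models chosen for illustration (a one-feature least-squares line and a 1-nearest-neighbour classifier). Neither is claimed to be the method behind the figures in this post, and all names are hypothetical.

```python
from statistics import mean

# Hypothetical training data: one numeric descriptor per sample (made-up values).
x_train = [1.0, 2.0, 3.0, 4.0]
y_value = [2.0, 4.0, 6.0, 8.0]  # continuous target, used for regression
y_class = [0, 0, 1, 1]          # discrete label, used for classification


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def predict_class(xs, labels, x_new):
    """1-nearest-neighbour: copy the label of the closest training point."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x_new))
    return labels[nearest]


slope, intercept = fit_line(x_train, y_value)
regression_output = slope * 5.0 + intercept                   # a continuous value
classification_output = predict_class(x_train, y_class, 3.6)  # a class label

print("regression prediction:", regression_output)
print("classification prediction:", classification_output)
```

Both models learn from labeled examples. Only the type of the label, and therefore of the prediction, differs.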

2. Unsupervised learning

In unsupervised learning, the model operates on unlabeled data and identifies structure without prior knowledge of expected outcomes. Because there is no ground truth to guide training, these methods tend to be more complex to evaluate. Common techniques include dimensionality reduction, density estimation, market basket analysis, and clustering. Dimensionality reduction and feature selection help by removing redundant information, making underlying patterns more apparent.
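As a rough illustration of removing redundant information, the sketch below drops any feature that is almost perfectly correlated with one already kept. This is a simple correlation filter, which is a form of feature selection and not a full dimensionality reduction method such as PCA. The descriptor names, their values, and the 0.95 cutoff are all assumptions made for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical descriptor table (made-up values). mw_kda is mw_da / 1000,
# so it carries no information beyond what mw_da already provides.
features = {
    "mw_da": [180.0, 320.0, 250.0, 410.0],
    "mw_kda": [0.18, 0.32, 0.25, 0.41],
    "ring_count": [1.0, 3.0, 1.0, 2.0],
}

THRESHOLD = 0.95  # assumed cutoff for "redundant"

# Keep a feature only if it is not highly correlated with any feature kept so far.
kept = []
for name, values in features.items():
    if all(abs(pearson(values, features[other])) < THRESHOLD for other in kept):
        kept.append(name)

print("features kept:", kept)
```

No labels are involved at any point: the filter looks only at relationships among the features themselves.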


Figure 3: Unsupervised learning tasks. Image by Dmytro Nikolaiev (medium.com/@andimid).

Market basket analysis is a modeling technique based on the idea that customers who buy a certain group of items are more likely to buy another group of items. Density estimation models how the data are distributed and is mostly used to explore a dataset and find structure within it. Clustering is one of the most widely used unsupervised techniques and groups data points by similarity. Its applications include customer segmentation, pattern discovery, and anomaly detection.
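Grouping by similarity can be sketched with k-means, one common clustering algorithm, picked here as an example and not one the text above prescribes. The points and the choice of starting centroids are made up for illustration, and `k_means` is a hypothetical helper, not a library function.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def k_means(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and updating centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Made-up 2D points forming two visibly separate groups; no labels are given.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Assumption: start from one point in each region. Real implementations
# usually choose starting centroids randomly or with a seeding heuristic.
centroids, clusters = k_means(points, [points[0], points[3]])

print("centroids:", centroids)
print("cluster sizes:", [len(c) for c in clusters])
```

The algorithm never sees a class column. The two groups emerge purely from the distances between points, which is also why judging the quality of the result is harder than in the supervised case.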

Bottom line

The core distinction is that supervised learning requires labeled data, while unsupervised learning does not. Supervised methods cover classification and regression; unsupervised methods cover clustering, dimensionality reduction, and density estimation. Unsupervised learning has fewer standardized evaluation metrics, making it harder to assess whether the model output is meaningful.

Last updated on Oct 16, 2022