Understanding Dataset-level Labeling Issues

This 5-minute quickstart tutorial shows how cleanlab.dataset.health_summary() helps you automatically:

  • Score and rank the overall label quality of each class, useful for deciding whether to remove or keep certain classes.

  • Identify overlapping classes that you can merge to make the learning task less ambiguous. Alternatively, use this information to refine your annotator instructions (e.g., by defining the difference between two classes more precisely).

  • Generate an overall dataset and label quality health score to track improvements in your labels over time as you clean your datasets.

This tutorial does not study issues in individual data points, but rather global issues across the dataset. Much of the functionality demonstrated here can also be accessed via Datalab.get_info() when using Datalab to detect label issues.
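To make the three dataset-level outputs listed above more concrete, here is a rough NumPy-only sketch. It is a simplified stand-in and not how cleanlab computes its scores: it only counts how often the model's most likely class agrees with the given label. The function name `naive_class_summary` and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def naive_class_summary(labels, pred_probs):
    """Simplified stand-in for dataset-level diagnostics (NOT cleanlab's algorithm)."""
    labels = np.asarray(labels)
    num_classes = pred_probs.shape[1]
    predicted = pred_probs.argmax(axis=1)

    # counts[i, j] = number of examples labeled i whose most likely class is j
    counts = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(counts, (labels, predicted), 1)

    # Per-class agreement: fraction of examples labeled i that are also predicted i
    per_class = counts.diagonal() / np.maximum(counts.sum(axis=1), 1)

    # Symmetric disagreement counts hint at which pair of classes overlaps most
    overlap = counts + counts.T
    np.fill_diagonal(overlap, 0)
    i, j = np.unravel_index(overlap.argmax(), overlap.shape)

    # Overall agreement rate across the whole dataset
    overall = counts.diagonal().sum() / len(labels)
    return per_class, (int(i), int(j)), float(overall)

# Toy data (hypothetical): 8 examples, 3 classes
labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])
pred_probs = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],  # labeled 0, predicted 1
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],  # labeled 1, predicted 0
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
    [0.0, 0.1, 0.9],
])
per_class, most_confused_pair, overall = naive_class_summary(labels, pred_probs)
print(per_class, most_confused_pair, overall)
```

In this toy data, classes 0 and 1 are the pair most often confused with each other, which is the kind of signal that would suggest merging them or tightening their definitions.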


Already have (out-of-sample) pred_probs from a model trained on your dataset? Run the code below to evaluate the overall health of your dataset and its labels.

from cleanlab.dataset import health_summary

health_summary(labels, pred_probs)
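Before running the call above, it can help to confirm that the inputs have the expected layout: `labels` as a 1-D array of integer class indices in 0..K-1, and `pred_probs` as an N x K array whose rows are probability distributions. The sketch below is one way to check this; `check_inputs` is a hypothetical helper written for this example and is not part of cleanlab.

```python
import numpy as np

def check_inputs(labels, pred_probs, atol=1e-6):
    """Hypothetical helper: basic sanity checks on labels and pred_probs."""
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)

    if labels.ndim != 1:
        raise ValueError("labels must be a 1-D array of class indices")
    if not np.issubdtype(labels.dtype, np.integer):
        raise ValueError("labels must contain integer class indices")
    if pred_probs.ndim != 2 or pred_probs.shape[0] != labels.shape[0]:
        raise ValueError("pred_probs must have shape (len(labels), num_classes)")
    if labels.min() < 0 or labels.max() >= pred_probs.shape[1]:
        raise ValueError("labels must lie in 0..num_classes-1")
    if not np.allclose(pred_probs.sum(axis=1), 1.0, atol=atol):
        raise ValueError("each row of pred_probs should sum to 1")
    return labels, pred_probs

# Toy inputs (hypothetical): 3 examples, 2 classes
labels, pred_probs = check_inputs(
    [0, 1, 1],
    [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]],
)
print(labels.shape, pred_probs.shape)
```

These checks cover only the shape and range of the inputs; they cannot tell whether the probabilities are truly out-of-sample, which depends on how the model was trained.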

Install dependencies and import them

You can use pip to install all packages required for this tutorial as follows:

!pip install requests
!pip install cleanlab
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
#     !pip install git+https://github.com/cleanlab/cleanlab.git
import requests
import io
import cleanlab
import numpy as np

Fetch the data (can skip these details)


# Note: This pulldown content is for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.

amazon = ['Negative', 'Neutral', 'Positive']
imdb_test_set = ["Negative", "Positive"]

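# The class-name lists for the remaining datasets (imagenet_val_set, caltech256,
# mnist_test_set, cifar10_test_set, cifar100_test_set, twenty_news_test_set)
# are not reproduced here.
ALL_CLASSES = {
    'imagenet_val_set': imagenet_val_set,
    'caltech256': caltech256,
    'mnist_test_set': mnist_test_set,
    'cifar10_test_set': cifar10_test_set,
    'cifar100_test_set': cifar100_test_set,
    'imdb_test_set': imdb_test_set,
    '20news_test_set': twenty_news_test_set,
    'amazon': amazon,
}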

def _load_classes_predprobs_labels(dataset_name):
    """Helper function to load data from the labelerrors.com datasets."""

    base = 'https://github.com/cleanlab/label-errors/raw/'
    url_base = base + '5392f6c71473055060be3044becdde1cbc18284d'
    url_labels = url_base + '/original_test_labels/{}_original_labels.npy'
    url_probs = url_base + '/cross_validated_predicted_probabilities/{}_pyx.npy'
    NUM_PARTS = {'amazon': 3, 'imagenet_val_set': 4}  # pred_probs files are split into parts for larger datasets

    response = requests.get(url_labels.format(dataset_name))
    labels = np.load(io.BytesIO(response.content), allow_pickle=True)
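    if dataset_name in NUM_PARTS:
        # Larger datasets are stored as several part files: download each part and stack them
        pred_probs_parts = []
        for i in range(1, NUM_PARTS[dataset_name] + 1):
            # NOTE: the arguments to .replace(), which turn the base URL into the
            # URL of part i, are missing from this copy and are left elided.
            url = url_probs.format(dataset_name).replace(...)
            response = requests.get(url)
            pred_probs_parts.append(
                np.load(io.BytesIO(response.content), allow_pickle=True))
        pred_probs = np.vstack(pred_probs_parts)
    else:
        response = requests.get(url_probs.format(dataset_name))
        pred_probs = np.load(io.BytesIO(response.content), allow_pickle=True)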
    print(f"\nLoaded the '{dataset_name}' dataset with predicted "
          f"probabilities of shape {pred_probs.shape}\n")

    return pred_probs, labels, ALL_CLASSES[dataset_name]
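The helper above decodes each downloaded `.npy` payload with `np.load(io.BytesIO(...))` and, for datasets split into parts, stacks the parts with `np.vstack`. The sketch below exercises that decode-and-stack step offline, with in-memory bytes standing in for the HTTP responses. The names `encode_npy` and `decode_npy` and the toy array are assumptions made for this example.

```python
import io
import numpy as np

def encode_npy(array):
    """Hypothetical helper: serialize an array to .npy bytes (stands in for a download)."""
    buffer = io.BytesIO()
    np.save(buffer, array)
    return buffer.getvalue()

def decode_npy(raw_bytes):
    """Decode .npy bytes the same way the loader treats response.content."""
    return np.load(io.BytesIO(raw_bytes), allow_pickle=True)

# Toy "pred_probs" array (hypothetical) split row-wise into 3 part files
full = np.arange(12, dtype=float).reshape(6, 2)
part_payloads = [encode_npy(part) for part in np.array_split(full, 3)]

# Decode each part and stack the rows back together, as the loader does
restored = np.vstack([decode_npy(raw) for raw in part_payloads])
print(restored.shape)
```

Because the parts are split along rows, stacking them in order recovers the original row order, so the rows of `pred_probs` stay aligned with `labels`.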