Find Label Errors in Multi-Label Classification Datasets

This 5-minute quickstart tutorial demonstrates how to find potential label errors in multi-label classification datasets. In such datasets, each example can be labeled as belonging to several classes at once (unlike multi-class classification, where each example belongs to exactly one class). For a particular example in multi-label classification data, each class either applies or does not, and some examples may have no applicable classes at all. Common applications include image tagging (or document tagging), where multiple tags can be appropriate for a single image (or document). For example, an image tagging application could involve the following classes: [copyrighted, advertisement, face, violence, nsfw].


cleanlab finds label issues based on two inputs: labels formatted as a list of lists of integer class indices that apply to each example in your dataset, and pred_probs from a trained multi-label classification model (which do not need to sum to 1 since the classes are not mutually exclusive). Once you have these, run the code below to find label issues in your dataset.

from cleanlab.filter import find_label_issues

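ranked_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    multi_label=True,
    return_indices_ranked_by="self_confidence",
)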

1. Install required dependencies and get dataset

You can use pip to install all packages required for this tutorial as follows:

!pip install cleanlab matplotlib
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
#     !pip install git+
import random
import numpy as np
import sklearn
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
import matplotlib.pyplot as plt

from cleanlab.filter import find_label_issues
import cleanlab.internal.multilabel_utils as mlutils
from cleanlab.internal.multilabel_utils import onehot2int, int2onehot
from cleanlab.benchmarking.noise_generation import (
    generate_noise_matrix_from_trace,
    generate_noisy_labels,
)

Here we generate a small multi-label classification dataset for a quick demo. To see cleanlab applied to a real image tagging dataset, check out our example notebook “Find Label Errors in Multi-Label Classification Data (CelebA Image Tagging)”.

Code to generate dataset (can skip these details)


def make_multilabel_data(
    means=[[-5, 3.5], [0, 2], [-3, 6]],
    covs=[[[3, -1.5], [-1.5, 1]], [[5, -1.5], [-1.5, 1]], [[3, -1.5], [-1.5, 1]]],
    boxes_coordinates=[[-3.5, 0, -1.5, 1.7], [-1, 3, 2, 4], [-5, 2, -3, 4], [-3, 2, -1, 4]],
    box_multilabels=[[0, 1], [1, 2], [0, 2], [0, 1, 2]],
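    sizes=[100, 80, 100],
    avg_trace=0.8,  # assumption: the original default value is missing from this copy
):
    num_classes = len(means)
    m = num_classes + len(
        box_multilabels
    )  # number of classes by treating each multilabel as 1 unique label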
    n = sum(sizes)
    local_data = []
    labels = []
    test_data = []
    test_labels = []
    for i in range(0, len(means)):
        local_data.append(np.random.multivariate_normal(mean=means[i], cov=covs[i], size=sizes[i]))
        test_data.append(np.random.multivariate_normal(mean=means[i], cov=covs[i], size=sizes[i]))
        test_labels += [[i]] * sizes[i]
        labels += [[i]] * sizes[i]

    def make_multi(X, Y, bx1, by1, bx2, by2, label_list):
        ll = np.array([bx1, by1])  # lower-left
        ur = np.array([bx2, by2])  # upper-right

        inidx = np.all(np.logical_and(X.tolist() >= ll, X.tolist() <= ur), axis=1)
        for i in range(0, len(Y)):
            if inidx[i]:
                Y[i] = label_list
        return Y

    X_train = np.vstack(local_data)
    X_test = np.vstack(test_data)

    for i in range(0, len(box_multilabels)):
        bx1, by1, bx2, by2 = boxes_coordinates[i]
        multi_label = box_multilabels[i]
        labels = make_multi(X_train, labels, bx1, by1, bx2, by2, multi_label)
        test_labels = make_multi(X_test, test_labels, bx1, by1, bx2, by2, multi_label)

    d = {}
    for i in labels:
        if str(i) not in d:
            d[str(i)] = len(d)
    inv_d = {v: k for k, v in d.items()}
    labels_idx = [d[str(i)] for i in labels]
    py = np.bincount(labels_idx) / float(len(labels_idx))
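    noise_matrix = generate_noise_matrix_from_trace(
        m,  # reconstructed argument: number of unique multi-label combinations
        trace=avg_trace * m,
        py=py,  # reconstructed argument: prior over unique labels computed above
    )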
    noisy_labels_idx = generate_noisy_labels(labels_idx, noise_matrix)
    noisy_labels = [eval(inv_d[i]) for i in noisy_labels_idx]
    return {
        "X_train": X_train,
        "true_labels_train": labels,
        "X_test": X_test,
        "true_labels_test": test_labels,
        "labels": noisy_labels,
        "dict_unique_label": d,
        'labels_idx': noisy_labels_idx,
    }


def get_color_array(labels):
    """Return a list of hex color strings, one per multi-label in `labels`."""
    dcolors ={'[0]': 'aa4400',
             '[0, 2]': '55227f',
             '[0, 1]': '55a100',
             '[1]': '00ff00',
             '[1, 2]': '007f7f',
             '[0, 1, 2]': '386b55',
             '[2]': '0000ff'}

    return ["#"+dcolors[str(i)] for i in labels]

def plot_data(data, circles, title, alpha=1.0,colors = []):
    plt.figure(figsize=(14, 5))
    done = set()
    for i in range(0,len(data)):
        lab = str(labels[i])
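        if lab in done:
            label = ""
        else:
            label = lab
            done.add(lab)
        plt.scatter(data[i, 0], data[i, 1], c=colors[i], s=30,alpha=0.6, label = label)
    for i in circles:
        # Assumption: the original loop body is missing from this copy; this draws a hollow red circle.
        plt.plot(data[i, 0], data[i, 1], "o", markerfacecolor="none", markeredgecolor="red", markersize=14, alpha=alpha)
    _ = plt.title(title, fontsize=25)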

Some of the labels in our generated dataset purposely contain errors. The examples with label errors are circled in the plot below, which depicts the dataset. This dataset contains 3 classes, and any subset of these may be the given label for a particular example. We say an example has a label error if it is better described by an alternative subset of the classes than its given label.

num_class = 3
dataset = make_multilabel_data()
labels = dataset['labels']
true_errors = np.where(np.sum(int2onehot(dataset['true_labels_train'],3)!=int2onehot(dataset['labels'],3),axis=1)>=1)[0]
plot_data(dataset['X_train'], circles=true_errors, title=f"True label errors in multi-label dataset with {num_class} classes", colors = get_color_array(labels),alpha=0.5)

2. Format data, labels, and model predictions

In multi-label classification, each example in the dataset is labeled with some subset of K possible classes. To find label issues, cleanlab requires predicted class probabilities from a trained classifier. Here we produce out-of-sample pred_probs by employing cross-validation to fit a multi-label RandomForestClassifier model via sklearn’s OneVsRestClassifier framework. OneVsRestClassifier offers an easy way to apply any multi-class classifier model from sklearn to multi-label classification tasks by fitting one binary classifier per class. We use it here for simplicity, but we advise against this approach in general because it does not model dependencies between classes.

To instead train a state-of-the-art PyTorch neural network for multi-label classification and produce pred_probs on a real image dataset (that properly account for dependencies between classes), see our example notebook “Train a neural network for multi-label classification on the CelebA dataset”.

SEED = 0
y_onehot = int2onehot(labels, K=num_class)  # labels in a binary format for sklearn OneVsRestClassifier
single_class_labels = [random.choice(i) for i in labels]  # used only for stratifying the cross-validation split
clf = OneVsRestClassifier(RandomForestClassifier(random_state=SEED))
pred_probs = np.zeros(shape=(len(labels), num_class))
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

for train_index, test_index in kf.split(X=dataset['X_train'], y=single_class_labels):
    clf_cv = sklearn.base.clone(clf)
    X_train_cv, X_test_cv = dataset['X_train'][train_index], dataset['X_train'][test_index]
    y_train_cv, y_test_cv = y_onehot[train_index], y_onehot[test_index]
    clf_cv.fit(X_train_cv, y_train_cv)
    y_pred_cv = clf_cv.predict_proba(X_test_cv)
    pred_probs[test_index] = y_pred_cv

pred_probs should be a 2D array whose rows are length-K vectors, one per example in the dataset, representing the model-estimated probability that the example belongs to each class. Since one example can belong to multiple classes in multi-label classification, these probabilities need not sum to 1. For the best label error detection performance, these pred_probs should be out-of-sample (from a copy of the model that never saw this example during training, e.g. produced via cross-validation).
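The format requirements above can be checked with a few lines of NumPy before passing anything to cleanlab. This is a minimal sketch: `check_pred_probs` is a hypothetical helper written for this illustration (it is not part of cleanlab), and the example values are the three rows displayed later in this tutorial.

```python
import numpy as np


def check_pred_probs(pred_probs, num_classes):
    """Hypothetical helper: basic sanity checks on multi-label pred_probs."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    # Expect one row per example and one column per class.
    if pred_probs.ndim != 2 or pred_probs.shape[1] != num_classes:
        raise ValueError(f"expected shape (N, {num_classes}), got {pred_probs.shape}")
    # Each entry is an independent per-class probability.
    if np.any(pred_probs < 0) or np.any(pred_probs > 1):
        raise ValueError("each entry must be a probability in [0, 1]")
    return pred_probs


example = check_pred_probs(
    [[1.0, 0.0, 0.0], [0.96, 0.09, 0.88], [1.0, 0.01, 0.22]], num_classes=3
)
# Rows are NOT required to sum to 1, because classes are not mutually exclusive.
row_sums = example.sum(axis=1)
print(row_sums)
```

Note that the check deliberately does not test row sums; a row summing to well above 1 is perfectly valid for multi-label data.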

labels should be a list of lists, whose i-th entry is a list of (integer) class indices that apply to the i-th example in the dataset. If your classes are represented as string names, you should map these to integer indices. The label for an example that belongs to none of the classes should just be an empty list [].
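If your tags are stored as strings, the mapping to integer indices can be sketched as follows. The class names are the image-tagging example from the introduction; their ordering, the variable names, and the toy tags are assumptions made for this illustration.

```python
# Class order defines the integer index of each class (assumed ordering).
class_names = ["copyrighted", "advertisement", "face", "violence", "nsfw"]
name_to_index = {name: idx for idx, name in enumerate(class_names)}

# Toy string tags for four examples; the third example has no applicable class.
tags_per_example = [["face"], ["advertisement", "face"], [], ["nsfw", "violence"]]

# Convert each example's tags into a sorted list of integer class indices.
labels_as_indices = [
    sorted(name_to_index[tag] for tag in tags) for tags in tags_per_example
]
print(labels_as_indices)  # [[2], [1, 2], [], [3, 4]]
```

Whatever ordering you pick, the columns of pred_probs must follow the same class order.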

Once you have pred_probs and labels in the appropriate formats, you can find label issues with cleanlab for any multi-label dataset!

Here’s what these look like for the first few examples in our synthetic multi-label dataset:

num_to_display = 3  # increase this to see more examples

print(f"labels for first {num_to_display} examples in format expected by cleanlab:")
print(labels[:num_to_display])
print(f"pred_probs for first {num_to_display} examples in format expected by cleanlab:")
print(pred_probs[:num_to_display])
labels for first 3 examples in format expected by cleanlab:
[[0], [0, 2], [0]]
pred_probs for first 3 examples in format expected by cleanlab:
[[1.   0.   0.  ]
 [0.96 0.09 0.88]
 [1.   0.01 0.22]]

3. Use cleanlab to find label issues

Based on the given labels and pred_probs from a trained model, cleanlab can quickly help us find label errors in our dataset. Here we request that the indices of the examples identified with label issues be sorted by cleanlab’s self-confidence score, which is used to measure the quality of individual labels. The returned issues are a list of indices corresponding to the examples in your dataset that cleanlab finds most likely to be mislabeled. These indices are sorted by the self-confidence label quality score, with the lowest quality labels at the start.

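issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    multi_label=True,
    return_indices_ranked_by="self_confidence",
)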

print(f"Indices of examples with label issues:\n{issues}")
Indices of examples with label issues:
[275 267 225  72 171 234 165  44   6  29 227 188 102 262 263  35 266 139
 143 172  53 216 265 176 164  73  75  10 159 107]

Note we specified the multi_label option above to distinguish the task from multi-class classification (otherwise assumed as the default task).

Let’s look at the samples that cleanlab thinks are most likely to be mislabeled. You can see that cleanlab was able to identify most of true_errors in our small dataset (despite not having access to this variable, which you won’t have in your own applications).

plot_data(dataset['X_train'], circles=issues, title=f"Inferred label issues in multi-label dataset with {num_class} classes", colors = get_color_array(labels), alpha = 1)
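When ground-truth errors are known, as in this synthetic demo, the overlap between flagged indices and true errors can be quantified with precision and recall. The sketch below uses a hypothetical helper and toy index arrays rather than the tutorial's actual results; in the tutorial you would pass `issues` and `true_errors` instead.

```python
import numpy as np


def issue_detection_metrics(flagged, true_errors):
    """Hypothetical helper: precision and recall of flagged indices vs. known errors."""
    flagged = set(int(i) for i in flagged)
    true_errors = set(int(i) for i in true_errors)
    hits = flagged & true_errors  # correctly flagged examples
    precision = len(hits) / len(flagged) if flagged else 0.0
    recall = len(hits) / len(true_errors) if true_errors else 0.0
    return precision, recall


# Toy indices for illustration only (not the tutorial's real output).
toy_flagged = np.array([0, 1, 2, 3])
toy_true_errors = np.array([1, 2, 3, 7, 9])
precision, recall = issue_detection_metrics(toy_flagged, toy_true_errors)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.75, recall=0.60
```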

Label quality scores

The above code identifies which examples have label issues and sorts them by their label quality score. We can also directly compute this label quality score for each example in the dataset, which estimates our confidence that this example has been correctly labeled. These scores range between 0 and 1 with smaller values indicating examples whose label seems more suspect.

from cleanlab.multilabel_classification import get_label_quality_scores

scores = get_label_quality_scores(labels, pred_probs)

print(f"Label quality scores of the first 10 examples in dataset:\n{scores[:10]}")
Label quality scores of the first 10 examples in dataset:
[1.     0.888  0.8224 0.9632 0.968  0.6512 0.0444 1.     0.76   0.774 ]
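To build intuition for these numbers, here is a NumPy sketch of one way such a score can be computed: a per-class self-confidence (the probability the model assigns to the given binary label for each class), aggregated per example with an exponential moving average over the class scores sorted from best to worst. This is an illustration and not cleanlab's implementation; the function name and the choice of alpha=0.8 are assumptions. For the first three examples shown earlier, it reproduces the first three scores printed above.

```python
import numpy as np


def sketch_label_quality_scores(labels, pred_probs, num_classes, alpha=0.8):
    """Illustrative sketch (not cleanlab's code) of a multi-label quality score."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    # Binary matrix marking which classes are in each example's given label.
    given = np.zeros((len(labels), num_classes), dtype=bool)
    for row, class_indices in enumerate(labels):
        for k in class_indices:
            given[row, k] = True
    # Per-class self-confidence: p if the class is in the label, else 1 - p.
    per_class = np.where(given, pred_probs, 1.0 - pred_probs)
    # Sort each row from highest to lowest so the worst class is weighted most.
    ordered = -np.sort(-per_class, axis=1)
    scores = ordered[:, 0].copy()
    for k in range(1, num_classes):
        scores = alpha * ordered[:, k] + (1 - alpha) * scores
    return scores


first_labels = [[0], [0, 2], [0]]
first_pred_probs = [[1.0, 0.0, 0.0], [0.96, 0.09, 0.88], [1.0, 0.01, 0.22]]
sketch_scores = sketch_label_quality_scores(first_labels, first_pred_probs, num_classes=3)
print(np.round(sketch_scores, 4))  # approximately [1.0, 0.888, 0.8224]
```

Because the lowest per-class score enters the average last, a single poorly supported class is enough to pull an example's overall score down.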

How to format labels given as a one-hot (multi-hot) binary matrix?

For multi-label classification, cleanlab expects labels to be formatted as a list of lists, where each entry is an integer corresponding to a particular class. Here are some functions you can use to easily convert labels between this format and a binary matrix format commonly used to train multi-label classification models.

from cleanlab.internal.multilabel_utils import int2onehot, onehot2int

labels_binary_format = int2onehot(labels, K=num_class)
labels_list_format = onehot2int(labels_binary_format)
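For readers who want to see what the two formats look like side by side, the conversion can be sketched in plain NumPy. The functions below are hypothetical stand-ins written for this illustration, not the cleanlab utilities themselves.

```python
import numpy as np


def lists_to_multihot(labels, num_classes):
    """Hypothetical stand-in: list-of-lists labels -> (N, K) binary matrix."""
    multihot = np.zeros((len(labels), num_classes), dtype=int)
    for row, class_indices in enumerate(labels):
        for k in class_indices:
            multihot[row, k] = 1
    return multihot


def multihot_to_lists(multihot):
    """Hypothetical stand-in: (N, K) binary matrix -> list-of-lists labels."""
    return [np.flatnonzero(row).tolist() for row in np.asarray(multihot)]


example_labels = [[0], [0, 2], []]  # the last example belongs to no class
binary = lists_to_multihot(example_labels, num_classes=3)
round_trip = multihot_to_lists(binary)
print(binary.tolist())  # [[1, 0, 0], [1, 0, 1], [0, 0, 0]]
print(round_trip)       # [[0], [0, 2], []]
```

An example with no applicable classes becomes an all-zero row in the binary format and an empty list in the list format.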

To see cleanlab applied to a real image tagging dataset, check out our example notebook “Find Label Errors in Multi-Label Classification Data (CelebA Image Tagging)”. That example also demonstrates how to use a state-of-the-art PyTorch neural network for multi-label classification with image data.