Improve Consensus Labels for Multiannotator Data#

This quickstart tutorial shows how to use cleanlab for classification data that has been labeled by multiple annotators (where each example has been labeled by at least one annotator, but not every annotator has labeled every example). Compared to existing crowdsourcing tools, cleanlab helps you better analyze such data by leveraging a trained classifier model in addition to the raw annotations. With one line of code, you can automatically compute:

  • A consensus label for each example that aggregates the individual annotations more accurately than alternative aggregation via majority-vote or other algorithms used in crowdsourcing.

  • a quality score for each consensus label which measures our confidence that this label is correct (via well-calibrated estimates that account for the: number annotators which have labeled this example, overall quality of each annotator, and quality of our trained ML models).

  • An analogous label quality score for each individual label chosen by one annotator for a particular example.

  • An overall quality score for each annotator which measures our confidence in the overall correctness of labels obtained from this annotator.

Overview of what we’ll do in this tutorial:

  • Obtain initial consensus labels of multiannotator data using majority vote.

  • Train a classifier model on the initial consensus labels and use it to obtain out-of-sample predicted class probabilites.

  • Use cleanlab’s multiannotator.get_label_quality_multiannotator function to get improved consensus labels that more accurately reflect the ground truth.

  • View other information about your multiannotator dataset, such as consensus and annotator quality scores, agreement between annotators, detailed label quality scores and more!


Already have multiannotator_labels and (out-of-sample) pred_probs from a model trained on an existing set of consensus labels? Run the code below to get improved consensus labels and more information about the quality of your labels and annotators.

from cleanlab.multiannotator import get_label_quality_multiannotator

get_label_quality_multiannotator(multiannotator_labels, pred_probs)

1. Install and import required dependencies#

You can use pip to install all packages required for this tutorial as follows:

!pip install sklearn
!pip install cleanlab

# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
#     !pip install git+

Let’s import some of the packages needed throughout this tutorial.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

from cleanlab.multiannotator import get_label_quality_multiannotator, get_majority_vote_label

2. Create the data (can skip these details)#

For this tutorial we will generate a toy dataset that has 50 annotators and 300 examples. There are three possible classes, 0, 1 and 2.

Each annotator annotates approximately 10% of the examples. We also synthetically made the last 5 annotators in our toy dataset have much noisier labels than the rest of the annotators.

Solely for evaluating cleanlab’s consensus labels against other consensus methods, we here also generate the true labels for this example dataset. However, true labels are not required for any cleanlab multiannotator functions (and they usually are not available in real applications). To generate our multiannotator data, we define a make_data() method (can skip these details).

See the code for data generation (click to expand)

# Note: This pulldown content is for, if running on local Jupyter or Colab, please ignore it.

from cleanlab.benchmarking.noise_generation import generate_noise_matrix_from_trace
from cleanlab.benchmarking.noise_generation import generate_noisy_labels

SEED = 111 # set to None for non-reproducible randomness

def make_data(
    means=[[3, 2], [7, 7], [0, 8]],
    covs=[[[5, -1.5], [-1.5, 1]], [[1, 0.5], [0.5, 4]], [[5, 1], [1, 5]]],
    sizes=[150, 75, 75],

    m = len(means)  # number of classes
    n = sum(sizes)
    local_data = []
    labels = []

    for idx in range(m):
            np.random.multivariate_normal(mean=means[idx], cov=covs[idx], size=sizes[idx])
        labels.append(np.array([idx for i in range(sizes[idx])]))
    X_train = np.vstack(local_data)
    true_labels_train = np.hstack(labels)

    # Compute p(true_label=k)
    py = np.bincount(true_labels_train) / float(len(true_labels_train))

    noise_matrix_better = generate_noise_matrix_from_trace(
        trace=0.8 * m,

    noise_matrix_worse = generate_noise_matrix_from_trace(
        trace=0.35 * m,

    # Generate our noisy labels using the noise_matrix for specified number of annotators.
    s = pd.DataFrame(
                generate_noisy_labels(true_labels_train, noise_matrix_better)
                if i < num_annotators - 5
                else generate_noisy_labels(true_labels_train, noise_matrix_worse)
                for i in range(num_annotators)

    # Each annotator only labels approximately 10% of the dataset
    # (unlabeled points represented with NaN)
    s = s.apply(lambda x: x.mask(np.random.random(n) < 0.9)).astype("Int64")
    s.dropna(axis=1, how="all", inplace=True)
    s.columns = ["A" + str(i).zfill(4) for i in range(1, num_annotators+1)]

    row_NA_check = pd.notna(s).any(axis=1)

    return {
        "X_train": X_train[row_NA_check],
        "true_labels_train": true_labels_train[row_NA_check],
        "multiannotator_labels": s[row_NA_check].reset_index(drop=True),
data_dict = make_data()

X = data_dict["X_train"]
multiannotator_labels = data_dict["multiannotator_labels"]
true_labels = data_dict["true_labels_train"] # used for comparing the accuracy of consensus labels

Let’s view the first few rows of the data used for this tutorial. Here are the labels selected by each annotator for the first few examples:

A0001 A0002 A0003 A0004 A0005 A0006 A0007 A0008 A0009 A0010 ... A0041 A0042 A0043 A0044 A0045 A0046 A0047 A0048 A0049 A0050
0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
1 <NA> <NA> <NA> <NA> <NA> <NA> 0 <NA> <NA> <NA> ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> ... <NA> 0 <NA> <NA> <NA> <NA> <NA> 2 <NA> <NA>
3 <NA> <NA> <NA> <NA> <NA> <NA> 2 <NA> <NA> <NA> ... 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> ... <NA> <NA> <NA> 2 <NA> <NA> 0 <NA> <NA> <NA>

5 rows × 50 columns

Here are the corresponding features for these examples:

array([[ 5.60856743,  1.41693214],
       [-0.40908785,  2.87147629],
       [ 4.64941785,  1.10774851],
       [ 3.0524466 ,  1.71853246],
       [ 4.37169848,  0.66031048]])

multiannotator_labels contains the class labels that each annotator chose for each example, with examples that a particular annotator did not label represented using np.nan. X contains the features for each example, which happen to be numeric in this tutorial but any feature modality can be used with cleanlab.multiannotator.

Bringing Your Own Data (BYOD)?

You can easily replace the above with your own multiannotator labels and features, then continue with the rest of the tutorial.

multiannotator_labels should be a numpy array or pandas DataFrame with each column representing an annotator and each row representing an example. Your labels should be represented as integer indices 0, 1, …, num_classes - 1, where examples that are not annotated by a particular annotator are represented using np.nan or pd.NA. Your features can be represented however you like (since these are not inputs to cleanlab.multiannotator methods) as long as you are able to fit a classifer to them and obtain its predicted class probabilities!

3. Get majority vote labels and compute out-of-sample predicted probabilites#

Before training a machine learning model, we must first obtain the consensus labels from the annotators that labeled the data. The simplest way to obtain an initial set of consensus labels is to select it using majority vote.

majority_vote_label = get_majority_vote_label(multiannotator_labels)

Next, we will train a model on the consensus labels obtained using majority vote to compute out-of-sample predicted probabilities. Here, we use a simple logistic regression model.

model = LogisticRegression()

num_crossval_folds = 5
pred_probs = cross_val_predict(
    estimator=model, X=X, y=majority_vote_label, cv=num_crossval_folds, method="predict_proba"

4. Use cleanlab to get better consensus labels and other statistics#

Using the annotators’ labels and the out-of-sample predicted probabilites from the model, cleanlab can help us obtain improved consensus labels for our data.

results = get_label_quality_multiannotator(multiannotator_labels, pred_probs, verbose=False)

Here, we use the multiannotator.get_label_quality_multiannotator() function which returns a dictionary containing three items:

  1. label_quality which gives us the improved consensus labels using information from each of the annotators and the model. The DataFrame also contains information about the number of annotations, annotator agreement and consensus quality score for each example.

consensus_label consensus_quality_score annotator_agreement num_annotations
0 0 0.778153 0.5 2
1 0 0.840381 1.0 3
2 0 0.811415 0.6 5
3 0 0.764981 0.6 5
4 0 0.848161 0.8 5
  1. detailed_label_quality which returns the label quality score for each label given by every annotator

quality_annotator_A0001 quality_annotator_A0002 quality_annotator_A0003 quality_annotator_A0004 quality_annotator_A0005 quality_annotator_A0006 quality_annotator_A0007 quality_annotator_A0008 quality_annotator_A0009 quality_annotator_A0010 ... quality_annotator_A0041 quality_annotator_A0042 quality_annotator_A0043 quality_annotator_A0044 quality_annotator_A0045 quality_annotator_A0046 quality_annotator_A0047 quality_annotator_A0048 quality_annotator_A0049 quality_annotator_A0050
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN 0.840381 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 0.811415 NaN NaN NaN NaN NaN 0.05199 NaN NaN
3 NaN NaN NaN NaN NaN NaN 0.172521 NaN NaN NaN ... 0.764981 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 0.101573 NaN NaN 0.848161 NaN NaN NaN

5 rows × 50 columns

  1. annotator_stats which gives us the annotator quality score for each annotator, alongisde other information such as the number of examples each annotator labeled, their agreement with the consensus labels and the class they perform the worst at.

annotator_quality agreement_with_consensus worst_class num_examples_labeled
A0050 0.262121 0.250000 2 24
A0047 0.296369 0.294118 2 34
A0049 0.320576 0.310345 1 29
A0046 0.357324 0.346154 1 26
A0048 0.468168 0.520000 2 25
A0031 0.525641 0.580645 2 31
A0034 0.544269 0.607143 2 28
A0021 0.585709 0.656250 1 32
A0019 0.610918 0.678571 2 28
A0015 0.620574 0.678571 2 28

The annotator_stats DataFrame is sorted by increasing annotator_quality, showing us the worst annotators first.

Notice that in the above table annotators with ids A0046 to A0050 have the worst annotator quality score, which is expected because we made the last 5 annotators systematically worse than the rest.

Comparing improved consensus labels#

We can get the improved consensus labels from the label_quality DataFrame shown above.

improved_consensus_label = results["label_quality"]["consensus_label"].values

Since our toy dataset is synthetically generated by adding noise to each annotator’s labels, we know the ground truth labels for each example. Hence we can compare the accuracy of the consensus labels obtained using majority vote, and the improved consensus labels obtained using cleanlab.

majority_vote_accuracy = np.mean(true_labels == majority_vote_label)
cleanlab_label_accuracy = np.mean(true_labels == improved_consensus_label)

print(f"Accuracy of majority vote labels = {majority_vote_accuracy}")
print(f"Accuracy of cleanlab consensus labels = {cleanlab_label_accuracy}")
Accuracy of majority vote labels = 0.9054054054054054
Accuracy of cleanlab consensus labels = 0.9763513513513513

We can see that the accuracy of the consensus labels improved as a result of using cleanlab, which not only takes the annotators’ labels into account, but also a model to compute better consensus labels.

Inspecting consensus quality scores to find potential consensus label errors#

We can get the consensus quality score from the label_quality DataFrame shown above.

consensus_quality_score = results["label_quality"]["consensus_quality_score"]

Besides obtaining improved consensus labels, cleanlab also computes consensus quality scores for each example. The lower scores represent potential consensus label errors in the dataset.

Here, we will extract 15 examples that have the lowest consensus quality score, and we can compare their average accuracy when compared to the true labels. We will also compute the average accuracy for the rest of the examples for comparison.

sorted_consensus_quality_score = consensus_quality_score.sort_values()
worst_quality = sorted_consensus_quality_score.index[:15]
better_quality = sorted_consensus_quality_score.index[15:]

worst_quality_accuracy = np.mean(true_labels[worst_quality] == improved_consensus_label[worst_quality])
better_quality_accuracy = np.mean(true_labels[better_quality] == improved_consensus_label[better_quality])

print(f"Accuracy of 15 worst quality examples = {worst_quality_accuracy}")
print(f"Accuracy of better quality examples = {better_quality_accuracy}")
Accuracy of 15 worst quality examples = 0.8
Accuracy of better quality examples = 0.9857651245551602

We observe that the 15 worst-consensus-quality-score examples have a lower average accuracy compared to the rest of the examples.

5. Retrain model using improved consensus labels#

After obtaining the improved consensus labels, we can now retrain a better version of our machine learning model using these newly obtained labels.

model = LogisticRegression()

num_crossval_folds = 5
improved_pred_probs = cross_val_predict(
    estimator=model, X=X, y=improved_consensus_label, cv=num_crossval_folds, method="predict_proba"

# alternatively, we can treat all the improved consensus labels as training labels to fit the model
#, improved_consensus_label)

You can also repeatedly iterate this process of getting better consensus labels using the model’s out-of-sample predicted probabilites and then retraining the model with the improved labels to get even better predicted probabilities!

For details, see our examples notebook on Iterative use of Cleanlab to Improve Classification Models (and Consensus Labels) from Data Labeled by Multiple Annotators.

How does cleanlab.multiannotator work?#

All estimates above are produced via the CROWDLAB algorithm, described and benchmarked in this paper:

Utilizing supervised models to infer consensus labels and their quality from data with multiple annotators