Estimate Consensus and Annotator Quality for Data Labeled by Multiple Annotators

This 5-minute quickstart tutorial shows how to use cleanlab for classification data that has been labeled by multiple annotators (where each example has been labeled by at least one annotator, but not every annotator has labeled every example). Compared to existing crowdsourcing tools, cleanlab helps you better analyze such data by leveraging a trained classifier model in addition to the raw annotations. With one line of code, you can automatically compute:

  • A consensus label for each example (i.e. truth inference) that aggregates the individual annotations (more accurately than algorithms from crowdsourcing like majority-vote, Dawid-Skene, or GLAD).

  • A quality score for each consensus label which measures our confidence that this label is correct (via well-calibrated estimates that account for the number of annotators who have labeled this example, the overall quality of each annotator, and the quality of our trained ML model).

  • An analogous label quality score for each individual label chosen by one annotator for a particular example (to measure our confidence in alternate labels when annotators differ from the consensus).

  • An overall quality score for each annotator which measures our confidence in the overall correctness of labels obtained from this annotator.

Overview of what we’ll do in this tutorial:

  • Obtain initial consensus labels of multiannotator data using majority vote.

  • Train a classifier model on the initial consensus labels and use it to obtain out-of-sample predicted class probabilities.

  • Use cleanlab’s multiannotator.get_label_quality_multiannotator function to get improved consensus labels that more accurately reflect the ground truth.

  • View other information about your multiannotator dataset, such as consensus and annotator quality scores, agreement between annotators, detailed label quality scores and more!

Consensus labels represent the best guess of the true label for each example and can be used for more reliable modeling/analytics. Cleanlab automatically produces enhanced estimates of consensus through the use of machine learning. Quality scores help us determine how much trust we can place in each: consensus label, individual annotator, and particular label from a particular annotator. These quality scores can help you determine which annotators are best/worst overall, as well as which current consensus labels are least trustworthy and should perhaps be verified via additional annotation.

This tutorial uses a toy tabular dataset labeled by multiple annotators, but these steps can easily be applied to image or text data.

Quickstart

Already have multiannotator_labels and (out-of-sample) pred_probs from a model trained on an existing set of consensus labels? Run the code below to get improved consensus labels and more information about the quality of your labels and annotators.

from cleanlab.multiannotator import get_label_quality_multiannotator

get_label_quality_multiannotator(multiannotator_labels, pred_probs)

1. Install and import required dependencies

You can use pip to install all packages required for this tutorial as follows:

!pip install cleanlab

# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
#     !pip install git+https://github.com/cleanlab/cleanlab.git

Let’s import some of the packages needed throughout this tutorial.

[2]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

from cleanlab.multiannotator import get_label_quality_multiannotator, get_majority_vote_label

2. Create the data (can skip these details)

For this tutorial we will generate a toy dataset that has 50 annotators and 300 examples. There are three possible classes, 0, 1 and 2.

Each annotator annotates approximately 10% of the examples. We also synthetically made the last 5 annotators in our toy dataset have much noisier labels than the rest of the annotators.

Solely for evaluating cleanlab’s consensus labels against other consensus methods, we also generate the true labels for this example dataset. However, true labels are not required for any cleanlab multiannotator functions (and they usually are not available in real applications). To generate our multiannotator data, we define a make_data() function (can skip these details).

Code for data generation:


from cleanlab.benchmarking.noise_generation import generate_noise_matrix_from_trace
from cleanlab.benchmarking.noise_generation import generate_noisy_labels

SEED = 111 # set to None for non-reproducible randomness
np.random.seed(seed=SEED)

def make_data(
    means=[[3, 2], [7, 7], [0, 8]],
    covs=[[[5, -1.5], [-1.5, 1]], [[1, 0.5], [0.5, 4]], [[5, 1], [1, 5]]],
    sizes=[150, 75, 75],
    num_annotators=50,
):

    m = len(means)  # number of classes
    n = sum(sizes)
    local_data = []
    labels = []

    for idx in range(m):
        local_data.append(
            np.random.multivariate_normal(mean=means[idx], cov=covs[idx], size=sizes[idx])
        )
        labels.append(np.array([idx for i in range(sizes[idx])]))
    X_train = np.vstack(local_data)
    true_labels_train = np.hstack(labels)

    # Compute p(true_label=k)
    py = np.bincount(true_labels_train) / float(len(true_labels_train))

    noise_matrix_better = generate_noise_matrix_from_trace(
        m,
        trace=0.8 * m,
        py=py,
        valid_noise_matrix=True,
        seed=SEED,
    )

    noise_matrix_worse = generate_noise_matrix_from_trace(
        m,
        trace=0.35 * m,
        py=py,
        valid_noise_matrix=True,
        seed=SEED,
    )

    # Generate our noisy labels using the noise_matrix for specified number of annotators.
    s = pd.DataFrame(
        np.vstack(
            [
                generate_noisy_labels(true_labels_train, noise_matrix_better)
                if i < num_annotators - 5
                else generate_noisy_labels(true_labels_train, noise_matrix_worse)
                for i in range(num_annotators)
            ]
        ).transpose()
    )

    # Each annotator only labels approximately 10% of the dataset
    # (unlabeled points represented with NaN)
    s = s.apply(lambda x: x.mask(np.random.random(n) < 0.9)).astype("Int64")
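    # Name the annotator columns first, then drop any annotator who labeled nothing
    # (assigning names after dropping would fail if a column had been removed).
    s.columns = ["A" + str(i).zfill(4) for i in range(1, num_annotators + 1)]
    s.dropna(axis=1, how="all", inplace=True)

    # Keep only the examples that received at least one annotation
    row_NA_check = pd.notna(s).any(axis=1)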

    return {
        "X_train": X_train[row_NA_check],
        "true_labels_train": true_labels_train[row_NA_check],
        "multiannotator_labels": s[row_NA_check].reset_index(drop=True),
    }
[4]:
data_dict = make_data()

X = data_dict["X_train"]
multiannotator_labels = data_dict["multiannotator_labels"]
true_labels = data_dict["true_labels_train"] # used for comparing the accuracy of consensus labels

Let’s view the first few rows of the data used for this tutorial. Here are the labels selected by each annotator for the first few examples (rows) in the dataset:

[5]:
multiannotator_labels.head()
[5]:
A0001 A0002 A0003 A0004 A0005 A0006 A0007 A0008 A0009 A0010 ... A0041 A0042 A0043 A0044 A0045 A0046 A0047 A0048 A0049 A0050
0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
1 <NA> <NA> <NA> <NA> <NA> <NA> 0 <NA> <NA> <NA> ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> ... <NA> 0 <NA> <NA> <NA> <NA> <NA> 2 <NA> <NA>
3 <NA> <NA> <NA> <NA> <NA> <NA> 2 <NA> <NA> <NA> ... 0 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> ... <NA> <NA> <NA> 2 <NA> <NA> 0 <NA> <NA> <NA>

5 rows × 50 columns

Here are the corresponding features for these examples:

[6]:
X[:5]
[6]:
array([[ 5.60856743,  1.41693214],
       [-0.40908785,  2.87147629],
       [ 4.64941785,  1.10774851],
       [ 3.0524466 ,  1.71853246],
       [ 4.37169848,  0.66031048]])

multiannotator_labels contains the class label that each annotator chose for each example in the dataset, with examples that a particular annotator did not label represented as missing values (pd.NA in this DataFrame, displayed as <NA> above). X contains the features for each example, which happen to be numeric in this tutorial, but any feature modality can be used with cleanlab.multiannotator.
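A quick way to confirm this structure is to count the non-missing entries in each row. The sketch below is illustrative only: toy_labels and summarize_annotations are hypothetical names that are not part of cleanlab, and the code assumes the labels live in a pandas DataFrame with missing annotations stored as pd.NA or np.nan.

```python
import pandas as pd

# Hypothetical toy table: 4 examples (rows) labeled by 3 annotators (columns).
toy_labels = pd.DataFrame(
    {
        "A1": pd.array([0, 1, None, 2], dtype="Int64"),
        "A2": pd.array([0, None, 2, 2], dtype="Int64"),
        "A3": pd.array([1, 1, 2, None], dtype="Int64"),
    }
)


def summarize_annotations(labels: pd.DataFrame, num_classes: int) -> pd.Series:
    """Check the label format and return the number of annotations per example."""
    annotations_per_example = labels.notna().sum(axis=1)
    if (annotations_per_example == 0).any():
        raise ValueError("every example needs at least one annotation")
    # min() and max() skip missing values by default.
    smallest, largest = labels.min().min(), labels.max().max()
    if smallest < 0 or largest >= num_classes:
        raise ValueError("labels must be integers in 0, ..., num_classes - 1")
    return annotations_per_example


annotations_per_example = summarize_annotations(toy_labels, num_classes=3)
print(annotations_per_example.tolist())
```

On the tutorial data, the same helper could be called as summarize_annotations(multiannotator_labels, num_classes=3).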

Bringing Your Own Data (BYOD)?

You can easily replace the above with your own multiannotator labels and features, then continue with the rest of the tutorial.

multiannotator_labels should be a numpy array or pandas DataFrame with each column representing an annotator and each row representing an example. Your labels should be represented as integer indices 0, 1, …, num_classes - 1, where examples that are not annotated by a particular annotator are represented using np.nan or pd.NA. If you have string labels or other labels that do not fit the required format, you can convert them to the proper format using cleanlab.internal.multiannotator_utils.format_multiannotator_labels.
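If your annotations are stored as one record per (example, annotator) pair with string class names, a pandas-only reshaping along the following lines may be enough to reach the required layout. This is a sketch under assumptions: the column names example_id, annotator, and label are hypothetical, and the alphabetical class ordering is just one possible choice.

```python
import pandas as pd

# Hypothetical long-format records: one row per (example, annotator) pair.
records = pd.DataFrame(
    {
        "example_id": [0, 0, 1, 2, 2, 2],
        "annotator": ["ann_a", "ann_b", "ann_a", "ann_b", "ann_c", "ann_a"],
        "label": ["cat", "dog", "dog", "bird", "bird", "cat"],
    }
)

# Map each string class name to an integer index 0, 1, ..., num_classes - 1
# (alphabetical order here; any fixed order works as long as it is reused later).
class_names = sorted(records["label"].unique())
class_to_index = {name: idx for idx, name in enumerate(class_names)}
records["label_index"] = records["label"].map(class_to_index)

# One row per example, one column per annotator.
# Pairs with no annotation become NaN, then pd.NA after the nullable-integer cast.
wide_labels = records.pivot(
    index="example_id", columns="annotator", values="label_index"
).astype("Int64")

print(wide_labels)
```

Keep class_to_index around so that the columns of your model's predicted probabilities can be ordered to match the same integer indices.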

Your features can be represented however you like (since these are not inputs to cleanlab.multiannotator methods) as long as you are able to fit a classifier to them and obtain its predicted class probabilities!

3. Get initial consensus labels via majority vote and compute out-of-sample predicted probabilities

Before training a machine learning model, we must first obtain initial consensus labels from the data annotations, representing a crude guess of the best label for each example. The most straightforward way to obtain an initial set of consensus labels is via simple majority vote.

[7]:
majority_vote_label = get_majority_vote_label(multiannotator_labels)
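To make the idea concrete, here is a minimal sketch of plain majority vote on a small hypothetical table. It is not cleanlab's implementation: simple_majority_vote is an illustrative helper, it assumes every example has at least one annotation, and it breaks ties toward the smallest class index, which may differ from how get_majority_vote_label handles ties.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: 4 examples (rows) labeled by 3 annotators (columns).
toy_labels = pd.DataFrame(
    {
        "A1": pd.array([0, 1, None, 2], dtype="Int64"),
        "A2": pd.array([0, None, 2, 2], dtype="Int64"),
        "A3": pd.array([1, 1, 2, None], dtype="Int64"),
    }
)


def simple_majority_vote(labels: pd.DataFrame) -> np.ndarray:
    """Return the most frequent observed label for each example (row)."""
    votes = []
    for _, row in labels.iterrows():
        observed = row.dropna().astype(int).to_numpy()
        # bincount tallies votes per class; argmax returns the first
        # (smallest) class index among those with the highest count.
        votes.append(int(np.bincount(observed).argmax()))
    return np.array(votes)


toy_consensus = simple_majority_vote(toy_labels)
print(toy_consensus)
```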

Majority vote consensus labels may not be very reliable, particularly for examples that were only labeled by one or a few annotators. To more reliably estimate consensus, we can account for the features associated with each example (from which the annotations were derived in the first place). Fitting a classifier model is a natural way to account for these feature values. Here we train a simple logistic regression model to get significantly more accurate estimates of consensus labels and associated quality scores.

We fit the model with our initial consensus labels, and then get (out-of-sample) predicted class probabilities for each example in the dataset from the trained model. These predicted probabilities help us estimate the best consensus labels and associated confidence values in a statistically optimal manner that accounts for all the available information.

[8]:
model = LogisticRegression()

num_crossval_folds = 5
pred_probs = cross_val_predict(
    estimator=model, X=X, y=majority_vote_label, cv=num_crossval_folds, method="predict_proba"
)
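Before passing predicted probabilities on, it can be worth confirming that they have one row per example and one column per class, and that each row is a valid probability distribution. The helper below is a hypothetical sketch (check_pred_probs is not a cleanlab or scikit-learn function) demonstrated on a toy array.

```python
import numpy as np


def check_pred_probs(pred_probs, num_examples: int, num_classes: int) -> None:
    """Raise ValueError unless pred_probs looks like an (N, K) matrix of class probabilities."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    if pred_probs.shape != (num_examples, num_classes):
        raise ValueError(
            f"expected shape {(num_examples, num_classes)}, got {pred_probs.shape}"
        )
    if np.any(pred_probs < 0) or np.any(pred_probs > 1):
        raise ValueError("probabilities must lie in [0, 1]")
    if not np.allclose(pred_probs.sum(axis=1), 1.0):
        raise ValueError("each row must sum to 1")


# Toy example: 2 examples, 3 classes.
toy_pred_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
check_pred_probs(toy_pred_probs, num_examples=2, num_classes=3)
```

In this tutorial, the equivalent call would be check_pred_probs(pred_probs, num_examples=len(multiannotator_labels), num_classes=3). A shape mismatch can occur, for instance, if one class never appears among the initial consensus labels used for training.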

4. Use cleanlab to get better consensus labels and other statistics

Using the annotators’ labels and the (out-of-sample) predicted class probabilities from the model, cleanlab can estimate improved consensus labels for our data that are more accurate than our initial consensus labels were.

Having accurate labels provides insight on each annotator’s label quality and is key for boosting model accuracy and achieving dependable real-world results.

[9]:
results = get_label_quality_multiannotator(multiannotator_labels, pred_probs, verbose=False)

Here, we use the multiannotator.get_label_quality_multiannotator() function which returns a dictionary containing three items:

  1. label_quality, which gives us the improved consensus labels using information from each of the annotators and the model. This DataFrame also contains the number of annotations, annotator agreement, and consensus quality score for each example.

[10]:
results["label_quality"].head()
[10]:
   consensus_label  consensus_quality_score  annotator_agreement  num_annotations
0                0                 0.736118                  0.5                2
1                0                 0.757751                  1.0                3
2                0                 0.782232                  0.6                5
3                0                 0.715565                  0.6                5
4                0                 0.824256                  0.8                5

  2. detailed_label_quality, which gives the label quality score for each label given by every annotator.

[11]:
results["detailed_label_quality"].head()
[11]:
quality_annotator_A0001 quality_annotator_A0002 quality_annotator_A0003 quality_annotator_A0004 quality_annotator_A0005 quality_annotator_A0006 quality_annotator_A0007 quality_annotator_A0008 quality_annotator_A0009 quality_annotator_A0010 ... quality_annotator_A0041 quality_annotator_A0042 quality_annotator_A0043 quality_annotator_A0044 quality_annotator_A0045 quality_annotator_A0046 quality_annotator_A0047 quality_annotator_A0048 quality_annotator_A0049 quality_annotator_A0050
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN 0.757751 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN 0.782232 NaN NaN NaN NaN NaN 0.070564 NaN NaN
3 NaN NaN NaN NaN NaN NaN 0.216078 NaN NaN NaN ... 0.715565 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 0.119188 NaN NaN 0.824256 NaN NaN NaN

5 rows × 50 columns

  3. annotator_stats, which gives us the annotator quality score for each annotator, alongside other information such as the number of examples each annotator labeled, their agreement with the consensus labels, and the class they perform worst at.

[12]:
results["annotator_stats"].head(10)
[12]:
       annotator_quality  agreement_with_consensus  worst_class  num_examples_labeled
A0050           0.244981                  0.208333            2                    24
A0047           0.295979                  0.294118            2                    34
A0049           0.324197                  0.310345            1                    29
A0046           0.355316                  0.346154            1                    26
A0048           0.439732                  0.480000            2                    25
A0031           0.523205                  0.580645            2                    31
A0034           0.535313                  0.607143            2                    28
A0021           0.606999                  0.718750            1                    32
A0015           0.609526                  0.678571            2                    28
A0011           0.621103                  0.692308            1                    26

The annotator_stats DataFrame is sorted by increasing annotator_quality, showing us the worst annotators first.

Notice that in the above table annotators with ids A0046 to A0050 have the worst annotator quality score, which is expected because we made the last 5 annotators systematically worse than the rest.
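For intuition about the agreement_with_consensus column, the fraction of an annotator's labels that match the consensus can be computed by hand. The sketch below does this on a hypothetical toy table; the helper name and toy data are illustrative, and it does not reproduce annotator_quality, which cleanlab estimates with the help of the model as well.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: 4 examples (rows) labeled by 3 annotators (columns).
toy_labels = pd.DataFrame(
    {
        "A1": pd.array([0, 1, None, 2], dtype="Int64"),
        "A2": pd.array([0, None, 2, 2], dtype="Int64"),
        "A3": pd.array([1, 1, 2, None], dtype="Int64"),
    }
)
# Hypothetical consensus labels for the same 4 examples.
toy_consensus = np.array([0, 1, 2, 2])


def agreement_with_consensus(labels: pd.DataFrame, consensus: np.ndarray) -> pd.Series:
    """Fraction of each annotator's labels that equal the consensus label."""
    consensus_series = pd.Series(consensus, index=labels.index)
    agreement = {}
    for annotator in labels.columns:
        annotated = labels[annotator].notna()  # only examples this annotator labeled
        given = labels.loc[annotated, annotator].astype(int)
        agreement[annotator] = float((given == consensus_series[annotated]).mean())
    # Sort ascending so the annotators who disagree most come first.
    return pd.Series(agreement).sort_values()


toy_agreement = agreement_with_consensus(toy_labels, toy_consensus)
print(toy_agreement)
```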

Comparing improved consensus labels

We can get the improved consensus labels from the label_quality DataFrame shown above.

[13]:
improved_consensus_label = results["label_quality"]["consensus_label"].values

Since our toy dataset is synthetically generated by adding noise to each annotator’s labels, we know the ground truth labels for each example. Hence we can compare the accuracy of the consensus labels obtained using majority vote, and the improved consensus labels obtained using cleanlab.

[14]:
majority_vote_accuracy = np.mean(true_labels == majority_vote_label)
cleanlab_label_accuracy = np.mean(true_labels == improved_consensus_label)

print(f"Accuracy of majority vote labels = {majority_vote_accuracy}")
print(f"Accuracy of cleanlab consensus labels = {cleanlab_label_accuracy}")
Accuracy of majority vote labels = 0.8581081081081081
Accuracy of cleanlab consensus labels = 0.9797297297297297

We can see that the accuracy of the consensus labels improved as a result of using cleanlab, which computes better consensus labels by taking into account not only the annotators’ labels but also a trained model.

Inspecting consensus quality scores to find potential consensus label errors

We can get the consensus quality score from the label_quality DataFrame shown above.

[15]:
consensus_quality_score = results["label_quality"]["consensus_quality_score"]

Besides obtaining improved consensus labels, cleanlab also computes consensus quality scores for each example. The lower scores represent potential consensus label errors in the dataset.

Here, we extract the 15 examples with the lowest consensus quality scores and measure how accurate their consensus labels are relative to the true labels. We also compute the accuracy for the rest of the examples for comparison.

[16]:
sorted_consensus_quality_score = consensus_quality_score.sort_values()
worst_quality = sorted_consensus_quality_score.index[:15]
better_quality = sorted_consensus_quality_score.index[15:]

worst_quality_accuracy = np.mean(true_labels[worst_quality] == improved_consensus_label[worst_quality])
better_quality_accuracy = np.mean(true_labels[better_quality] == improved_consensus_label[better_quality])

print(f"Accuracy of 15 worst quality examples = {worst_quality_accuracy}")
print(f"Accuracy of better quality examples = {better_quality_accuracy}")
Accuracy of 15 worst quality examples = 0.8
Accuracy of better quality examples = 0.9893238434163701

We observe that the 15 examples with the lowest consensus quality scores have lower average accuracy than the rest of the examples. Cleanlab automatically determines which consensus labels are least trustworthy (you may want to have another annotator look at that data). Here we see these trustworthiness estimates really do correspond to the true quality of the consensus labels (which we know in this toy dataset because we have the true labels, unlike in your applications).
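In a real application without true labels, the same ranking can be used to pick examples for additional annotation. The sketch below rebuilds the first five rows of the label_quality table shown earlier as a stand-in DataFrame and selects the rows with the lowest scores; the review budget num_to_review is a hypothetical value you would choose yourself.

```python
import pandas as pd

# Stand-in for results["label_quality"], using the five rows displayed earlier.
label_quality = pd.DataFrame(
    {
        "consensus_label": [0, 0, 0, 0, 0],
        "consensus_quality_score": [0.736118, 0.757751, 0.782232, 0.715565, 0.824256],
        "annotator_agreement": [0.5, 1.0, 0.6, 0.6, 0.8],
        "num_annotations": [2, 3, 5, 5, 5],
    }
)

num_to_review = 2  # hypothetical annotation budget

# Rows with the lowest consensus quality scores, least trustworthy first.
to_review = label_quality.nsmallest(num_to_review, "consensus_quality_score")
print(to_review.index.tolist())
```

The resulting index values identify which examples to send back for another annotation.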

5. Retrain model using improved consensus labels

After obtaining the improved consensus labels, we can now retrain a better version of our machine learning model using these newly obtained labels.

[17]:
model = LogisticRegression()

num_crossval_folds = 5
improved_pred_probs = cross_val_predict(
    estimator=model, X=X, y=improved_consensus_label, cv=num_crossval_folds, method="predict_proba"
)

# alternatively, we can treat all the improved consensus labels as training labels to fit the model
# model.fit(X, improved_consensus_label)

Further improvements

You can also repeat this process of getting better consensus labels using the model’s out-of-sample predicted probabilities and then retraining the model with the improved labels to get even better predicted class probabilities in a virtuous cycle! For details, see our examples notebook on Iterative use of Cleanlab to Improve Classification Models (and Consensus Labels) from Data Labeled by Multiple Annotators.

If possible, the best way to improve your model is to collect additional labels for both previously annotated data and extra not-yet-labeled examples (i.e. active learning). To decide which data is most informative to label next, use cleanlab.multiannotator.get_active_learning_scores() rather than the methods from this tutorial. This is demonstrated in our examples notebook on Active Learning with Multiple Data Annotators via ActiveLab.

While this notebook focused on analyzing the labels of your data, cleanlab can also check your data features for various issues. Learn how to do this by following our Datalab tutorials, except you do not need to pass in labels now that you’ve already analyzed them with this notebook (or you can provide labels to Datalab as the consensus labels estimated here).

How does cleanlab.multiannotator work?

All estimates above are produced via the CROWDLAB algorithm, described in the paper below, which contains extensive benchmarks showing that CROWDLAB can produce better estimates than popular methods like Dawid-Skene and GLAD:

CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators