Improve Consensus Labels for Multiannotator Data

This 5-minute quickstart tutorial shows how to use cleanlab for classification data that has been labeled by multiple annotators (where each example has been labeled by at least one annotator, but not every annotator has labeled every example). Compared to existing crowdsourcing tools, cleanlab helps you better analyze such data by leveraging a trained classifier model in addition to the raw annotations. With one line of code, you can automatically compute:

A consensus label for each example (i.e. truth inference) that aggregates the individual annotations (more accurately than algorithms from crowdsourcing like majority-vote, Dawid-Skene, or GLAD).
A quality score for each consensus label, which measures our confidence that this label is correct (via well-calibrated estimates that account for the number of annotators who labeled this example, the overall quality of each annotator, and the quality of our trained ML models).
An analogous label quality score for each individual label chosen by one annotator for a particular example.
An overall quality score for each annotator which measures our confidence in the overall correctness of labels obtained from this annotator.

Overview of what we’ll do in this tutorial:

Obtain initial consensus labels of multiannotator data using majority vote.
Train a classifier model on the initial consensus labels and use it to obtain out-of-sample predicted class probabilities.
Use cleanlab’s multiannotator.get_label_quality_multiannotator function to get improved consensus labels that more accurately reflect the ground truth.
View other information about your multiannotator dataset, such as consensus and annotator quality scores, agreement between annotators, detailed label quality scores and more!

Quickstart

Already have multiannotator_labels and (out-of-sample) pred_probs from a model trained on an existing set of consensus labels? Run the code below to get improved consensus labels and more information about the quality of your labels and annotators.

from cleanlab.multiannotator import get_label_quality_multiannotator

get_label_quality_multiannotator(multiannotator_labels, pred_probs)
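Before calling the function, it can help to confirm that the two inputs line up. The sketch below builds a tiny made-up dataset and checks the basic constraints described in this tutorial. `check_inputs`, `toy_labels` and `toy_pred_probs` are hypothetical names introduced only for illustration; they are not part of cleanlab.

```python
import numpy as np
import pandas as pd


def check_inputs(multiannotator_labels: pd.DataFrame, pred_probs: np.ndarray) -> None:
    """Hypothetical sanity checks for illustration (not a cleanlab function)."""
    num_examples, num_classes = pred_probs.shape

    # One row of annotations per row of predicted probabilities.
    if len(multiannotator_labels) != num_examples:
        raise ValueError("labels and pred_probs must have the same number of rows")

    # Each row of pred_probs should sum to 1 across the classes.
    if not np.allclose(pred_probs.sum(axis=1), 1.0):
        raise ValueError("each row of pred_probs should sum to 1")

    # Missing annotations become NaN in a float array.
    values = multiannotator_labels.to_numpy(dtype=float, na_value=np.nan)
    labeled = ~np.isnan(values)

    # Every example needs at least one annotation.
    if not labeled.any(axis=1).all():
        raise ValueError("every example needs at least one annotation")

    # Observed labels must be integer class indices in 0, ..., num_classes - 1.
    observed = values[labeled]
    if not np.all(observed == observed.astype(int)):
        raise ValueError("labels must be integers")
    if observed.min() < 0 or observed.max() > num_classes - 1:
        raise ValueError("labels must lie in 0, ..., num_classes - 1")


# Made-up data: 3 examples, 3 annotators, 3 classes.
toy_labels = pd.DataFrame(
    {
        "A0001": pd.array([0, None, 1], dtype="Int64"),
        "A0002": pd.array([0, 2, None], dtype="Int64"),
        "A0003": pd.array([None, 2, 1], dtype="Int64"),
    }
)
toy_pred_probs = np.array([[0.8, 0.1, 0.1], [0.1, 0.2, 0.7], [0.2, 0.7, 0.1]])

check_inputs(toy_labels, toy_pred_probs)  # raises ValueError if something is off
```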

1. Install and import required dependencies

You can use pip to install all packages required for this tutorial as follows:

!pip install scikit-learn
!pip install cleanlab

# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
#     !pip install git+https://github.com/cleanlab/cleanlab.git

Let’s import some of the packages needed throughout this tutorial.

[2]:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

from cleanlab.multiannotator import get_label_quality_multiannotator, get_majority_vote_label

2. Create the data (can skip these details)

For this tutorial we will generate a toy dataset that has 50 annotators and 300 examples. There are three possible classes, 0, 1 and 2.

Each annotator annotates approximately 10% of the examples. We also synthetically made the last 5 annotators in our toy dataset have much noisier labels than the rest of the annotators.

Solely for evaluating cleanlab’s consensus labels against other consensus methods, we also generate the true labels for this example dataset. However, true labels are not required for any cleanlab multiannotator functions (and they usually are not available in real applications). To generate our multiannotator data, we define a make_data() function.

Code for data generation:

from cleanlab.benchmarking.noise_generation import generate_noise_matrix_from_trace
from cleanlab.benchmarking.noise_generation import generate_noisy_labels

SEED = 111 # set to None for non-reproducible randomness
np.random.seed(seed=SEED)

def make_data(
    means=[[3, 2], [7, 7], [0, 8]],
    covs=[[[5, -1.5], [-1.5, 1]], [[1, 0.5], [0.5, 4]], [[5, 1], [1, 5]]],
    sizes=[150, 75, 75],
    num_annotators=50,
):

    m = len(means)  # number of classes
    n = sum(sizes)
    local_data = []
    labels = []

    for idx in range(m):
        local_data.append(
            np.random.multivariate_normal(mean=means[idx], cov=covs[idx], size=sizes[idx])
        )
        labels.append(np.array([idx for i in range(sizes[idx])]))
    X_train = np.vstack(local_data)
    true_labels_train = np.hstack(labels)

    # Compute p(true_label=k)
    py = np.bincount(true_labels_train) / float(len(true_labels_train))

    noise_matrix_better = generate_noise_matrix_from_trace(
        m,
        trace=0.8 * m,
        py=py,
        valid_noise_matrix=True,
        seed=SEED,
    )

    noise_matrix_worse = generate_noise_matrix_from_trace(
        m,
        trace=0.35 * m,
        py=py,
        valid_noise_matrix=True,
        seed=SEED,
    )

    # Generate our noisy labels using the noise_matrix for specified number of annotators.
    s = pd.DataFrame(
        np.vstack(
            [
                generate_noisy_labels(true_labels_train, noise_matrix_better)
                if i < num_annotators - 5
                else generate_noisy_labels(true_labels_train, noise_matrix_worse)
                for i in range(num_annotators)
            ]
        ).transpose()
    )

    # Each annotator only labels approximately 10% of the dataset
    # (unlabeled points represented with NaN)
    s = s.apply(lambda x: x.mask(np.random.random(n) < 0.9)).astype("Int64")
    s.dropna(axis=1, how="all", inplace=True)
    s.columns = ["A" + str(i).zfill(4) for i in range(1, num_annotators+1)]

    row_NA_check = pd.notna(s).any(axis=1)

    return {
        "X_train": X_train[row_NA_check],
        "true_labels_train": true_labels_train[row_NA_check],
        "multiannotator_labels": s[row_NA_check].reset_index(drop=True),
    }

[4]:

data_dict = make_data()

X = data_dict["X_train"]
multiannotator_labels = data_dict["multiannotator_labels"]
true_labels = data_dict["true_labels_train"] # used for comparing the accuracy of consensus labels

Let’s view the first few rows of the data used for this tutorial. Here are the labels selected by each annotator for the first few examples:

[5]:

multiannotator_labels.head()

[5]:

	A0001	A0002	A0003	A0004	A0005	A0006	A0007	A0008	A0009	A0010	...	A0041	A0042	A0043	A0044	A0045	A0046	A0047	A0048	A0049	A0050
0	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	...	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>
1	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	0	<NA>	<NA>	<NA>	...	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>
2	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	...	<NA>	0	<NA>	<NA>	<NA>	<NA>	<NA>	2	<NA>	<NA>
3	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	2	<NA>	<NA>	<NA>	...	0	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>
4	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	<NA>	...	<NA>	<NA>	<NA>	2	<NA>	<NA>	0	<NA>	<NA>	<NA>

5 rows × 50 columns

Here are the corresponding features for these examples:

[6]:

X[:5]

[6]:

array([[ 5.60856743,  1.41693214],
       [-0.40908785,  2.87147629],
       [ 4.64941785,  1.10774851],
       [ 3.0524466 ,  1.71853246],
       [ 4.37169848,  0.66031048]])

multiannotator_labels contains the class labels that each annotator chose for each example; examples that a particular annotator did not label are stored as missing values (shown as <NA> above). X contains the features for each example, which happen to be numeric in this tutorial, but any feature modality can be used with cleanlab.multiannotator.

Bringing Your Own Data (BYOD)?

You can easily replace the above with your own multiannotator labels and features, then continue with the rest of the tutorial.

multiannotator_labels should be a numpy array or pandas DataFrame with each column representing an annotator and each row representing an example. Your labels should be represented as integer indices 0, 1, …, num_classes - 1, where examples that are not annotated by a particular annotator are represented using np.nan or pd.NA. If you have string labels or other labels that do not fit the required format, you can convert them to the proper format using cleanlab.internal.multiannotator_utils.format_multiannotator_labels.

Your features can be represented however you like (since these are not inputs to cleanlab.multiannotator methods) as long as you are able to fit a classifier to them and obtain its predicted class probabilities!
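As a rough illustration of the required label format, the sketch below converts made-up string labels into integer indices by hand using only pandas. It is a hand-rolled stand-in, not the cleanlab utility mentioned above, and the names `raw`, `label_map` and `int_labels` are assumptions for this example.

```python
import pandas as pd

# Made-up string annotations; None marks examples an annotator did not label.
raw = pd.DataFrame(
    {
        "A0001": ["cat", None, "dog"],
        "A0002": ["cat", "bird", None],
        "A0003": [None, "bird", "dog"],
    }
)

# Collect the distinct class names and assign indices 0, 1, ..., num_classes - 1.
flat = raw.to_numpy().ravel()
classes = sorted({label for label in flat if pd.notna(label)})
label_map = {name: idx for idx, name in enumerate(classes)}

# Map every column to integer indices; missing annotations stay missing.
int_labels = raw.apply(lambda col: col.map(label_map)).astype("Int64")

print(label_map)
print(int_labels)
```

Keeping `label_map` around lets you translate consensus labels back to the original class names later.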

3. Get majority vote labels and compute out-of-sample predicted probabilities

Before training a machine learning model, we must first obtain consensus labels from the annotators that labeled the data. The simplest way to obtain an initial set of consensus labels is majority vote.

[7]:

majority_vote_label = get_majority_vote_label(multiannotator_labels)
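To make the idea of majority vote concrete, here is a simplified sketch on made-up data. It is not cleanlab's implementation: `simple_majority_vote` is a hypothetical helper that breaks ties by picking the smallest class index, whereas get_majority_vote_label may resolve ties differently.

```python
import numpy as np
import pandas as pd


def simple_majority_vote(labels: pd.DataFrame, num_classes: int) -> np.ndarray:
    """Hypothetical helper: most frequent label per row, ignoring missing values."""
    values = labels.to_numpy(dtype=float, na_value=np.nan)
    # counts[i, k] = number of annotators who chose class k for example i
    # (comparisons against NaN are False, so missing annotations are not counted).
    counts = np.stack([(values == k).sum(axis=1) for k in range(num_classes)], axis=1)
    # argmax returns the first maximum, so ties go to the smallest class index.
    return counts.argmax(axis=1)


toy_labels = pd.DataFrame(
    {
        "A0001": pd.array([0, 1, None], dtype="Int64"),
        "A0002": pd.array([0, 2, 2], dtype="Int64"),
        "A0003": pd.array([1, 2, None], dtype="Int64"),
    }
)

toy_majority = simple_majority_vote(toy_labels, num_classes=3)
print(toy_majority)  # [0 2 2]
```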

Next, we will train a model on the consensus labels obtained using majority vote to compute out-of-sample predicted probabilities. Here, we use a simple logistic regression model.

[8]:

model = LogisticRegression()

num_crossval_folds = 5
pred_probs = cross_val_predict(
    estimator=model, X=X, y=majority_vote_label, cv=num_crossval_folds, method="predict_proba"
)
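"Out-of-sample" means that each example's predicted probabilities come from a model that never saw that example during training. The sketch below spells this out with a manual K-fold loop in plain NumPy. The nearest-centroid "classifier", the toy data and all names here are assumptions made for illustration; the tutorial itself relies on cross_val_predict, which additionally stratifies the folds by class for classifiers.

```python
import numpy as np


def centroid_pred_probs(X_train, y_train, X_test, num_classes):
    """Toy classifier: softmax over negative squared distances to class centroids."""
    centroids = np.stack([X_train[y_train == k].mean(axis=0) for k in range(num_classes)])
    sq_dists = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -sq_dists
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
num_classes = 3

# Made-up 2D data: 20 points around each of three class centers.
centers = ([0, 0], [5, 5], [0, 5])
X_toy = np.vstack([rng.normal(loc=c, scale=1.0, size=(20, 2)) for c in centers])
y_toy = np.repeat(np.arange(num_classes), 20)

# Assign every example to exactly one of the folds at random.
num_folds = 5
fold_ids = rng.permutation(len(X_toy)) % num_folds

oos_pred_probs = np.zeros((len(X_toy), num_classes))
for fold in range(num_folds):
    held_out = fold_ids == fold
    # Train on all other folds, predict only for the held-out fold.
    oos_pred_probs[held_out] = centroid_pred_probs(
        X_toy[~held_out], y_toy[~held_out], X_toy[held_out], num_classes
    )
```

Predicted probabilities obtained this way are less prone to overfitting the (possibly noisy) training labels than probabilities predicted on the training data itself.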

4. Use cleanlab to get better consensus labels and other statistics

Using the annotators’ labels and the out-of-sample predicted probabilities from the model, cleanlab can help us obtain improved consensus labels for our data.

[9]:

results = get_label_quality_multiannotator(multiannotator_labels, pred_probs, verbose=False)

Here, we use the multiannotator.get_label_quality_multiannotator() function which returns a dictionary containing three items:

label_quality, a DataFrame that gives us the improved consensus labels, computed using information from each of the annotators and the model. It also contains the number of annotations, the annotator agreement and the consensus quality score for each example.

[10]:

results["label_quality"].head()

[10]:

	consensus_quality_score	annotator_agreement	num_annotations
0	0.736157	0.5	2
1	0.758385	1.0	3
2	0.783900	0.6	5
3	0.729889	0.6	5
4	0.824273	0.8	5

detailed_label_quality, which contains the label quality score for each label given by every annotator.

[11]:

results["detailed_label_quality"].head()

[11]:

	quality_annotator_A0001	quality_annotator_A0002	quality_annotator_A0003	quality_annotator_A0004	quality_annotator_A0005	quality_annotator_A0006	quality_annotator_A0007	quality_annotator_A0008	quality_annotator_A0009	quality_annotator_A0010	...	quality_annotator_A0041	quality_annotator_A0042	quality_annotator_A0043	quality_annotator_A0044	quality_annotator_A0045	quality_annotator_A0046	quality_annotator_A0047	quality_annotator_A0048	quality_annotator_A0049	quality_annotator_A0050
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	NaN	NaN	NaN	NaN	NaN	NaN	0.758385	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	0.7839	NaN	NaN	NaN	NaN	NaN	0.068067	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	0.203208	NaN	NaN	NaN	...	0.729889	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	0.119178	NaN	NaN	0.824273	NaN	NaN	NaN

5 rows × 50 columns

annotator_stats, which gives us the annotator quality score for each annotator, alongside other information such as the number of examples each annotator labeled, their agreement with the consensus labels and the class they perform worst at.

[12]:

results["annotator_stats"].head(10)

[12]:

	annotator_quality	agreement_with_consensus	worst_class	num_examples_labeled
A0050	0.245473	0.208333	2	24
A0047	0.295250	0.294118	2	34
A0049	0.324303	0.310345	1	29
A0046	0.355122	0.346154	1	26
A0048	0.439612	0.480000	2	25
A0031	0.523461	0.580645	2	31
A0034	0.534881	0.607143	2	28
A0021	0.606033	0.718750	1	32
A0015	0.609204	0.678571	2	28
A0011	0.621276	0.692308	1	26

The annotator_stats DataFrame is sorted by increasing annotator_quality, showing us the worst annotators first.

Notice that in the above table annotators with ids A0046 to A0050 have the worst annotator quality score, which is expected because we made the last 5 annotators systematically worse than the rest.
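The simpler statistics in these tables can be reproduced by hand, which may help when interpreting them. The sketch below assumes that annotator_agreement is the fraction of an example's annotators who chose the consensus label and that agreement_with_consensus is the fraction of an annotator's labels that match the consensus. The data and variable names are made up, and the quality scores themselves come from a more involved estimate that is not reproduced here.

```python
import numpy as np
import pandas as pd

# Made-up annotations (4 examples, 3 annotators) and made-up consensus labels.
toy_labels = pd.DataFrame(
    {
        "A0001": pd.array([0, 1, None, 2], dtype="Int64"),
        "A0002": pd.array([0, 2, 2, None], dtype="Int64"),
        "A0003": pd.array([1, 2, None, 2], dtype="Int64"),
    }
)
toy_consensus = np.array([0, 2, 2, 2])

values = toy_labels.to_numpy(dtype=float, na_value=np.nan)
labeled = ~np.isnan(values)  # which (example, annotator) pairs exist
agrees = labeled & (values == toy_consensus[:, None])  # label matches consensus

# Per-example statistics.
num_annotations = labeled.sum(axis=1)
annotator_agreement = agrees.sum(axis=1) / num_annotations

# Per-annotator statistics.
num_examples_labeled = labeled.sum(axis=0)
agreement_with_consensus = pd.Series(
    agrees.sum(axis=0) / num_examples_labeled, index=toy_labels.columns
)

print(num_annotations)      # [3 3 1 2]
print(annotator_agreement)
print(agreement_with_consensus)
```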

Comparing improved consensus labels

We can get the improved consensus labels from the label_quality DataFrame shown above.

[13]:

improved_consensus_label = results["label_quality"]["consensus_label"].values

Since our toy dataset is synthetically generated by adding noise to each annotator’s labels, we know the ground truth labels for each example. Hence we can compare the accuracy of the consensus labels obtained using majority vote, and the improved consensus labels obtained using cleanlab.

[14]:

majority_vote_accuracy = np.mean(true_labels == majority_vote_label)
cleanlab_label_accuracy = np.mean(true_labels == improved_consensus_label)

print(f"Accuracy of majority vote labels = {majority_vote_accuracy}")
print(f"Accuracy of cleanlab consensus labels = {cleanlab_label_accuracy}")

Accuracy of majority vote labels = 0.8581081081081081
Accuracy of cleanlab consensus labels = 0.9797297297297297

We can see that the accuracy of the consensus labels improved as a result of using cleanlab, which takes into account not only the annotators’ labels but also the predictions of a trained model.

Inspecting consensus quality scores to find potential consensus label errors

We can get the consensus quality score from the label_quality DataFrame shown above.

[15]:

consensus_quality_score = results["label_quality"]["consensus_quality_score"]

Besides obtaining improved consensus labels, cleanlab also computes a consensus quality score for each example. Lower scores indicate potential consensus label errors in the dataset.

Here, we extract the 15 examples that have the lowest consensus quality scores and measure the accuracy of their consensus labels against the true labels. We also compute the accuracy for the rest of the examples for comparison.

[16]:

sorted_consensus_quality_score = consensus_quality_score.sort_values()
worst_quality = sorted_consensus_quality_score.index[:15]
better_quality = sorted_consensus_quality_score.index[15:]

worst_quality_accuracy = np.mean(true_labels[worst_quality] == improved_consensus_label[worst_quality])
better_quality_accuracy = np.mean(true_labels[better_quality] == improved_consensus_label[better_quality])

print(f"Accuracy of 15 worst quality examples = {worst_quality_accuracy}")
print(f"Accuracy of better quality examples = {better_quality_accuracy}")

Accuracy of 15 worst quality examples = 0.8666666666666667
Accuracy of better quality examples = 0.9857651245551602

We observe that the 15 worst-consensus-quality-score examples have a lower average accuracy compared to the rest of the examples.
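In a real application there are no true labels to check against, so a common next step is to send the lowest-scoring examples for manual review. A minimal sketch of that selection, assuming a DataFrame shaped like label_quality; the values and the review budget `num_to_review` are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for results["label_quality"].
toy_quality = pd.DataFrame(
    {
        "consensus_label": [0, 2, 1, 0, 2, 1],
        "consensus_quality_score": [0.91, 0.35, 0.78, 0.12, 0.66, 0.49],
    }
)

num_to_review = 2  # hypothetical review budget

# Rows with the lowest scores are the most likely consensus label errors.
to_review = toy_quality.nsmallest(num_to_review, "consensus_quality_score")
print(to_review.index.tolist())  # [3, 1]
```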

5. Retrain model using improved consensus labels

After obtaining the improved consensus labels, we can now retrain a better version of our machine learning model using these newly obtained labels.

[17]:

model = LogisticRegression()

num_crossval_folds = 5
improved_pred_probs = cross_val_predict(
    estimator=model, X=X, y=improved_consensus_label, cv=num_crossval_folds, method="predict_proba"
)

# alternatively, we can treat all the improved consensus labels as training labels to fit the model
# model.fit(X, improved_consensus_label)

Further model improvements

You can also repeatedly iterate this process of getting better consensus labels using the model’s out-of-sample predicted probabilities and then retraining the model with the improved labels to get even better predicted probabilities! For details, see our examples notebook on Iterative use of Cleanlab to Improve Classification Models (and Consensus Labels) from Data Labeled by Multiple Annotators.
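The iteration described above can be sketched as a generic loop. This is only an outline under stated assumptions, not the procedure from the linked notebook: `iterate_consensus`, `fit_predict` and `improve_consensus` are hypothetical names, and the loop stops once the consensus labels no longer change or after a fixed number of rounds. In this tutorial's setting, `fit_predict` would wrap the cross_val_predict call and `improve_consensus` would wrap get_label_quality_multiannotator and return the consensus_label column.

```python
from typing import Callable

import numpy as np


def iterate_consensus(
    initial_labels: np.ndarray,
    fit_predict: Callable[[np.ndarray], np.ndarray],
    improve_consensus: Callable[[np.ndarray], np.ndarray],
    max_rounds: int = 5,
) -> np.ndarray:
    """Hypothetical helper: alternate model training and consensus estimation."""
    labels = initial_labels
    for _ in range(max_rounds):
        # Train on the current consensus labels, get out-of-sample pred_probs.
        pred_probs = fit_predict(labels)
        # Re-estimate consensus labels from the annotations plus pred_probs.
        new_labels = improve_consensus(pred_probs)
        if np.array_equal(new_labels, labels):
            break  # consensus labels have stopped changing
        labels = new_labels
    return labels


# Stand-in callables so the sketch runs on its own (not real models).
stub_pred_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
final_labels = iterate_consensus(
    initial_labels=np.array([0, 0, 0]),
    fit_predict=lambda labels: stub_pred_probs,
    improve_consensus=lambda pred_probs: pred_probs.argmax(axis=1),
)
print(final_labels)  # [0 1 0]
```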

If possible, the best way to improve your model is to collect additional labels for both previously annotated data and extra not-yet-labeled examples (i.e. active learning). To decide which data is most informative to label next, use cleanlab.multiannotator.get_active_learning_scores() rather than the methods from this tutorial. This is demonstrated in our examples notebook on Active Learning with Multiple Data Annotators via ActiveLab.

How does cleanlab.multiannotator work?

All estimates above are produced via the CROWDLAB algorithm, described in the paper below, which contains extensive benchmarks showing that CROWDLAB can produce better estimates than popular methods like Dawid-Skene and GLAD:

CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators