multiannotator

Methods for analysis of classification data labeled by multiple annotators.

To analyze a fixed dataset labeled by multiple annotators, use the get_label_quality_multiannotator function which estimates:

  • A consensus label for each example that aggregates the individual annotations more accurately than alternative aggregation via majority-vote or other algorithms used in crowdsourcing like Dawid-Skene.

  • A quality score for each consensus label which measures our confidence that this label is correct.

  • An analogous label quality score for each individual label chosen by one annotator for a particular example.

  • An overall quality score for each annotator which measures our confidence in the overall correctness of labels obtained from this annotator.

The underlying algorithms used to compute the statistics are described in the CROWDLAB paper.

If you have some labeled and unlabeled data (with multiple annotators for some labeled examples) and want to decide what data to collect additional labels for, use the get_active_learning_scores function, which is intended for active learning. This function estimates an active learning quality score for each example, which can be used to prioritize the examples for which additional labels would be most informative. It is effective in settings where some examples have been labeled by one or more annotators while other examples have no labels at all so far, as well as in settings where new labels are collected either in batches of examples or one at a time. An example notebook showcases the use of this function over multiple active learning rounds.

Each of the main functions in this module works with any trained classifier model. Variants of these functions are provided for settings where you have trained an ensemble of multiple models.

Functions:

get_label_quality_multiannotator(...[, ...])

Returns label quality scores for each example and for each annotator.

get_label_quality_multiannotator_ensemble(...)

Returns label quality scores for each example and for each annotator, based on predictions from an ensemble of models.

get_active_learning_scores(...[, ...])

Returns an active learning quality score for each example in the dataset.

get_active_learning_scores_ensemble(...[, ...])

Returns an active learning quality score for each example in the dataset, based on predictions from an ensemble of models.

get_majority_vote_label(labels_multiannotator)

Returns the majority vote label for each example, aggregated from the labels given by multiple annotators.

convert_long_to_wide_dataset(...)

Converts a long format dataset to wide format which is suitable for passing into get_label_quality_multiannotator.

cleanlab.multiannotator.get_label_quality_multiannotator(labels_multiannotator, pred_probs, *, consensus_method='best_quality', quality_method='crowdlab', calibrate_probs=False, return_detailed_quality=True, return_annotator_stats=True, return_weights=False, verbose=True, label_quality_score_kwargs={})

Returns label quality scores for each example and for each annotator.

This function is for multiclass classification datasets where examples have been labeled by multiple annotators (not necessarily the same number of annotators per example).

It computes one consensus label for each example that best accounts for the labels chosen by each annotator (and their quality), as well as a consensus quality score for how confident we are that this consensus label is actually correct. It also computes similar quality scores for each annotator’s individual labels, and the quality of each annotator. Scores are between 0 and 1; lower scores indicate labels/annotators less likely to be correct.

To decide what data to collect additional labels for, try the get_active_learning_scores function, which is intended for active learning with multiple annotators.

Parameters:
  • labels_multiannotator (pd.DataFrame or np.ndarray) –

    2D pandas DataFrame or array of multiple given labels for each example with shape (N, M), where N is the number of examples and M is the number of annotators. labels_multiannotator[n][m] = label for n-th example given by m-th annotator.

    For a dataset with K classes, each given label must be an integer in 0, 1, …, K-1 or NaN if this annotator did not label a particular example. If you have string or other differently formatted labels, you can convert them to the proper format using format_multiannotator_labels. If pd.DataFrame, column names should correspond to each annotator’s ID.

  • pred_probs (np.ndarray) – An array of shape (N, K) of predicted class probabilities from a trained classifier model, in the same format expected by get_label_quality_scores.

  • consensus_method (str or List[str], default = "best_quality") –

    Specifies the method used to aggregate labels from multiple annotators into a single consensus label. Options include:

    • majority_vote: consensus obtained using a simple majority vote among annotators, with ties broken via pred_probs.

    • best_quality: consensus obtained by selecting the label with highest label quality (quality determined by method specified in quality_method).

    A List may be passed if you want to consider multiple methods for producing consensus labels. If a List is passed, then the 0th element of the list is the method used to produce the columns consensus_label, consensus_quality_score, and annotator_agreement in the returned DataFrame. The remaining (1st, 2nd, 3rd, etc.) elements of this list are output as extra columns in the returned pandas DataFrame with names formatted as consensus_label_SUFFIX and consensus_quality_score_SUFFIX, where SUFFIX is the corresponding element of this list, which must be a valid method for computing consensus labels.

  • quality_method (str, default = "crowdlab") –

    Specifies the method used to calculate the quality of the consensus label. Options include:

    • crowdlab: an ensemble method that weighs both the annotators’ labels and the model’s predictions.

    • agreement: the fraction of annotators that agree with the consensus label.

  • calibrate_probs (bool, default = False) – Boolean value that specifies whether the provided pred_probs should be re-calibrated to better match the annotators’ empirical label distribution. We recommend setting this to True in active learning applications, in order to prevent overconfident models from suggesting the wrong examples to collect labels for.

  • return_detailed_quality (bool, default = True) – Boolean to specify if detailed_label_quality is returned.

  • return_annotator_stats (bool, default = True) – Boolean to specify if annotator_stats is returned.

  • return_weights (bool, default = False) – Boolean to specify if model_weight and annotator_weight are returned. Model and annotator weights are only applicable for quality_method == crowdlab; they are None for any other quality method.

  • verbose (bool, default = True) – Important warnings and other printed statements may be suppressed if verbose is set to False.

  • label_quality_score_kwargs (dict, optional) – Keyword arguments to pass into get_label_quality_scores.
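The input format described above can be sanity-checked before calling the function. The following is a minimal plain-Python sketch, not part of cleanlab: the helper name check_multiannotator_inputs is hypothetical, missing labels are encoded as math.nan, and the requirement that each row of pred_probs sums to 1 is an assumption about what "predicted class probabilities" means here. In practice the same data would usually live in a pandas DataFrame and a NumPy array.

```python
import math


def check_multiannotator_inputs(labels, pred_probs, tol=1e-6):
    """Validate a wide (N, M) label table against (N, K) predicted probabilities.

    Hypothetical helper, not part of cleanlab. Missing labels are math.nan.
    Returns (N, M, K) if the inputs are consistent, otherwise raises ValueError.
    """
    n = len(labels)
    if n == 0 or len(pred_probs) != n:
        raise ValueError("labels and pred_probs need the same, nonzero number of rows")
    m = len(labels[0])
    k = len(pred_probs[0])
    for i, (row, probs) in enumerate(zip(labels, pred_probs)):
        if len(row) != m or len(probs) != k:
            raise ValueError(f"row {i} has an inconsistent length")
        if abs(sum(probs) - 1.0) > tol:
            raise ValueError(f"pred_probs row {i} does not sum to 1")
        for label in row:
            if isinstance(label, float) and math.isnan(label):
                continue  # this annotator did not label this example
            if label != int(label) or not 0 <= label < k:
                raise ValueError(f"row {i} has label {label!r} outside 0..{k - 1}")
    return n, m, k


nan = math.nan
# 3 examples, 3 annotators, 3 classes; NaN marks a missing annotation.
labels = [
    [0, 0, nan],
    [1, nan, 1],
    [nan, 2, 0],
]
pred_probs = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.2, 0.5],
]
shape = check_multiannotator_inputs(labels, pred_probs)
print(shape)  # (N, M, K)
```

A check like this catches the most common formatting mistakes (string labels, labels outside 0, 1, …, K-1, mismatched row counts) early, before any quality scores are computed.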

Return type:

Dict[str, Any]

Returns:

labels_info (dict) – Dictionary containing up to 5 entries, with keys as below:

label_quality (pandas.DataFrame)

pandas DataFrame in which each row corresponds to one example, with columns:

  • num_annotations: the number of annotators that have labeled each example.

  • consensus_label: the single label that is best for each example (you can control how it is derived from all annotators’ labels via the argument: consensus_method).

  • annotator_agreement: the fraction of annotators that agree with the consensus label (only consider the annotators that labeled that particular example).

  • consensus_quality_score: label quality score for consensus label, calculated by the method specified in quality_method.

detailed_label_quality (pandas.DataFrame)

Only returned if return_detailed_quality=True. Returns a pandas DataFrame with columns quality_annotator_1, quality_annotator_2, …, quality_annotator_M where each entry is the label quality score for the label provided by that annotator (NaN for examples which this annotator did not label).

annotator_stats (pandas.DataFrame)

Only returned if return_annotator_stats=True. Returns overall statistics about each annotator, sorted by annotator_quality in ascending order (lowest first). This is a pandas DataFrame in which each row corresponds to one annotator (the row IDs correspond to annotator IDs), with columns:

  • annotator_quality: overall quality of a given annotator’s labels, calculated by the method specified in quality_method.

  • num_examples_labeled: number of examples annotated by a given annotator.

  • agreement_with_consensus: fraction of examples where a given annotator agrees with the consensus label.

  • worst_class: the class that is most frequently mislabeled by a given annotator.

model_weight (float)

Only returned if return_weights=True. It is only applicable for quality_method == crowdlab. The model weight specifies the weight of the classifier model in weighted averages used to estimate label quality. This number is an estimate of how trustworthy the model is relative to the annotators.

annotator_weight (np.ndarray)

Only returned if return_weights=True. It is only applicable for quality_method == crowdlab. An array of shape (M,) where M is the number of annotators, specifying the weight of each annotator in weighted averages used to estimate label quality. These weights are estimates of how trustworthy each annotator is relative to the other annotators.
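The agreement-based statistics described above (num_annotations, annotator_agreement, num_examples_labeled, agreement_with_consensus) are simple fractions, and a small worked example may make them concrete. This is a plain-Python sketch under stated assumptions, not cleanlab's implementation: the helper names are hypothetical, the consensus labels are taken as given, missing labels are math.nan, and every example is assumed to have at least one annotation.

```python
import math


def is_missing(label):
    """True if this annotator gave no label (encoded as NaN)."""
    return isinstance(label, float) and math.isnan(label)


def example_agreement(labels, consensus):
    """Per-example (num_annotations, annotator_agreement) pairs.

    Only annotators who labeled the example are counted.
    Assumes every example has at least one label.
    """
    out = []
    for row, c in zip(labels, consensus):
        given = [l for l in row if not is_missing(l)]
        agree = sum(1 for l in given if l == c)
        out.append((len(given), agree / len(given)))
    return out


def annotator_agreement(labels, consensus):
    """Per-annotator (num_examples_labeled, agreement_with_consensus) pairs."""
    out = []
    for j in range(len(labels[0])):
        pairs = [(row[j], c) for row, c in zip(labels, consensus) if not is_missing(row[j])]
        agree = sum(1 for l, c in pairs if l == c)
        out.append((len(pairs), agree / len(pairs) if pairs else math.nan))
    return out


nan = math.nan
labels = [
    [0, 0, nan],
    [1, nan, 1],
    [nan, 2, 0],
    [1, 1, 0],
]
consensus = [0, 1, 0, 1]  # assumed to be already computed

per_example = example_agreement(labels, consensus)
per_annotator = annotator_agreement(labels, consensus)
print(per_example)
print(per_annotator)
```

In this toy data the third example has two annotations, one of which matches the consensus, so its agreement is 0.5. The first annotator labeled three examples and matched the consensus on all of them.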

cleanlab.multiannotator.get_label_quality_multiannotator_ensemble(labels_multiannotator, pred_probs, *, calibrate_probs=False, return_detailed_quality=True, return_annotator_stats=True, return_weights=False, verbose=True, label_quality_score_kwargs={})

Returns label quality scores for each example and for each annotator, based on predictions from an ensemble of models.

This function is similar to get_label_quality_multiannotator but for settings where you have trained an ensemble of multiple classifier models rather than a single model.

Parameters:
  • labels_multiannotator (pd.DataFrame or np.ndarray) – Multiannotator labels in the same format expected by get_label_quality_multiannotator.

  • pred_probs (np.ndarray) – An array of shape (P, N, K) where P is the number of models, consisting of predicted class probabilities from the ensemble models. Each set of predicted probabilities with shape (N, K) is in the same format expected by the get_label_quality_scores.

  • calibrate_probs (bool, default = False) – Boolean value as expected by get_label_quality_multiannotator.

  • return_detailed_quality (bool, default = True) – Boolean value as expected by get_label_quality_multiannotator.

  • return_annotator_stats (bool, default = True) – Boolean value as expected by get_label_quality_multiannotator.

  • return_weights (bool, default = False) – Boolean value as expected by get_label_quality_multiannotator.

  • verbose (bool, default = True) – Boolean value as expected by get_label_quality_multiannotator.

  • label_quality_score_kwargs (dict, optional) – Keyword arguments in the same format expected by get_label_quality_multiannotator.

Return type:

Dict[str, Any]

Returns:

labels_info (dict) – Dictionary containing up to 5 entries, with keys as below:

label_quality (pandas.DataFrame)

Similar to the corresponding output of get_label_quality_multiannotator.

detailed_label_quality (pandas.DataFrame)

Similar to the corresponding output of get_label_quality_multiannotator.

annotator_stats (pandas.DataFrame)

Similar to the corresponding output of get_label_quality_multiannotator.

model_weight (np.ndarray)

Only returned if return_weights=True. An array of shape (P,) where P is the number of models in the ensemble, specifying the weight of each classifier model in weighted averages used to estimate label quality. These weights are estimates of how trustworthy each model is relative to the annotators.

annotator_weight (np.ndarray)

Only returned if return_weights=True. Similar to the corresponding output of get_label_quality_multiannotator.
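The ensemble variant expects the per-model predictions stacked along a leading axis of length P. As a rough illustration of that (P, N, K) layout, here is a plain-Python sketch using nested lists; the helper stack_ensemble_pred_probs is hypothetical and not part of cleanlab. With NumPy arrays, np.stack over the list of per-model (N, K) arrays would typically serve the same purpose.

```python
def stack_ensemble_pred_probs(per_model_pred_probs):
    """Stack P tables of shape (N, K) into one nested list of shape (P, N, K).

    Hypothetical helper. Raises ValueError if the models disagree on N or K.
    Returns the stacked nested list together with its shape (P, N, K).
    """
    p = len(per_model_pred_probs)
    if p == 0:
        raise ValueError("need predictions from at least one model")
    n = len(per_model_pred_probs[0])
    k = len(per_model_pred_probs[0][0])
    for idx, table in enumerate(per_model_pred_probs):
        if len(table) != n or any(len(row) != k for row in table):
            raise ValueError(f"model {idx} does not have shape ({n}, {k})")
    # Copy each row so the stacked structure does not alias the inputs.
    stacked = [[list(row) for row in table] for table in per_model_pred_probs]
    return stacked, (p, n, k)


# Two models, two examples, three classes.
model_a = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
model_b = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]

stacked, shape = stack_ensemble_pred_probs([model_a, model_b])
print(shape)  # (P, N, K)
```

Every model in the ensemble must score the same N examples in the same order, otherwise row n of one model would not refer to the same example as row n of another.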

cleanlab.multiannotator.get_active_learning_scores(labels_multiannotator, pred_probs, pred_probs_unlabeled=None)

Returns an active learning quality score for each example in the dataset.

We consider settings where one example can be labeled by one or more annotators and some examples have no labels at all so far.

The score is between 0 and 1, and can be used to prioritize what data to collect additional labels for. Lower scores indicate examples whose true label we are least confident about based on the current data; collecting additional labels for these low-scoring examples will be more informative than collecting labels for other examples. To use an annotation budget most efficiently, select a batch of examples with the lowest scores, collect one additional label for each of them, and repeat this process after retraining your classifier.

To analyze a fixed dataset labeled by multiple annotators rather than collecting additional labels, try the get_label_quality_multiannotator function instead.

Parameters:
  • labels_multiannotator (pd.DataFrame or np.ndarray) – 2D pandas DataFrame or array of multiple given labels for each example with shape (N, M), where N is the number of examples and M is the number of annotators. Note that this function also works with datasets where there is only one annotator (M=1). Labels should be in the same format expected by get_label_quality_multiannotator; see that function for more details. Note that examples that have no annotator labels should not be included in this DataFrame/array.

  • pred_probs (np.ndarray) – An array of shape (N, K) of predicted class probabilities from a trained classifier model, in the same format expected by get_label_quality_scores.

  • pred_probs_unlabeled (np.ndarray, optional) – An array of shape (N, K) of predicted class probabilities from a trained classifier model for examples that have no annotator labels (here N is the number of unlabeled examples, which need not equal the number of labeled examples). The probabilities are in the same format expected by get_label_quality_scores.

Return type:

Tuple[ndarray, ndarray]

Returns:

  • active_learning_scores (np.ndarray) – Array of shape (N,) indicating the active learning quality scores for each example. Examples with the lowest scores are those we should label next in order to maximally improve our classifier model.

  • active_learning_scores_unlabeled (np.ndarray) – Array of shape (N,) indicating the active learning quality scores for each unlabeled example (here N is the number of unlabeled examples). Returns an empty array if no unlabeled data is provided. Examples with the lowest scores are those we should label next in order to maximally improve our classifier model (scores for unlabeled data are directly comparable with the active_learning_scores for labeled data).
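Because the two returned score arrays are directly comparable, choosing what to label next amounts to pooling them and taking the lowest scores. The sketch below illustrates that selection step in plain Python; the helper select_batch and the score values are made up for illustration, and the scores are assumed to have already been computed by get_active_learning_scores.

```python
def select_batch(scores_labeled, scores_unlabeled, batch_size):
    """Return the batch_size lowest-scoring examples as (pool, index, score).

    Hypothetical helper: pools both score lists, since the scores for
    labeled and unlabeled examples are on the same scale.
    """
    pooled = [("labeled", i, s) for i, s in enumerate(scores_labeled)]
    pooled += [("unlabeled", i, s) for i, s in enumerate(scores_unlabeled)]
    pooled.sort(key=lambda item: item[2])  # lowest score first
    return pooled[:batch_size]


# Illustrative scores only; real values come from get_active_learning_scores.
scores_labeled = [0.91, 0.35, 0.72]
scores_unlabeled = [0.40, 0.15]

batch = select_batch(scores_labeled, scores_unlabeled, batch_size=2)
print(batch)
```

Here the batch contains one unlabeled example and one already-labeled example, which reflects the point made above: an additional label for a contested labeled example can be more informative than a first label for an unlabeled one.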

cleanlab.multiannotator.get_active_learning_scores_ensemble(labels_multiannotator, pred_probs, pred_probs_unlabeled=None)

Returns an active learning quality score for each example in the dataset, based on predictions from an ensemble of models.

This function is similar to get_active_learning_scores but allows for an ensemble of multiple classifier models to be trained and will aggregate predictions from the models to compute the active learning quality score.

Parameters:
  • labels_multiannotator (pd.DataFrame or np.ndarray) – Multiannotator labels in the same format expected by get_active_learning_scores. Note that this function also works with datasets where there is only one annotator (M=1).

  • pred_probs (np.ndarray) – An array of shape (P, N, K) where P is the number of models, consisting of predicted class probabilities from the ensemble models. Each set of predicted probabilities with shape (N, K) is in the same format expected by get_label_quality_scores.

  • pred_probs_unlabeled (np.ndarray, optional) – An array of shape (P, N, K) where P is the number of models, consisting of predicted class probabilities from a trained classifier model for examples that have no annotated labels so far (but which we may want to label in the future, and hence compute active learning quality scores for). Each set of predicted probabilities with shape (N, K) is in the same format expected by the get_label_quality_scores.

Return type:

Tuple[ndarray, ndarray]

Returns:

  • active_learning_scores (np.ndarray) – Similar to the corresponding output of get_active_learning_scores.

  • active_learning_scores_unlabeled (np.ndarray) – Similar to the corresponding output of get_active_learning_scores.

cleanlab.multiannotator.get_majority_vote_label(labels_multiannotator, pred_probs=None, verbose=True)

Returns the majority vote label for each example, aggregated from the labels given by multiple annotators.

Parameters:
  • labels_multiannotator (pd.DataFrame or np.ndarray) – 2D pandas DataFrame or array of multiple given labels for each example with shape (N, M), where N is the number of examples and M is the number of annotators. Labels should be in the same format expected by get_label_quality_multiannotator; see that function for more details.

  • pred_probs (np.ndarray, optional) – An array of shape (N, K) of model-predicted probabilities, P(label=k|x), in the same format expected by get_label_quality_multiannotator.

  • verbose (bool, optional) – Important warnings and other printed statements may be suppressed if verbose is set to False.

Return type:

ndarray

Returns:

consensus_label (np.ndarray) – An array of shape (N,) with the majority vote label aggregated from all annotators.

In the event of majority vote ties, ties are broken in the following order:

  1. using the model pred_probs (if provided) and selecting the class with the highest predicted probability,

  2. using the empirical class frequencies and selecting the class with the highest frequency,

  3. using an initial annotator quality score and selecting the class that has been labeled by annotators with higher quality,

  4. and lastly by random selection.
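To illustrate how the first tie-breakers interact, here is a plain-Python sketch of majority voting over a wide label table. It is not cleanlab's implementation: the helper majority_vote_sketch is hypothetical, only the pred_probs and class-frequency tie-breakers are implemented, and any tie that survives both is resolved by taking the smallest class index (a deterministic stand-in for the annotator-quality and random steps). Missing labels are math.nan and every example is assumed to have at least one label.

```python
import math
from collections import Counter


def is_missing(label):
    """True if this annotator gave no label (encoded as NaN)."""
    return isinstance(label, float) and math.isnan(label)


def majority_vote_sketch(labels, pred_probs=None):
    """Majority vote per row of a wide (N, M) label table.

    Hypothetical helper. Ties are broken by pred_probs (if given), then by
    overall class frequency, then by smallest class index.
    """
    # Empirical class frequencies over every label given in the dataset.
    class_freq = Counter(l for row in labels for l in row if not is_missing(l))
    consensus = []
    for i, row in enumerate(labels):
        votes = Counter(l for l in row if not is_missing(l))
        top = max(votes.values())
        tied = sorted(c for c, v in votes.items() if v == top)
        if len(tied) > 1 and pred_probs is not None:
            # Tie-breaker 1: keep the tied classes the model finds most likely.
            best = max(pred_probs[i][c] for c in tied)
            tied = [c for c in tied if pred_probs[i][c] == best]
        if len(tied) > 1:
            # Tie-breaker 2: keep the tied classes that are most frequent overall.
            best = max(class_freq[c] for c in tied)
            tied = [c for c in tied if class_freq[c] == best]
        consensus.append(tied[0])  # fallback: smallest remaining class index
    return consensus


nan = math.nan
labels = [
    [0, 0, 1],      # clear majority for class 0
    [1, nan, 2],    # tie between 1 and 2
    [2, 1, nan],    # tie between 1 and 2, model is also undecided
    [0, nan, nan],  # single annotation
]
pred_probs = [
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.4, 0.4],
    [0.5, 0.3, 0.2],
]

with_model = majority_vote_sketch(labels, pred_probs)
without_model = majority_vote_sketch(labels)
print(with_model)
print(without_model)
```

For the second example the model prefers class 2, so the tie goes to class 2 when pred_probs is supplied. Without pred_probs the tie falls through to class frequency, and class 1 wins because it appears three times in the dataset versus two for class 2.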

cleanlab.multiannotator.convert_long_to_wide_dataset(labels_multiannotator_long)

Converts a long format dataset to wide format which is suitable for passing into get_label_quality_multiannotator.

The input DataFrame must contain three columns named:

  1. task representing each example labeled by the annotators

  2. annotator representing each annotator

  3. label representing the label given by an annotator for the corresponding task (i.e. example)

Parameters:

labels_multiannotator_long (pd.DataFrame) – pandas DataFrame in long format with three columns named task, annotator and label

Return type:

DataFrame

Returns:

labels_multiannotator_wide (pd.DataFrame) – pandas DataFrame of the proper format to be passed as labels_multiannotator for the other cleanlab.multiannotator functions.
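The long-to-wide reshaping can be pictured with a small plain-Python sketch. This is an illustration of the data layout only, not cleanlab's implementation (which operates on pandas DataFrames): the helper long_to_wide is hypothetical, records are plain dicts with the keys task, annotator and label, missing entries are filled with math.nan, and if the same task and annotator pair appears twice the later record simply overwrites the earlier one here.

```python
import math


def long_to_wide(records):
    """Reshape long-format records into a wide (tasks x annotators) table.

    Hypothetical helper. `records` is a list of dicts with keys
    'task', 'annotator' and 'label'. Returns (tasks, annotators, wide),
    where wide[i][j] is the label annotator j gave to task i, or NaN.
    """
    tasks = sorted({r["task"] for r in records})
    annotators = sorted({r["annotator"] for r in records})
    row_of = {t: i for i, t in enumerate(tasks)}
    col_of = {a: j for j, a in enumerate(annotators)}
    # Start with all entries missing, then fill in the labels that exist.
    wide = [[math.nan] * len(annotators) for _ in tasks]
    for r in records:
        wide[row_of[r["task"]]][col_of[r["annotator"]]] = r["label"]
    return tasks, annotators, wide


records = [
    {"task": 0, "annotator": "a", "label": 1},
    {"task": 0, "annotator": "b", "label": 1},
    {"task": 1, "annotator": "a", "label": 0},
    {"task": 2, "annotator": "c", "label": 2},
    {"task": 2, "annotator": "b", "label": 0},
]

tasks, annotators, wide = long_to_wide(records)
print(annotators)
for task, row in zip(tasks, wide):
    print(task, row)
```

Each row of the result corresponds to one task and each column to one annotator, which matches the (N, M) layout with NaN for missing annotations that the other functions in this module expect.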