multiannotator#
Methods for analysis of classification data labeled by multiple annotators, including computation of:

- A consensus label for each example that aggregates the individual annotations more accurately than majority vote or other crowdsourcing algorithms.
- A quality score for each consensus label which measures our confidence that this label is correct.
- An analogous label quality score for each individual label chosen by one annotator for a particular example.
- An overall quality score for each annotator which measures our confidence in the overall correctness of labels obtained from this annotator.
Functions:

- convert_long_to_wide_dataset: Converts a long format dataset to wide format which is suitable for passing into get_label_quality_multiannotator.
- get_label_quality_multiannotator: Returns label quality scores for each example and for each annotator.
- get_majority_vote_label: Returns the majority vote label for each example, aggregated from the labels given by multiple annotators.
- cleanlab.multiannotator.convert_long_to_wide_dataset(labels_multiannotator_long)[source]#
Converts a long format dataset to wide format which is suitable for passing into get_label_quality_multiannotator.

The DataFrame must contain three columns named:

- task: representing each example labeled by the annotators
- annotator: representing each annotator
- label: representing the label given by an annotator for the corresponding task (i.e. example)
- Parameters:
  labels_multiannotator_long (pd.DataFrame) – pandas DataFrame in long format with three columns named task, annotator and label.
- Return type:
DataFrame
- Returns:
  labels_multiannotator_wide (pd.DataFrame) – pandas DataFrame of the proper format to be passed as labels_multiannotator for the other cleanlab.multiannotator functions.
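A minimal sketch of the conversion (the task IDs, annotator IDs, and labels below are toy values made up purely for illustration):

```python
import pandas as pd
from cleanlab.multiannotator import convert_long_to_wide_dataset

# Long format: one row per (task, annotator, label) annotation.
labels_long = pd.DataFrame(
    {
        "task": [0, 0, 1, 1, 2],                      # example IDs
        "annotator": ["a1", "a2", "a1", "a3", "a2"],  # annotator IDs
        "label": [0, 0, 1, 2, 1],                     # class labels in 0, ..., K-1
    }
)

# Wide format: one row per example, one column per annotator, NaN where unlabeled.
labels_wide = convert_long_to_wide_dataset(labels_long)
print(labels_wide)
```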
- cleanlab.multiannotator.get_label_quality_multiannotator(labels_multiannotator, pred_probs, *, consensus_method='best_quality', quality_method='crowdlab', return_detailed_quality=True, return_annotator_stats=True, verbose=True, label_quality_score_kwargs={})[source]#
Returns label quality scores for each example and for each annotator.
This function is for multiclass classification datasets where examples have been labeled by multiple annotators (not necessarily the same number of annotators per example).
It computes one consensus label for each example that best accounts for the labels chosen by each annotator (and their quality), as well as a score for how confident we are that this consensus label is actually correct. It also computes quality scores for each annotator's individual labels, and the overall quality of each annotator.
Label quality scores are between 0 and 1; lower scores indicate labels less likely to be correct. For example:

- 1 - clean label (the given label is likely correct).
- 0 - dirty label (the given label is unlikely to be correct).
- Parameters:
  labels_multiannotator (pd.DataFrame or np.ndarray) – 2D pandas DataFrame or array of multiple given labels for each example with shape (N, M), where N is the number of examples and M is the number of annotators. labels_multiannotator[n][m] = label for n-th example given by m-th annotator. For a dataset with K classes, each given label must be an integer in 0, 1, ..., K-1 or NaN if this annotator did not label a particular example. If pd.DataFrame, column names should correspond to each annotator's ID.

  pred_probs (np.ndarray) – An array of shape (N, K) of predicted class probabilities from a trained classifier model, in the same format expected by get_label_quality_scores.
  consensus_method (str or List[str], default = "best_quality") – Specifies the method used to aggregate labels from multiple annotators into a single consensus label. Options include:

  - majority_vote: consensus obtained using a simple majority vote among annotators, with ties broken via pred_probs.
  - best_quality: consensus obtained by selecting the label with highest label quality (quality determined by the method specified in quality_method).

  A List may be passed if you want to consider multiple methods for producing consensus labels. If a List is passed, the 0th element of the list is the method used to produce the columns consensus_label, consensus_quality_score, and annotator_agreement in the returned DataFrame. The remaining (1st, 2nd, 3rd, etc.) elements of this list are output as extra columns in the returned pandas DataFrame with names formatted as consensus_label_SUFFIX and consensus_quality_score_SUFFIX, where SUFFIX is each element of this list, which must correspond to a valid method for computing consensus labels.
  quality_method (str, default = "crowdlab") – Specifies the method used to calculate the quality of the consensus label. Options include:

  - crowdlab: an ensemble method that weighs both the annotators' labels as well as the model's prediction.
  - agreement: the fraction of annotators that agree with the consensus label.
  return_detailed_quality (bool, default = True) – Boolean to specify if detailed_label_quality is returned.

  return_annotator_stats (bool, default = True) – Boolean to specify if annotator_stats is returned.

  verbose (bool, default = True) – Important warnings and other printed statements may be suppressed if verbose is set to False.

  label_quality_score_kwargs (dict, optional) – Keyword arguments to pass into get_label_quality_scores.
- Return type:
  Dict[str, DataFrame]
- Returns:
  labels_info (dict) – Dictionary containing up to 3 pandas DataFrames, with keys as below:

  label_quality : pandas.DataFrame
    pandas DataFrame in which each row corresponds to one example, with columns:

    - num_annotations: the number of annotators that have labeled each example.
    - consensus_label: the single label that is best for each example (you can control how it is derived from all annotators' labels via the argument consensus_method).
    - annotator_agreement: the fraction of annotators that agree with the consensus label (only considering the annotators that labeled that particular example).
    - consensus_quality_score: label quality score for the consensus label, calculated by the method specified in quality_method.

  detailed_label_quality : pandas.DataFrame (returned if return_detailed_quality=True)
    pandas DataFrame with columns quality_annotator_1, quality_annotator_2, ..., quality_annotator_M where each entry is the label quality score for the labels provided by each annotator (NaN for examples which this annotator did not label).

  annotator_stats : pandas.DataFrame (returned if return_annotator_stats=True)
    Overall statistics about each annotator, sorted by lowest annotator_quality first. pandas DataFrame in which each row corresponds to one annotator (the row IDs correspond to annotator IDs), with columns:

    - annotator_quality: overall quality of a given annotator's labels, calculated by the method specified in quality_method.
    - num_examples_labeled: number of examples annotated by a given annotator.
    - agreement_with_consensus: fraction of examples where a given annotator agrees with the consensus label.
    - worst_class: the class that is most frequently mislabeled by a given annotator.
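A minimal usage sketch (the wide-format labels and pred_probs below are fabricated toy values; in practice pred_probs should come from a trained classifier, ideally computed out-of-sample):

```python
import numpy as np
import pandas as pd
from cleanlab.multiannotator import get_label_quality_multiannotator

# Wide-format labels: rows = examples, columns = annotator IDs, NaN = not labeled.
labels_multiannotator = pd.DataFrame(
    {
        "a1": [0, 1, 2, np.nan, 0],
        "a2": [0, 1, np.nan, 2, 1],
        "a3": [0, np.nan, 2, 2, 0],
    }
)

# Model-predicted class probabilities of shape (N, K); values are made up here.
pred_probs = np.array(
    [
        [0.8, 0.1, 0.1],
        [0.2, 0.7, 0.1],
        [0.1, 0.2, 0.7],
        [0.1, 0.1, 0.8],
        [0.6, 0.3, 0.1],
    ]
)

results = get_label_quality_multiannotator(labels_multiannotator, pred_probs)

label_quality = results["label_quality"]                    # per-example consensus labels + quality scores
detailed_label_quality = results["detailed_label_quality"]  # per-annotator label quality scores
annotator_stats = results["annotator_stats"]                # overall quality of each annotator
```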
- cleanlab.multiannotator.get_majority_vote_label(labels_multiannotator, pred_probs=None, verbose=True)[source]#
Returns the majority vote label for each example, aggregated from the labels given by multiple annotators.
- Parameters:
  labels_multiannotator (pd.DataFrame or np.ndarray) – 2D pandas DataFrame or array of multiple given labels for each example with shape (N, M), where N is the number of examples and M is the number of annotators. Labels should be in the same format expected by get_label_quality_multiannotator.

  pred_probs (np.ndarray, optional) – An array of shape (N, K) of model-predicted probabilities, P(label=k|x), in the same format expected by get_label_quality_multiannotator.

  verbose (bool, optional) – Important warnings and other printed statements may be suppressed if verbose is set to False.
- Return type:
ndarray
- Returns:
  consensus_label (np.ndarray) – An array of shape (N,) with the majority vote label aggregated from all annotators.

  In the event of majority vote ties, ties are broken in the following order:

  1. using the model pred_probs (if provided) and selecting the class with highest predicted probability,
  2. using the empirical class frequencies and selecting the class with highest frequency,
  3. using an initial annotator quality score and selecting the class that has been labeled by annotators with higher quality,
  4. and lastly by random selection.
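A minimal sketch (the labels and pred_probs below are toy values invented for illustration; the second example is a deliberate tie that pred_probs breaks):

```python
import numpy as np
import pandas as pd
from cleanlab.multiannotator import get_majority_vote_label

# Wide-format labels; NaN marks examples an annotator did not label.
labels_multiannotator = pd.DataFrame(
    {
        "a1": [0, 1, 2, np.nan],
        "a2": [0, 2, np.nan, 1],
        "a3": [np.nan, np.nan, 2, 1],
    }
)

# Optional pred_probs of shape (N, K), used to break ties
# (e.g. row 1, where annotators disagree 1 vs. 2); values are made up.
pred_probs = np.array(
    [
        [0.7, 0.2, 0.1],
        [0.2, 0.5, 0.3],
        [0.1, 0.2, 0.7],
        [0.2, 0.6, 0.2],
    ]
)

consensus_label = get_majority_vote_label(labels_multiannotator, pred_probs=pred_probs)
print(consensus_label)  # shape (N,): one aggregated label per example
```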