Methods to rank and score sentences in a token classification dataset (text data), based on how likely they are to contain label errors.

The underlying algorithms are described in this paper.


get_label_quality_scores(labels, pred_probs, *)

Returns overall quality scores for the labels in each sentence, as well as for the individual tokens' labels in a token classification dataset.

issues_from_scores(sentence_scores, *[, ...])

Converts scores output by get_label_quality_scores to a list of issues in a format similar to the output of token_classification.filter.find_label_issues.

cleanlab.token_classification.rank.get_label_quality_scores(labels, pred_probs, *, tokens=None, token_score_method='self_confidence', sentence_score_method='min', sentence_score_kwargs={})

Returns overall quality scores for the labels in each sentence, as well as for the individual tokens’ labels in a token classification dataset.

Each score is between 0 and 1.

Lower scores indicate token labels that are less likely to be correct, or sentences that are more likely to contain a mislabeled token.

  • labels (list) –

    Nested list of given labels for all tokens, such that labels[i] is a list of labels, one for each token in the i-th sentence.

    For a dataset with K classes, each label must be in 0, 1, …, K-1.

  • pred_probs (list) –

    List of np arrays, such that pred_probs[i] has shape (T, K) if the i-th sentence contains T tokens.

    Each row of pred_probs[i] corresponds to a token t in the i-th sentence, and contains model-predicted probabilities that t belongs to each of the K possible classes.

    Columns of each pred_probs[i] should be ordered such that the probabilities correspond to class 0, 1, …, K-1.

  • tokens (Optional[list]) –

    Nested list such that tokens[i] is a list of tokens (strings/words) that comprise the i-th sentence.

    These strings are used to annotate the returned token_scores object; see its documentation for more information.

  • sentence_score_method ({"min", "softmin"}, default "min") –

    Method to aggregate individual token label quality scores into a single score for the sentence.

    • min: sentence score = minimum of token scores in the sentence

    • softmin: sentence score = <s, softmax(1-s, t)>, where s denotes the token label scores of the sentence, and <a, b> == np.dot(a, b). Here parameter t controls the softmax temperature, such that the score converges toward min as t -> 0. Unlike min, softmin is affected by the scores of all tokens in the sentence.

  • token_score_method ({"self_confidence", "normalized_margin", "confidence_weighted_entropy"}, default "self_confidence") –

    Label quality scoring method for each token.

    See cleanlab.rank.get_label_quality_scores documentation for more info.

  • sentence_score_kwargs (dict) –

    Optional keyword arguments for sentence_score_method function (for advanced users only).

    See cleanlab.token_classification.rank._softmin_sentence_score for more info about keyword arguments supported for that scoring method.

Return type:

Tuple[ndarray, list]


  • sentence_scores – Array of shape (N, ) of scores between 0 and 1, one per sentence in the dataset.

    Lower scores indicate sentences more likely to contain a label issue.

  • token_scores – List of pd.Series, such that token_scores[i] contains the label quality scores for individual tokens in the i-th sentence.

    If token strings were provided, they are used as the index of each Series.
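
To make the default token_score_method concrete, here is a minimal sketch of self-confidence scoring, in which a token's score is the model-predicted probability of its given label. It is an illustration in plain NumPy rather than cleanlab's implementation, and the helper name self_confidence_scores is hypothetical.

```python
import numpy as np


def self_confidence_scores(labels, pred_probs):
    """Illustrative only: per-token probability assigned to the given label.

    labels[i] is a list of T integer labels and pred_probs[i] has shape (T, K).
    """
    scores = []
    for sentence_labels, probs in zip(labels, pred_probs):
        rows = np.arange(len(sentence_labels))
        # Pick, for each token (row), the column of its given label.
        scores.append(probs[rows, sentence_labels])
    return scores


labels = [[0, 0, 1], [0, 1]]
pred_probs = [
    np.array([[0.9, 0.1], [0.7, 0.3], [0.05, 0.95]]),
    np.array([[0.8, 0.2], [0.8, 0.2]]),
]
per_token = self_confidence_scores(labels, pred_probs)
# With sentence_score_method="min", each sentence score is the lowest token score.
per_sentence = np.array([scores.min() for scores in per_token])
```

The second sentence scores low because its second token is labeled class 1 while the model assigns that class a probability of only 0.2.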


>>> import numpy as np
>>> from cleanlab.token_classification.rank import get_label_quality_scores
>>> labels = [[0, 0, 1], [0, 1]]
>>> pred_probs = [
...     np.array([[0.9, 0.1], [0.7, 0.3], [0.05, 0.95]]),
...     np.array([[0.8, 0.2], [0.8, 0.2]]),
... ]
>>> sentence_scores, token_scores = get_label_quality_scores(labels, pred_probs)
>>> sentence_scores
array([0.7, 0.2])
>>> token_scores
[0    0.90
1    0.70
2    0.95
dtype: float64, 0    0.8
1    0.2
dtype: float64]
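
The two sentence_score_method options can be compared with a short NumPy sketch of the formulas given in the parameter description. The helper names and temperature values below are assumptions chosen for illustration, and the softmax is taken to be exp(x / t) normalized to sum to 1; consult the library source for the actual implementation.

```python
import numpy as np


def min_sentence_score(s):
    """Sentence score = minimum token score."""
    return float(np.min(s))


def softmin_sentence_score(s, t):
    """Sentence score = <s, softmax(1 - s, t)> with temperature t (sketch)."""
    s = np.asarray(s, dtype=float)
    z = (1.0 - s) / t
    z -= z.max()  # shift for numerical stability; softmax is unchanged
    weights = np.exp(z) / np.exp(z).sum()
    return float(np.dot(s, weights))


token_scores = np.array([0.9, 0.7, 0.95])  # first sentence from the example above
hard = min_sentence_score(token_scores)
soft = softmin_sentence_score(token_scores, t=0.05)  # t chosen for illustration
near_min = softmin_sentence_score(token_scores, t=0.001)  # approaches min as t -> 0
```

Because softmin is a weighted average of all token scores, it is never below min, and the gap shrinks as the temperature decreases.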
cleanlab.token_classification.rank.issues_from_scores(sentence_scores, *, token_scores=None, threshold=0.1)

Converts scores output by cleanlab.token_classification.rank.get_label_quality_scores to a list of issues in a format similar to the output of token_classification.filter.find_label_issues.

Issues are sorted by label quality score, from most to least severe.

Only tokens with a label quality score lower than threshold are considered issues, so this parameter determines how many issues are returned. This method is intended for converting the most severely mislabeled examples to a format compatible with summary methods like token_classification.summary.display_issues. Because the threshold is arbitrary, this method does not estimate the number of label errors. For that, use token_classification.filter.find_label_issues, which estimates the label errors via Confident Learning rather than score thresholding.

  • sentence_scores (ndarray) –

    Array of shape (N, ) of overall sentence scores, where N is the number of sentences in the dataset.

    Same format as the sentence_scores returned by cleanlab.token_classification.rank.get_label_quality_scores.

  • token_scores (Optional[list]) –

    Optional list such that token_scores[i] contains the individual token scores for the i-th sentence.

    Same format as the token_scores returned by cleanlab.token_classification.rank.get_label_quality_scores.

  • threshold (float) – Tokens (or sentences, if token_scores is not provided) with quality scores at or above the threshold are not included in the result.

Return type:

Union[list, ndarray]


issues – List of label issues identified by comparing quality scores to threshold, such that each element is a tuple (i, j), which indicates that the j-th token of the i-th sentence has a label issue.

The tuples in the issues list are ordered by token label quality score, from lowest (most severe) to highest.

Use token_classification.summary.display_issues to view these issues within the original sentences.

If token_scores is not provided, returns an array of integer indices (rather than tuples) of the sentences whose label quality score falls below the threshold, also sorted by the overall label quality score of each sentence.


>>> import numpy as np
>>> from cleanlab.token_classification.rank import issues_from_scores
>>> sentence_scores = np.array([0.1, 0.3, 0.6, 0.2, 0.05, 0.9, 0.8, 0.0125, 0.5, 0.6])
>>> issues_from_scores(sentence_scores)
array([7, 4])

Changing the score threshold:

>>> issues_from_scores(sentence_scores, threshold=0.5)
array([7, 4, 0, 3, 1])

Providing token scores along with sentence scores finds issues at the token level:

>>> token_scores = [
...     [0.9, 0.6],
...     [0.0, 0.8, 0.8],
...     [0.8, 0.8],
...     [0.1, 0.02, 0.3, 0.4],
...     [0.1, 0.2, 0.03, 0.4],
...     [0.1, 0.2, 0.3, 0.04],
...     [0.1, 0.2, 0.4],
...     [0.3, 0.4],
...     [0.08, 0.2, 0.5, 0.4],
...     [0.1, 0.2, 0.3, 0.4],
... ]
>>> issues_from_scores(sentence_scores, token_scores=token_scores)
[(1, 0), (3, 1), (4, 2), (5, 3), (8, 0)]
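
The thresholding and ordering behavior shown in these examples can be sketched in a few lines. This is an illustrative reimplementation, assuming that scores strictly below threshold are flagged and that results are sorted in ascending score order, as the outputs above suggest. The name sketch_issues_from_scores is hypothetical and not part of cleanlab.

```python
import numpy as np


def sketch_issues_from_scores(sentence_scores, token_scores=None, threshold=0.1):
    """Illustrative only: flag scores below threshold, most severe first."""
    if token_scores is None:
        # Sentence level: indices of low-scoring sentences, lowest score first.
        flagged = np.flatnonzero(sentence_scores < threshold)
        order = np.argsort(sentence_scores[flagged], kind="stable")
        return flagged[order]
    # Token level: collect (score, sentence index, token index) triples.
    flagged = [
        (score, i, j)
        for i, scores in enumerate(token_scores)
        for j, score in enumerate(scores)
        if score < threshold
    ]
    flagged.sort(key=lambda item: item[0])
    return [(i, j) for _, i, j in flagged]


sentence_scores = np.array([0.1, 0.3, 0.6, 0.2, 0.05, 0.9, 0.8, 0.0125, 0.5, 0.6])
token_scores = [
    [0.9, 0.6],
    [0.0, 0.8, 0.8],
    [0.8, 0.8],
    [0.1, 0.02, 0.3, 0.4],
    [0.1, 0.2, 0.03, 0.4],
    [0.1, 0.2, 0.3, 0.04],
    [0.1, 0.2, 0.4],
    [0.3, 0.4],
    [0.08, 0.2, 0.5, 0.4],
    [0.1, 0.2, 0.3, 0.4],
]
sentence_issues = sketch_issues_from_scores(sentence_scores)
token_issues = sketch_issues_from_scores(sentence_scores, token_scores=token_scores)
```

Note that when token_scores is supplied, this sketch selects issues from the token scores alone, so a sentence can contribute an issue even if its sentence score is above the threshold.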