# filter
Methods to find label issues in token classification datasets (text data), where each token in a sentence receives its own class label.
The underlying algorithms are described in this paper.
Functions:

- `find_label_issues(labels, pred_probs, ...)`: Identifies tokens with label issues in a token classification dataset.
`cleanlab.token_classification.filter.find_label_issues(labels, pred_probs, *, return_indices_ranked_by='self_confidence', low_memory=False, **kwargs)`
Identifies tokens with label issues in a token classification dataset. Tokens identified with issues are ranked by their individual label quality score. If you prefer to rank sentences by their overall label quality instead, use `token_classification.rank.get_label_quality_scores`.

Parameters:
- **labels** (`list`): Nested list of given labels for all tokens, such that `labels[i]` is a list of labels, one for each token in the `i`-th sentence. For a dataset with K classes, each class label must be an integer in 0, 1, ..., K-1.
- **pred_probs** (`list`): List of np arrays, such that `pred_probs[i]` has shape `(T, K)` if the `i`-th sentence contains T tokens. Each row of `pred_probs[i]` corresponds to a token `t` in the `i`-th sentence, and contains model-predicted probabilities that `t` belongs to each of the K possible classes. Columns of each `pred_probs[i]` should be ordered such that the probabilities correspond to class 0, 1, ..., K-1.
- **return_indices_ranked_by** (`{"self_confidence", "normalized_margin", "confidence_weighted_entropy"}`, default `"self_confidence"`): Returned token indices are sorted by their label quality score. See the `cleanlab.filter.find_label_issues` documentation for more details on each label quality scoring method.
- **kwargs**: Additional keyword arguments to pass into `filter.find_label_issues`, which is internally applied at the token level. Can include values like `n_jobs` to control parallel processing, `frac_noise`, etc.
 
Return type: `List[Tuple[int, int]]`
Returns:

- **issues**: List of label issues identified by cleanlab, such that each element is a tuple `(i, j)`, which indicates that the `j`-th token of the `i`-th sentence has a label issue. These tuples are ordered in the `issues` list based on the likelihood that the corresponding token is mislabeled. Use `token_classification.summary.display_issues` to view these issues within the original sentences.
Examples

```python
>>> import numpy as np
>>> from cleanlab.token_classification.filter import find_label_issues
>>> labels = [[0, 0, 1], [0, 1]]
>>> pred_probs = [
...     np.array([[0.9, 0.1], [0.7, 0.3], [0.05, 0.95]]),
...     np.array([[0.8, 0.2], [0.8, 0.2]]),
... ]
>>> find_label_issues(labels, pred_probs)
[(1, 1)]
```
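To build intuition for what the default `"self_confidence"` ranking measures, here is a minimal plain-NumPy sketch. This is a hypothetical illustration, not cleanlab's actual implementation: it flags tokens whose given label disagrees with the model's argmax prediction, then ranks the flagged tokens by self-confidence, i.e. the predicted probability of the given label (lower means more likely mislabeled).

```python
import numpy as np

def find_issues_by_self_confidence(labels, pred_probs):
    """Toy sketch of self-confidence ranking (not cleanlab's real algorithm)."""
    issues = []
    for i, (sentence_labels, probs) in enumerate(zip(labels, pred_probs)):
        for j, label in enumerate(sentence_labels):
            # Flag token (i, j) if the model's most likely class disagrees
            # with the given label; record its self-confidence score.
            if probs[j].argmax() != label:
                issues.append(((i, j), probs[j][label]))
    # Lowest self-confidence first: most likely mislabeled tokens come first.
    issues.sort(key=lambda pair: pair[1])
    return [idx for idx, _ in issues]

labels = [[0, 0, 1], [0, 1]]
pred_probs = [
    np.array([[0.9, 0.1], [0.7, 0.3], [0.05, 0.95]]),
    np.array([[0.8, 0.2], [0.8, 0.2]]),
]
print(find_issues_by_self_confidence(labels, pred_probs))  # [(1, 1)]
```

On the example data above, only token `(1, 1)` is flagged: its given label is 1 but the model assigns it probability 0.2, predicting class 0 instead. The real `find_label_issues` additionally uses cleanlab's confident-learning machinery rather than a simple argmax disagreement check.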