# filter
Methods to find label issues in token classification datasets (text data), where each token in a sentence receives its own class label.
The underlying algorithms are described in this paper.
Functions:

- `find_label_issues(labels, pred_probs, ...)`: Identifies tokens with label issues in a token classification dataset.
`cleanlab.token_classification.filter.find_label_issues(labels, pred_probs, *, return_indices_ranked_by='self_confidence', low_memory=False, **kwargs)`
Identifies tokens with label issues in a token classification dataset. Tokens identified with issues are ranked by their individual label quality score. If you prefer to rank sentences by their overall label quality instead, use `token_classification.rank.get_label_quality_scores`.

Parameters:
- **labels** (`list`): Nested list of given labels for all tokens, such that `labels[i]` is a list of labels, one for each token in the `i`-th sentence. For a dataset with K classes, each class label must be an integer in 0, 1, ..., K-1.
- **pred_probs** (`list`): List of np arrays, such that `pred_probs[i]` has shape `(T, K)` if the `i`-th sentence contains T tokens. Each row of `pred_probs[i]` corresponds to a token `t` in the `i`-th sentence, and contains model-predicted probabilities that `t` belongs to each of the K possible classes. Columns of each `pred_probs[i]` should be ordered such that the probabilities correspond to class 0, 1, ..., K-1.
- **return_indices_ranked_by** (`{"self_confidence", "normalized_margin", "confidence_weighted_entropy"}`, default `"self_confidence"`): Returned token indices are sorted by their label quality score. See the `cleanlab.filter.find_label_issues` documentation for more details on each label quality scoring method.
- **kwargs**: Additional keyword arguments to pass into `filter.find_label_issues`, which is internally applied at the token level. Can include values like `n_jobs` to control parallel processing, `frac_noise`, etc.
 
Return type: `List[Tuple[int, int]]`
Returns:

- **issues**: List of label issues identified by cleanlab, such that each element is a tuple `(i, j)`, which indicates that the `j`-th token of the `i`-th sentence has a label issue. These tuples are ordered in the `issues` list based on the likelihood that the corresponding token is mislabeled. Use `token_classification.summary.display_issues` to view these issues within the original sentences.
Examples

```python
>>> import numpy as np
>>> from cleanlab.token_classification.filter import find_label_issues
>>> labels = [[0, 0, 1], [0, 1]]
>>> pred_probs = [
...     np.array([[0.9, 0.1], [0.7, 0.3], [0.05, 0.95]]),
...     np.array([[0.8, 0.2], [0.8, 0.2]]),
... ]
>>> find_label_issues(labels, pred_probs)
[(1, 1)]
```
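To build intuition for what the default `"self_confidence"` ranking measures, here is a minimal plain-NumPy sketch. This is a hypothetical illustration, not cleanlab's actual implementation: it flags tokens whose given label disagrees with the model's argmax prediction, then ranks the flagged tokens by self-confidence, i.e. the predicted probability of the given label (lower means more likely mislabeled).

```python
import numpy as np

def find_issues_by_self_confidence(labels, pred_probs):
    """Toy sketch of self-confidence ranking (not cleanlab's real algorithm)."""
    issues = []
    for i, (sentence_labels, probs) in enumerate(zip(labels, pred_probs)):
        for j, label in enumerate(sentence_labels):
            # Flag token (i, j) if the model's most likely class disagrees
            # with the given label; record its self-confidence score.
            if probs[j].argmax() != label:
                issues.append(((i, j), probs[j][label]))
    # Lowest self-confidence first: most likely mislabeled tokens come first.
    issues.sort(key=lambda pair: pair[1])
    return [idx for idx, _ in issues]

labels = [[0, 0, 1], [0, 1]]
pred_probs = [
    np.array([[0.9, 0.1], [0.7, 0.3], [0.05, 0.95]]),
    np.array([[0.8, 0.2], [0.8, 0.2]]),
]
print(find_issues_by_self_confidence(labels, pred_probs))  # [(1, 1)]
```

On the example data above, only token `(1, 1)` is flagged: its given label is 1 but the model assigns it probability 0.2, predicting class 0 instead. The real `find_label_issues` additionally uses cleanlab's confident-learning machinery rather than a simple argmax disagreement check.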