filter
Methods to find label issues in token classification datasets (text data), where each token in a sentence receives its own class label.
The underlying algorithms are described in this paper.
Functions:
find_label_issues | Identifies tokens with label issues in a token classification dataset.
- cleanlab.token_classification.filter.find_label_issues(labels, pred_probs, *, return_indices_ranked_by='self_confidence', **kwargs)
Identifies tokens with label issues in a token classification dataset.
Tokens identified with issues will be ranked by their individual label quality score.
Use token_classification.rank.get_label_quality_scores instead if you prefer to rank the sentences based on their overall label quality.
- Parameters:
labels (list) – Nested list of given labels for all tokens, such that labels[i] is a list of labels, one for each token in the i-th sentence. For a dataset with K classes, each class label must be an integer in 0, 1, …, K-1.
pred_probs (list) – List of NumPy arrays, such that pred_probs[i] has shape (T, K) if the i-th sentence contains T tokens. Each row of pred_probs[i] corresponds to a token t in the i-th sentence, and contains model-predicted probabilities that t belongs to each of the K possible classes. Columns of each pred_probs[i] should be ordered such that the probabilities correspond to class 0, 1, …, K-1.
return_indices_ranked_by ({"self_confidence", "normalized_margin", "confidence_weighted_entropy"}, default "self_confidence") – Returned token-indices are sorted by their label quality score. See cleanlab.filter.find_label_issues documentation for more details on each label quality scoring method.
kwargs – Additional keyword arguments to pass into filter.find_label_issues, which is internally applied at the token level. Can include values like n_jobs to control parallel processing, frac_noise, etc.
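As a rough illustration of what "applied at the token level" can mean, the sketch below flattens the nested inputs into token-level arrays, computes a self-confidence score for each token (taken here to be the predicted probability of the token's given label), and maps a flat position back to a (sentence, token) pair. The helpers flatten_token_inputs and flat_to_nested_index are hypothetical names invented for this example and are not part of cleanlab; the library's internal implementation may differ.

```python
import numpy as np


def flatten_token_inputs(labels, pred_probs):
    """Hypothetical helper: concatenate per-sentence inputs into token-level arrays."""
    flat_labels = np.array([label for sentence in labels for label in sentence])
    flat_pred_probs = np.vstack(pred_probs)  # shape (total_tokens, K)
    if flat_labels.shape[0] != flat_pred_probs.shape[0]:
        raise ValueError("labels and pred_probs must cover the same number of tokens")
    return flat_labels, flat_pred_probs


def flat_to_nested_index(labels, flat_index):
    """Hypothetical helper: map a flat token position back to (i, j)."""
    offset = 0
    for i, sentence in enumerate(labels):
        if flat_index < offset + len(sentence):
            return (i, flat_index - offset)
        offset += len(sentence)
    raise IndexError("flat_index is out of range")


labels = [[0, 0, 1], [0, 1]]
pred_probs = [
    np.array([[0.9, 0.1], [0.7, 0.3], [0.05, 0.95]]),
    np.array([[0.8, 0.2], [0.8, 0.2]]),
]

flat_labels, flat_pred_probs = flatten_token_inputs(labels, pred_probs)

# Self-confidence: predicted probability of each token's given label.
self_confidence = flat_pred_probs[np.arange(len(flat_labels)), flat_labels]

# The token with the lowest score is the most suspicious one.
lowest_token = flat_to_nested_index(labels, int(np.argmin(self_confidence)))
```

Here the second token of the second sentence has the lowest self-confidence (0.2), so lowest_token is (1, 1).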
- Return type:
List[Tuple[int, int]]
- Returns:
issues – List of label issues identified by cleanlab, such that each element is a tuple (i, j), which indicates that the j-th token of the i-th sentence has a label issue. These tuples are ordered in the issues list based on the likelihood that the corresponding token is mislabeled. Use token_classification.summary.display_issues to view these issues within the original sentences.
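Because the result is a plain list of (i, j) tuples, it can be post-processed with ordinary Python. The sketch below groups issues by sentence and looks up the flagged tokens. The issues list, the tokens list, and the helper group_issues_by_sentence are all made up for illustration and are not cleanlab output or API.

```python
from collections import defaultdict


def group_issues_by_sentence(issues):
    """Hypothetical helper: map sentence index -> flagged token indices, in ranked order."""
    grouped = defaultdict(list)
    for i, j in issues:
        grouped[i].append(j)
    return dict(grouped)


# Illustrative data only: tokenized sentences and an assumed issues list.
tokens = [["Alice", "visited", "Paris"], ["Bob", "slept"]]
issues = [(1, 1), (0, 2)]

grouped = group_issues_by_sentence(issues)

# Look up the text of each flagged token, preserving the ranking in `issues`.
flagged_tokens = [tokens[i][j] for i, j in issues]
```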
Examples
>>> import numpy as np
>>> from cleanlab.token_classification.filter import find_label_issues
>>> labels = [[0, 0, 1], [0, 1]]
>>> pred_probs = [
...     np.array([[0.9, 0.1], [0.7, 0.3], [0.05, 0.95]]),
...     np.array([[0.8, 0.2], [0.8, 0.2]]),
... ]
>>> find_label_issues(labels, pred_probs)
[(1, 1)]