summary#
Methods to display sentences and their label issues in a token classification dataset (text data), as well as summarize the types of issues identified.
Functions:
- display_issues — Display token classification label issues, showing sentences with problematic token(s) highlighted.
- common_label_issues — Display the tokens (words) that most commonly have label issues.
- filter_by_token — Return the subset of label issues involving a particular token.
- cleanlab.token_classification.summary.display_issues(issues, tokens, *, labels=None, pred_probs=None, exclude=[], class_names=None, top=20)[source]#
Display token classification label issues, showing each sentence with its problematic token(s) highlighted.
Can also show the given and predicted label for each token identified as having a label issue.
- Parameters:
  - issues (list) – List of tuples (i, j) representing a label issue for the j-th token of the i-th sentence. Same format as output by token_classification.filter.find_label_issues or token_classification.rank.issues_from_scores.
  - tokens (List[List[str]]) – Nested list such that tokens[i] is a list of tokens (strings/words) that comprise the i-th sentence.
  - labels (Optional[list]) – Optional nested list of given labels for all tokens, such that labels[i] is a list of labels, one for each token in the i-th sentence. For a dataset with K classes, each label must be in 0, 1, ..., K-1. If labels is provided, this function also displays the given label of each token identified with an issue.
  - pred_probs (Optional[list]) – Optional list of np arrays, such that pred_probs[i] has shape (T, K) if the i-th sentence contains T tokens. Each row of pred_probs[i] corresponds to a token t in the i-th sentence, and contains model-predicted probabilities that t belongs to each of the K possible classes. Columns of each pred_probs[i] should be ordered such that the probabilities correspond to class 0, 1, ..., K-1. If pred_probs is provided, this function also displays the predicted label of each token identified with an issue.
  - exclude (List[Tuple[int, int]]) – Optional list of given/predicted label swaps (tuples) to be ignored. For example, if exclude=[(0, 1), (1, 0)], tokens whose label was likely swapped between class 0 and 1 are not displayed. Class labels must be in 0, 1, ..., K-1.
  - class_names (Optional[List[str]]) – Optional length-K list of names of each class, such that class_names[i] is the string name of the class corresponding to labels with value i. If class_names is provided, these string names are displayed for predicted and given labels; otherwise the integer index of each class is displayed.
  - top (int, default 20) – Maximum number of issues to print.
Examples
>>> from cleanlab.token_classification.summary import display_issues
>>> issues = [(2, 0), (0, 1)]
>>> tokens = [
...     ["A", "?weird", "sentence"],
...     ["A", "valid", "sentence"],
...     ["An", "sentence", "with", "a", "typo"],
... ]
>>> display_issues(issues, tokens)
Sentence 2, token 0:
----
An sentence with a typo
...
...
Sentence 0, token 1:
----
A ?weird sentence
- Return type:
None
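To make the expected shapes of `labels` and `pred_probs` concrete, here is a minimal sketch using plain Python lists in place of np arrays. This is purely illustrative (the variable values and the `predicted_label` helper are hypothetical, not part of cleanlab): `labels[i]` holds one integer label per token, `pred_probs[i]` holds one length-K probability row per token, and a token's predicted label is the argmax of its row.

```python
# Sketch of the documented input formats (hypothetical example data).
tokens = [
    ["A", "?weird", "sentence"],
    ["A", "valid", "sentence"],
]
K = 2  # number of classes

# labels[i] gives one label per token of sentence i, each in 0, 1, ..., K-1.
labels = [[0, 1, 0], [0, 0, 0]]

# pred_probs[i] stands in for a (T, K) np array: one probability row per token,
# with columns ordered by class 0, 1, ..., K-1.
pred_probs = [
    [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]],
    [[0.8, 0.2], [0.6, 0.4], [0.9, 0.1]],
]

def predicted_label(row):
    # Class index with the highest predicted probability (argmax over the row).
    return max(range(len(row)), key=lambda k: row[k])

# Given vs. predicted label for a flagged token, e.g. sentence 0, token 1.
i, j = 0, 1
print(labels[i][j], predicted_label(pred_probs[i][j]))  # prints: 1 1
```

With real data you would typically build `pred_probs` from a trained model's per-token softmax outputs as np arrays; plain lists are used here only to show the nesting.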
- cleanlab.token_classification.summary.common_label_issues(issues, tokens, *, labels=None, pred_probs=None, class_names=None, top=10, exclude=[], verbose=True)[source]#
Display the tokens (words) that most commonly have label issues.
These may correspond to words that are ambiguous or systematically misunderstood by the data annotators.
- Parameters:
  - issues (List[Tuple[int, int]]) – List of tuples (i, j) representing a label issue for the j-th token of the i-th sentence. Same format as output by token_classification.filter.find_label_issues or token_classification.rank.issues_from_scores.
  - tokens (List[List[str]]) – Nested list such that tokens[i] is a list of tokens (strings/words) that comprise the i-th sentence.
  - labels (Optional[list]) – Optional nested list of given labels for all tokens, in the same format as labels for token_classification.summary.display_issues. If labels is provided, this function also displays the given label of each token identified to commonly suffer from label issues.
  - pred_probs (Optional[list]) – Optional list of model-predicted probabilities (np arrays), in the same format as pred_probs for token_classification.summary.display_issues. If both labels and pred_probs are provided, this function also reports each type of given/predicted label swap for tokens identified to commonly suffer from label issues.
  - class_names (Optional[List[str]]) – Optional length-K list of names of each class, such that class_names[i] is the string name of the class corresponding to labels with value i. If class_names is provided, these string names are displayed for predicted and given labels; otherwise the integer index of each class is displayed.
  - top (int) – Maximum number of tokens to print information for.
  - exclude (List[Tuple[int, int]]) – Optional list of given/predicted label swaps (tuples) to be ignored, in the same format as exclude for token_classification.summary.display_issues.
  - verbose (bool) – Whether to also print out the token information in the returned DataFrame df.
- Return type:
DataFrame
- Returns:
  df – If both labels and pred_probs are provided, DataFrame df contains columns ['token', 'given_label', 'predicted_label', 'num_label_issues'], and each row contains information for a specific token and given/predicted label swap, ordered by the number of label issues inferred for this type of label swap. Otherwise, df only has columns ['token', 'num_label_issues'], and each row contains the information for a specific token, ordered by the number of total label issues involving this token.
Examples
>>> from cleanlab.token_classification.summary import common_label_issues
>>> issues = [(2, 0), (0, 1)]
>>> tokens = [
...     ["A", "?weird", "sentence"],
...     ["A", "valid", "sentence"],
...     ["An", "sentence", "with", "a", "typo"],
... ]
>>> df = common_label_issues(issues, tokens)
>>> df
    token  num_label_issues
0      An                 1
1  ?weird                 1
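The num_label_issues column amounts to a per-token tally of how often each token string appears among the flagged (i, j) positions. A minimal sketch of that counting with `collections.Counter` (illustrative only; the actual cleanlab implementation may differ):

```python
from collections import Counter

# Hypothetical example data matching the doctest above.
issues = [(2, 0), (0, 1)]
tokens = [
    ["A", "?weird", "sentence"],
    ["A", "valid", "sentence"],
    ["An", "sentence", "with", "a", "typo"],
]

# Map each issue (i, j) to its token string, then tally how often each
# token string occurs among the flagged issues.
counts = Counter(tokens[i][j] for i, j in issues)
print(counts.most_common())  # [('An', 1), ('?weird', 1)]
```

The real function additionally groups by given/predicted label swap when `labels` and `pred_probs` are supplied; this sketch only shows the token-frequency part.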
- cleanlab.token_classification.summary.filter_by_token(token, issues, tokens)[source]#
Return subset of label issues involving a particular token.
- Parameters:
  - token (str) – A specific token you are interested in.
  - issues (List[Tuple[int, int]]) – List of tuples (i, j) representing a label issue for the j-th token of the i-th sentence. Same format as output by token_classification.filter.find_label_issues or token_classification.rank.issues_from_scores.
  - tokens (List[List[str]]) – Nested list such that tokens[i] is a list of tokens (strings/words) that comprise the i-th sentence.
- Return type:
  List[Tuple[int, int]]
- Returns:
  issues_subset – List of tuples (i, j) representing a label issue for the j-th token of the i-th sentence, in the same format as issues, but restricted to only those issues that involve the specified token.
Examples
>>> from cleanlab.token_classification.summary import filter_by_token
>>> token = "?weird"
>>> issues = [(2, 0), (0, 1)]
>>> tokens = [
...     ["A", "?weird", "sentence"],
...     ["A", "valid", "sentence"],
...     ["An", "sentence", "with", "a", "typo"],
... ]
>>> filter_by_token(token, issues, tokens)
[(0, 1)]
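The filtering semantics can be sketched in one line: keep only the issue tuples whose (i, j) position holds the requested token. The helper name below is hypothetical and the sketch is illustrative; the actual cleanlab implementation may differ.

```python
def filter_by_token_sketch(token, issues, tokens):
    # Keep only issues (i, j) where the j-th token of sentence i equals `token`.
    return [(i, j) for (i, j) in issues if tokens[i][j] == token]

# Hypothetical example data matching the doctest above.
issues = [(2, 0), (0, 1)]
tokens = [
    ["A", "?weird", "sentence"],
    ["A", "valid", "sentence"],
    ["An", "sentence", "with", "a", "typo"],
]
print(filter_by_token_sketch("?weird", issues, tokens))  # [(0, 1)]
```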