summary

Methods to display sentences and their label issues in a token classification dataset (text data), as well as summarize the types of issues identified.

Functions:

display_issues(issues, tokens, *[, labels, ...])

Display token classification label issues, showing sentence with problematic token(s) highlighted.

common_label_issues(issues, tokens, *[, ...])

Display the tokens (words) that most commonly have label issues.

filter_by_token(token, issues, tokens)

Return subset of label issues involving a particular token.

cleanlab.token_classification.summary.display_issues(issues, tokens, *, labels=None, pred_probs=None, exclude=[], class_names=None, top=20)

Display token classification label issues, showing sentence with problematic token(s) highlighted.

Can also show the given and predicted label for each token identified to have a label issue.

Parameters:
  • issues (list) –

    List of tuples (i, j) representing a label issue for the j-th token of the i-th sentence.

    Same format as output by token_classification.filter.find_label_issues or token_classification.rank.issues_from_scores.

  • tokens (List[List[str]]) – Nested list such that tokens[i] is a list of tokens (strings/words) that comprise the i-th sentence.

  • labels (Optional[list]) –

    Optional nested list of given labels for all tokens, such that labels[i] is a list of labels, one for each token in the i-th sentence. For a dataset with K classes, each label must be in 0, 1, …, K-1.

    If labels is provided, this function also displays the given label of each token identified with an issue.

  • pred_probs (Optional[list]) –

    Optional list of NumPy arrays, such that pred_probs[i] has shape (T, K) if the i-th sentence contains T tokens.

    Each row of pred_probs[i] corresponds to a token t in the i-th sentence, and contains model-predicted probabilities that t belongs to each of the K possible classes.

    Columns of each pred_probs[i] should be ordered such that the probabilities correspond to class 0, 1, …, K-1.

    If pred_probs is provided, this function also displays the predicted label of each token identified with an issue.

  • exclude (List[Tuple[int, int]]) – Optional list of given/predicted label swaps (tuples) to be ignored. For example, if exclude=[(0, 1), (1, 0)], tokens whose label was likely swapped between class 0 and 1 are not displayed. Class labels must be in 0, 1, …, K-1.

  • class_names (Optional[List[str]]) –

    Optional length K list of names of each class, such that class_names[i] is the string name of the class corresponding to labels with value i.

    If class_names is provided, these string names are displayed for the predicted and given labels; otherwise the integer indices of the classes are displayed.

  • top (int, default 20) – Maximum number of issues to be printed.

Return type:

None

Examples

>>> from cleanlab.token_classification.summary import display_issues
>>> issues = [(2, 0), (0, 1)]
>>> tokens = [
...     ["A", "?weird", "sentence"],
...     ["A", "valid", "sentence"],
...     ["An", "sentence", "with", "a", "typo"],
... ]
>>> display_issues(issues, tokens)
Sentence 2, token 0:
----
An sentence with a typo
...
...
Sentence 0, token 1:
----
A ?weird sentence
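
The pred_probs format described above can be sketched as follows. The probabilities here are randomly generated purely to illustrate the required shapes; a real workflow would use the per-token class probabilities produced by a trained token classification model:

```python
import numpy as np

tokens = [
    ["A", "?weird", "sentence"],
    ["A", "valid", "sentence"],
    ["An", "sentence", "with", "a", "typo"],
]
K = 3  # number of classes in this toy example

# pred_probs[i] must have shape (T, K), where T = len(tokens[i]).
# These values are fabricated for illustration only.
rng = np.random.default_rng(0)
pred_probs = []
for sentence in tokens:
    p = rng.random((len(sentence), K))
    p /= p.sum(axis=1, keepdims=True)  # each row sums to 1 across the K classes
    pred_probs.append(p)

# Sanity checks on the required format:
for sentence, p in zip(tokens, pred_probs):
    assert p.shape == (len(sentence), K)
    assert np.allclose(p.sum(axis=1), 1.0)
```

With pred_probs (and optionally labels and class_names) supplied in this format, display_issues additionally prints the given and predicted label for each flagged token.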
cleanlab.token_classification.summary.common_label_issues(issues, tokens, *, labels=None, pred_probs=None, class_names=None, top=10, exclude=[], verbose=True)

Display the tokens (words) that most commonly have label issues.

These may correspond to words that are ambiguous or systematically misunderstood by the data annotators.

Parameters:
  • issues (List[Tuple[int, int]]) –

    List of tuples (i, j) representing a label issue for the j-th token of the i-th sentence.

    Same format as output by token_classification.filter.find_label_issues or token_classification.rank.issues_from_scores.

  • tokens (List[List[str]]) – Nested list such that tokens[i] is a list of tokens (strings/words) that comprise the i-th sentence.

  • labels (Optional[list]) –

    Optional nested list of given labels for all tokens, in the same format as labels for display_issues.

    If labels is provided, this function also displays the given label of each token identified to commonly suffer from label issues.

  • pred_probs (Optional[list]) –

    Optional list of model-predicted probabilities (NumPy arrays) in the same format as pred_probs for display_issues.

    If both labels and pred_probs are provided, also reports each type of given/predicted label swap for tokens identified to commonly suffer from label issues.

  • class_names (Optional[List[str]]) –

    Optional length K list of names of each class, such that class_names[i] is the string name of the class corresponding to labels with value i.

    If class_names is provided, these string names are displayed for the predicted and given labels; otherwise the integer indices of the classes are displayed.

  • top (int) – Maximum number of tokens to print information for.

  • exclude (List[Tuple[int, int]]) – Optional list of given/predicted label swaps (tuples) to be ignored, in the same format as exclude for display_issues.

  • verbose (bool) – Whether to also print out the token information in the returned DataFrame df.

Return type:

DataFrame

Returns:

df – If both labels and pred_probs are provided, DataFrame df contains columns ['token', 'given_label', 'predicted_label', 'num_label_issues'], and each row contains information for a specific token and given/predicted label swap, ordered by the number of label issues inferred for this type of label swap.

Otherwise, df only has columns ['token', 'num_label_issues'], and each row contains the information for a specific token, ordered by the total number of label issues involving this token.

Examples

>>> from cleanlab.token_classification.summary import common_label_issues
>>> issues = [(2, 0), (0, 1)]
>>> tokens = [
...     ["A", "?weird", "sentence"],
...     ["A", "valid", "sentence"],
...     ["An", "sentence", "with", "a", "typo"],
... ]
>>> df = common_label_issues(issues, tokens)
>>> df
    token  num_label_issues
0      An                 1
1  ?weird                 1
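
Conceptually, the num_label_issues column tallies how many issues involve each distinct token string. A simplified pure-Python sketch of that counting logic (not the library's actual implementation) is:

```python
from collections import Counter

issues = [(2, 0), (0, 1)]
tokens = [
    ["A", "?weird", "sentence"],
    ["A", "valid", "sentence"],
    ["An", "sentence", "with", "a", "typo"],
]

# Map each issue (i, j) to its token string, then count occurrences.
counts = Counter(tokens[i][j] for i, j in issues)

# Tokens ordered by number of label issues, most frequent first:
most_common = counts.most_common()
print(most_common)  # [('An', 1), ('?weird', 1)]
```

This matches the ordering of the example DataFrame above: each row is one token, sorted by how many label issues involve it.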
cleanlab.token_classification.summary.filter_by_token(token, issues, tokens)

Return subset of label issues involving a particular token.

Parameters:
  • token (str) – A specific token (word) of interest.

  • issues (List[Tuple[int, int]]) – List of tuples (i, j) representing a label issue for the j-th token of the i-th sentence, in the same format as output by token_classification.filter.find_label_issues or token_classification.rank.issues_from_scores.

  • tokens (List[List[str]]) – Nested list such that tokens[i] is a list of tokens (strings/words) that comprise the i-th sentence.

Return type:

List[Tuple[int, int]]

Returns:

issues_subset – List of tuples (i, j) representing a label issue for the j-th token of the i-th sentence, in the same format as issues, but restricted to only those issues that involve the specified token.

Examples

>>> from cleanlab.token_classification.summary import filter_by_token
>>> token = "?weird"
>>> issues = [(2, 0), (0, 1)]
>>> tokens = [
...     ["A", "?weird", "sentence"],
...     ["A", "valid", "sentence"],
...     ["An", "sentence", "with", "a", "typo"],
... ]
>>> filter_by_token(token, issues, tokens)
[(0, 1)]
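
The filtering above is conceptually equivalent to a simple comprehension over the issues (a sketch of the logic, not necessarily the library's implementation):

```python
token = "?weird"
issues = [(2, 0), (0, 1)]
tokens = [
    ["A", "?weird", "sentence"],
    ["A", "valid", "sentence"],
    ["An", "sentence", "with", "a", "typo"],
]

# Keep only the issues whose (sentence, token) position holds the given token.
issues_subset = [(i, j) for i, j in issues if tokens[i][j] == token]
print(issues_subset)  # [(0, 1)]
```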