summary#
Methods to display sentences and their label issues in a token classification dataset (text data), as well as summarize the types of issues identified.
Functions:

- `display_issues` – Display token classification label issues, showing each sentence with its problematic token(s) highlighted.
- `common_label_issues` – Display the tokens (words) that most commonly have label issues.
- `filter_by_token` – Return subset of label issues involving a particular token.
cleanlab.token_classification.summary.display_issues(issues, tokens, *, labels=None, pred_probs=None, exclude=[], class_names=None, top=20)[source]#

Display token classification label issues, showing each sentence with its problematic token(s) highlighted.

Can also show the given and predicted label for each token identified to have a label issue.

Parameters:

- issues (`list`) – List of tuples `(i, j)` representing a label issue for the `j`-th token of the `i`-th sentence. Same format as output by `token_classification.filter.find_label_issues` or `token_classification.rank.issues_from_scores`.
- tokens (`List[List[str]]`) – Nested list such that `tokens[i]` is a list of tokens (strings/words) that comprise the `i`-th sentence.
- labels (`Optional[list]`) – Optional nested list of given labels for all tokens, such that `labels[i]` is a list of labels, one for each token in the `i`-th sentence. For a dataset with K classes, each label must be in 0, 1, ..., K-1. If `labels` is provided, this function also displays the given label of each token identified with an issue.
- pred_probs (`Optional[list]`) – Optional list of np arrays, such that `pred_probs[i]` has shape `(T, K)` if the `i`-th sentence contains T tokens. Each row of `pred_probs[i]` corresponds to a token `t` in the `i`-th sentence and contains the model-predicted probabilities that `t` belongs to each of the K possible classes. Columns of each `pred_probs[i]` should be ordered such that the probabilities correspond to class 0, 1, ..., K-1. If `pred_probs` is provided, this function also displays the predicted label of each token identified with an issue.
- exclude (`List[Tuple[int, int]]`) – Optional list of given/predicted label swaps (tuples) to be ignored. For example, if `exclude=[(0, 1), (1, 0)]`, tokens whose label was likely swapped between class 0 and 1 are not displayed. Class labels must be in 0, 1, ..., K-1.
- class_names (`Optional[List[str]]`) – Optional length-K list of names of each class, such that `class_names[i]` is the string name of the class corresponding to `labels` with value `i`. If `class_names` is provided, these string names are displayed for predicted and given labels; otherwise the integer index of each class is displayed.
- top (`int`, default `20`) – Maximum number of issues to be printed.
 
Examples

```
>>> from cleanlab.token_classification.summary import display_issues
>>> issues = [(2, 0), (0, 1)]
>>> tokens = [
...     ["A", "?weird", "sentence"],
...     ["A", "valid", "sentence"],
...     ["An", "sentence", "with", "a", "typo"],
... ]
>>> display_issues(issues, tokens)
Sentence 2, token 0:
----
An sentence with a typo
...
...
Sentence 0, token 1:
----
A ?weird sentence
```

Return type: `None`
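The per-sentence `pred_probs` format can be unintuitive, so here is a minimal sketch of assembling `labels` and `pred_probs` in the shape `display_issues` expects. It uses only numpy; the class count K=2, the probability values, and the class names are made up for illustration.

```python
import numpy as np

tokens = [["A", "?weird", "sentence"]]  # one sentence with T = 3 tokens
labels = [[0, 0, 0]]                    # one given label per token, values in 0..K-1
K = 2                                   # hypothetical number of classes

# pred_probs[i] must be a (T, K) array whose rows sum to 1,
# with columns ordered as class 0, 1, ..., K-1.
pred_probs = [np.array([[0.9, 0.1],
                        [0.3, 0.7],  # model disagrees with the given label 0 here
                        [0.8, 0.2]])]

assert pred_probs[0].shape == (len(tokens[0]), K)
assert np.allclose(pred_probs[0].sum(axis=1), 1.0)

# These could then be passed along to the function, e.g.:
# display_issues(issues, tokens, labels=labels, pred_probs=pred_probs,
#                class_names=["O", "ENTITY"])  # class_names are hypothetical
```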
 
cleanlab.token_classification.summary.common_label_issues(issues, tokens, *, labels=None, pred_probs=None, class_names=None, top=10, exclude=[], verbose=True)[source]#

Display the tokens (words) that most commonly have label issues.

These may correspond to words that are ambiguous or systematically misunderstood by the data annotators.

Parameters:

- issues (`List[Tuple[int, int]]`) – List of tuples `(i, j)` representing a label issue for the `j`-th token of the `i`-th sentence. Same format as output by `token_classification.filter.find_label_issues` or `token_classification.rank.issues_from_scores`.
- tokens (`List[List[str]]`) – Nested list such that `tokens[i]` is a list of tokens (strings/words) that comprise the `i`-th sentence.
- labels (`Optional[list]`) – Optional nested list of given labels for all tokens, in the same format as `labels` for `token_classification.summary.display_issues`. If `labels` is provided, this function also displays the given label of each token identified to commonly suffer from label issues.
- pred_probs (`Optional[list]`) – Optional list of model-predicted probabilities (np arrays), in the same format as `pred_probs` for `token_classification.summary.display_issues`. If both `labels` and `pred_probs` are provided, also reports each type of given/predicted label swap for tokens identified to commonly suffer from label issues.
- class_names (`Optional[List[str]]`) – Optional length-K list of names of each class, such that `class_names[i]` is the string name of the class corresponding to `labels` with value `i`. If `class_names` is provided, these string names are displayed for predicted and given labels; otherwise the integer index of each class is displayed.
- top (`int`) – Maximum number of tokens to print information for.
- exclude (`List[Tuple[int, int]]`) – Optional list of given/predicted label swaps (tuples) to be ignored, in the same format as `exclude` for `token_classification.summary.display_issues`.
- verbose (`bool`) – Whether to also print out the token information in the returned DataFrame `df`.
 
Return type: `DataFrame`

Returns:

df – If both `labels` and `pred_probs` are provided, DataFrame `df` contains columns `['token', 'given_label', 'predicted_label', 'num_label_issues']`, and each row contains information for a specific token and given/predicted label swap, ordered by the number of label issues inferred for this type of label swap. Otherwise, `df` only has columns `['token', 'num_label_issues']`, and each row contains the information for a specific token, ordered by the total number of label issues involving this token.
Examples

```
>>> from cleanlab.token_classification.summary import common_label_issues
>>> issues = [(2, 0), (0, 1)]
>>> tokens = [
...     ["A", "?weird", "sentence"],
...     ["A", "valid", "sentence"],
...     ["An", "sentence", "with", "a", "typo"],
... ]
>>> df = common_label_issues(issues, tokens)
>>> df
    token  num_label_issues
0      An                 1
1  ?weird                 1
```
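When `labels` and `pred_probs` are omitted, the reported counts are simply per-token issue totals. A rough stdlib-only equivalent of that counting (a hypothetical re-implementation for illustration, not cleanlab's actual code) is:

```python
from collections import Counter

issues = [(2, 0), (0, 1)]
tokens = [
    ["A", "?weird", "sentence"],
    ["A", "valid", "sentence"],
    ["An", "sentence", "with", "a", "typo"],
]

# Map each issue (i, j) to its token string and tally occurrences per token.
counts = Counter(tokens[i][j] for i, j in issues)
print(counts.most_common())  # [('An', 1), ('?weird', 1)]
```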
cleanlab.token_classification.summary.filter_by_token(token, issues, tokens)[source]#

Return subset of label issues involving a particular token.

Parameters:

- token (`str`) – A specific token you are interested in.
- issues (`List[Tuple[int, int]]`) – List of tuples `(i, j)` representing a label issue for the `j`-th token of the `i`-th sentence. Same format as output by `token_classification.filter.find_label_issues` or `token_classification.rank.issues_from_scores`.
- tokens (`List[List[str]]`) – Nested list such that `tokens[i]` is a list of tokens (strings/words) that comprise the `i`-th sentence.
 
Return type: `List[Tuple[int, int]]`

Returns:

issues_subset – List of tuples `(i, j)` representing a label issue for the `j`-th token of the `i`-th sentence, in the same format as `issues`, but restricted to only those issues that involve the specified `token`.
Examples

```
>>> from cleanlab.token_classification.summary import filter_by_token
>>> token = "?weird"
>>> issues = [(2, 0), (0, 1)]
>>> tokens = [
...     ["A", "?weird", "sentence"],
...     ["A", "valid", "sentence"],
...     ["An", "sentence", "with", "a", "typo"],
... ]
>>> filter_by_token(token, issues, tokens)
[(0, 1)]
```
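The documented behavior amounts to keeping only the issue tuples whose position holds the token of interest. A hypothetical one-line equivalent (for illustration under that reading; not cleanlab's implementation):

```python
issues = [(2, 0), (0, 1)]
tokens = [
    ["A", "?weird", "sentence"],
    ["A", "valid", "sentence"],
    ["An", "sentence", "with", "a", "typo"],
]

def filter_issues_for(token, issues, tokens):
    # Keep (i, j) only if the j-th token of the i-th sentence equals `token`.
    return [(i, j) for i, j in issues if tokens[i][j] == token]

print(filter_issues_for("?weird", issues, tokens))  # [(0, 1)]
```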