filter

Methods to identify which examples have label issues.

Functions:

find_label_issues(labels, pred_probs, *[, ...])

Identifies potential label issues in the dataset using confident learning.

find_label_issues_using_argmax_confusion_matrix(...)

A baseline approach that uses the confusion matrix of argmax(pred_probs) and labels as the confident joint, then applies confident learning to find the label issues from that matrix.

find_predicted_neq_given(labels, pred_probs, *)

A simple baseline approach that treats argmax(pred_probs) != labels as a label error.

cleanlab.filter.find_label_issues(labels, pred_probs, *, confident_joint=None, filter_by='prune_by_noise_rate', return_indices_ranked_by=None, rank_by_kwargs={}, multi_label=False, frac_noise=1.0, num_to_remove_per_class=None, min_examples_per_class=1, n_jobs=None, verbose=False)

Identifies potential label issues in the dataset using confident learning.

Returns a boolean mask for the entire dataset where True represents a label issue and False represents an example that is confidently/accurately labeled.

Instead of a mask, you can obtain indices of the label issues in your dataset by setting return_indices_ranked_by to specify the label quality score used to order the label issues.

The number of indices returned is controlled by frac_noise: reduce its value to identify fewer label issues. If you aren’t sure, leave this set to 1.0.

Tip: if you encounter the error “pred_probs is not defined”, try setting n_jobs=1.

Parameters
  • labels (np.array) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirement: for a dataset with K classes, labels must take values in 0, 1, …, K-1.

  • pred_probs (np.array, optional) –

    An array of shape (N, K) of model-predicted probabilities, P(label=k|x). Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class, for each of the K classes. The columns must be ordered such that these probabilities correspond to class 0, 1, …, K-1.

    Caution: pred_probs from your model must be out-of-sample! You should never provide predictions on the same examples used to train the model, as these will be overfit and unsuitable for finding label errors. To obtain out-of-sample predicted probabilities for every datapoint in your dataset, you can use cross-validation (see the usage sketch after this function's documentation). Alternatively, it is fine if your model was trained on a separate dataset and you are only evaluating data that was previously held out.

  • confident_joint (np.array, optional) – An array of shape (K, K) representing the confident joint, the matrix used for identifying label issues, which estimates a confident subset of the joint distribution of the noisy and true labels, P_{noisy label, true label}. Entry (j, k) in the matrix is the number of examples confidently counted into the pair of (noisy label=j, true label=k) classes. The confident_joint can be computed using count.compute_confident_joint. If not provided, it is computed from the given (noisy) labels and pred_probs.

  • filter_by ({'prune_by_class', 'prune_by_noise_rate', 'both', 'confident_learning', 'predicted_neq_given'}, default 'prune_by_noise_rate') –

    Method used for filtering/pruning out the label issues:

    • 'prune_by_noise_rate': removes examples with a high probability of being mislabeled for every non-diagonal entry of the confident joint (see prune_counts_matrix in filter.py). These are the examples where, with high confidence, the given label is unlikely to match the predicted label.

    • 'prune_by_class': for every class, removes the examples with the smallest probability of belonging to their given class label.

    • 'both': removes only the examples that would be filtered by both 'prune_by_noise_rate' and 'prune_by_class'.

    • 'confident_learning': returns the examples in the off-diagonals of the confident joint, i.e. the examples that are confidently predicted to have a different label than their given label.

    • 'predicted_neq_given': flags examples where the predicted class (i.e. the argmax of the predicted probabilities) does not match the given label.

  • return_indices_ranked_by ({None, 'self_confidence', 'normalized_margin', 'confidence_weighted_entropy'}, default None) –

    If None, returns a boolean mask (True if the example at that index has a label issue). If not None, returns an array of the label issue indices (instead of a boolean mask), ordered by the specified label quality score:

    • 'normalized_margin': normalized margin (p(label = k) - max(p(label != k)))

    • 'self_confidence': [pred_probs[i][labels[i]] for i in label_issues_idx]

    • 'confidence_weighted_entropy': entropy(pred_probs) / self_confidence

  • rank_by_kwargs (dict, optional) – Optional keyword arguments to pass into scoring functions for ranking by label quality score (see rank.get_label_quality_scores).

  • multi_label (bool, optional) – If True, labels should be an iterable (e.g. list) of iterables, containing a list of labels for each example, instead of just a single label. The multi-label setting supports classification tasks where an example has 1 or more labels. Example of a multi-labeled labels input: [[0,1], [1], [0,2], [0,1,2], [0], [1], ...].

  • frac_noise (float, default 1.0) –

    Used to return only the “top” frac_noise * num_label_issues label issues. Which “top” issues are returned depends on the filter_by method used. It works by reducing the size of the off-diagonals of the joint distribution of given labels and true labels proportionally by frac_noise prior to estimating label issues with each method. This parameter only applies to the filter_by='both', filter_by='prune_by_class', and filter_by='prune_by_noise_rate' methods and is currently unused by the other methods. When frac_noise=1.0, all “confident” estimated noise indices are returned (recommended).

    For each class k, at most frac_noise * number_of_mislabeled_examples_in_class_k issues are returned.

  • num_to_remove_per_class (array_like) –

    An iterable of length K, the number of classes. E.g. if K = 3, num_to_remove_per_class=[5, 0, 1] would return the indices of the 5 most likely mislabeled examples in class 0, and the most likely mislabeled example in class 2.

    Note

    Only set this parameter if filter_by='prune_by_class'. You may also use it with filter_by='prune_by_noise_rate', but then for a requested count k, either k-1, k, or k+1 examples may be removed for any class due to rounding error. If you need exactly k examples removed from every class, use filter_by='prune_by_class'.

  • min_examples_per_class (int, default 1) – Minimum number of examples per class to avoid flagging as label issues. This is useful to avoid deleting too much data from one class when pruning noisy examples in datasets with rare classes.

  • n_jobs (optional) – Number of worker processes used by multiprocessing. The default None uses the number of cores on your CPU. Set this to 1 to disable parallel processing (if it's causing issues). Windows users may see a speed-up with n_jobs=1.

  • verbose (optional) – If True, prints when multiprocessing happens.

Returns

label_issues – A boolean mask for the entire dataset where True represents a label issue and False represents an example that is accurately labeled with high confidence.

Note

You can also return the indices of the label issues in your dataset by setting return_indices_ranked_by.

Return type

np.array
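A minimal usage sketch. The dataset X, the noisy labels, and the LogisticRegression model are hypothetical placeholders; any scikit-learn compatible classifier that implements predict_proba works the same way:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    from cleanlab.filter import find_label_issues
    from cleanlab.count import compute_confident_joint

    # Hypothetical dataset: 100 examples, 5 features, K=3 classes.
    X = np.random.rand(100, 5)
    labels = np.random.randint(0, 3, size=100)  # noisy labels

    # Out-of-sample predicted probabilities via cross-validation, so no
    # example is scored by a model that was trained on it.
    pred_probs = cross_val_predict(
        LogisticRegression(), X, labels, cv=5, method="predict_proba"
    )

    # Boolean mask over the dataset: True marks a likely label issue.
    issue_mask = find_label_issues(labels, pred_probs)

    # Indices of issues instead of a mask, worst labels first.
    issue_indices = find_label_issues(
        labels, pred_probs, return_indices_ranked_by="self_confidence"
    )

    # Optionally, precompute the confident joint and pass it in directly.
    confident_joint = compute_confident_joint(labels, pred_probs)
    issue_mask = find_label_issues(
        labels, pred_probs, confident_joint=confident_joint
    )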

cleanlab.filter.find_label_issues_using_argmax_confusion_matrix(labels, pred_probs, *, calibrate=True, filter_by='prune_by_noise_rate')

A baseline approach that uses the confusion matrix of argmax(pred_probs) and labels as the confident joint, then applies confident learning to find the label issues from that matrix.

The only difference from find_label_issues is that the confident joint here is the confusion matrix of argmax(pred_probs) and the given labels, rather than the confident joint computed by count.compute_confident_joint.

Parameters
  • labels (np.array) – An array of shape (N,) of noisy labels, i.e. some labels may be erroneous. Elements must be in the set 0, 1, …, K-1, where K is the number of classes.

  • pred_probs (np.array (shape (N, K))) – An array of shape (N, K) of model-predicted probabilities, P(label=k|x). Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class, for each of the K classes. The columns must be ordered such that these probabilities correspond to class 0, 1, …, K-1. pred_probs should have been computed using cross-validation with at least 3 folds.

  • calibrate (bool, default True) – Set to True to calibrate the confusion matrix formed from the argmax predictions and the given labels. This calibration adjusts the confusion matrix / confident joint so that the prior (the distribution of the given noisy labels) remains correct.

  • filter_by (str, default 'prune_by_noise_rate') – See filter_by argument of find_label_issues.

Returns

label_issues_mask – A boolean mask for the entire dataset where True represents a label issue and False represents an example that is accurately labeled with high confidence.

Return type

np.array
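A minimal sketch, assuming labels and out-of-sample pred_probs were prepared as in the find_label_issues example above:

    from cleanlab.filter import find_label_issues_using_argmax_confusion_matrix

    # Uses the (optionally calibrated) argmax confusion matrix as the
    # confident joint, then finds issues via confident learning.
    issue_mask = find_label_issues_using_argmax_confusion_matrix(
        labels, pred_probs, calibrate=True
    )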

cleanlab.filter.find_predicted_neq_given(labels, pred_probs, *, multi_label=False)

A simple baseline approach that treats argmax(pred_probs) != labels as a label error.

Parameters
  • labels (np.array) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirement: for a dataset with K classes, labels must take values in 0, 1, …, K-1.

  • pred_probs (np.array, optional) – An array of shape (N, K) of model-predicted probabilities, P(label=k|x). Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class, for each of the K classes. The columns must be ordered such that these probabilities correspond to class 0, 1, …, K-1.

  • multi_label (bool, optional) – If True, labels should be an iterable (e.g. list) of iterables, containing a list of labels for each example, instead of just a single label. The multi-label setting supports classification tasks where an example has 1 or more labels. Example of a multi-labeled labels input: [[0,1], [1], [0,2], [0,1,2], [0], [1], ...].

Returns

label_issues_mask – A boolean mask for the entire dataset where True represents a label issue and False represents an example that is accurately labeled with high confidence.

Return type

np.array
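A minimal sketch, assuming labels and pred_probs as above:

    import numpy as np
    from cleanlab.filter import find_predicted_neq_given

    issue_mask = find_predicted_neq_given(labels, pred_probs)

    # Per the description above, for single-label data this should agree
    # with directly comparing argmax predictions to the given labels:
    baseline_mask = np.argmax(pred_probs, axis=1) != labels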