# filter
Methods to identify which examples have label issues in a classification dataset.
The documentation below assumes a dataset with N examples and K classes.
This module is for standard (multi-class) classification where each example is labeled as belonging to exactly one of K classes (e.g. labels = np.array([0,0,1,0,2,1])).
Some methods here also work for multi-label classification data where each example can be labeled as belonging to multiple classes (e.g. labels = [[1,2],[1],[0],[],...]),
but we encourage using the methods in the cleanlab.multilabel_classification module instead for such data.
Data:

- `cleanlab.filter.pred_probs_by_class`: `Dict[int, ndarray]`
- `cleanlab.filter.prune_count_matrix_cols`: `Dict[int, ndarray]`

Functions:

- `find_label_issues`: Identifies potentially bad labels in a classification dataset using confident learning.
- `find_predicted_neq_given`: A simple baseline approach that considers `argmax(pred_probs) != labels` as the examples with label issues.
- `find_label_issues_using_argmax_confusion_matrix`: A baseline approach that uses the confusion matrix of `argmax(pred_probs)` and labels as the confident joint.
- cleanlab.filter.find_label_issues(labels, pred_probs, *, return_indices_ranked_by=None, rank_by_kwargs=None, filter_by='prune_by_noise_rate', frac_noise=1.0, num_to_remove_per_class=None, min_examples_per_class=1, confident_joint=None, n_jobs=None, verbose=False, multi_label=False)[source]
- Identifies potentially bad labels in a classification dataset using confident learning.

  Returns a boolean mask for the entire dataset where `True` represents an example identified with a label issue and `False` represents an example that seems correctly labeled.

  Instead of a mask, you can obtain indices of the examples with label issues in your dataset (sorted by issue severity) by specifying the `return_indices_ranked_by` argument. This determines which label quality score is used to quantify severity, and is useful to view only the top-`J` most severe issues in your dataset.

  The number of indices returned as issues is controlled by `frac_noise`: reduce its value to identify fewer label issues. If you aren't sure, leave this set to 1.0.

  Tip: if you encounter the error "pred_probs is not defined", try setting `n_jobs=1`.

  Parameters:
- labels (`np.ndarray` or `list`) – A discrete vector of noisy labels for a classification dataset, i.e. some labels may be erroneous. Format requirements: for a dataset with K classes, each label must be an integer in 0, 1, …, K-1. For a standard (multi-class) classification dataset where each example is labeled with one class, `labels` should be a 1D array of shape `(N,)`, for example: `labels = [1,0,2,1,1,0...]`.
- pred_probs (`np.ndarray`, optional) – An array of shape `(N, K)` of model-predicted class probabilities, `P(label=k|x)`. Each row of this matrix corresponds to an example `x` and contains the model-predicted probabilities that `x` belongs to each possible class, for each of the K classes. The columns must be ordered such that these probabilities correspond to class 0, 1, …, K-1.

  Note: Returned label issues are most accurate when they are computed based on out-of-sample `pred_probs` from your model. To obtain out-of-sample predicted probabilities for every datapoint in your dataset, you can use cross-validation. This is encouraged to get better results.
- return_indices_ranked_by (`{None, 'self_confidence', 'normalized_margin', 'confidence_weighted_entropy'}`, default `None`) – Determines what is returned by this method: either a boolean mask or an array of indices. If `None`, this function returns a boolean mask (`True` if the example at that index has a label issue). If not `None`, this function instead returns a sorted `np.ndarray` of indices of examples with label issues. Indices are sorted by a label quality score, which can be one of:

  - `'normalized_margin'`: normalized margin `(p(label = k) - max(p(label != k)))`
  - `'self_confidence'`: `[pred_probs[i][labels[i]] for i in label_issues_idx]`
  - `'confidence_weighted_entropy'`: `entropy(pred_probs) / self_confidence`

- rank_by_kwargs (`dict`, optional) – Optional keyword arguments to pass into scoring functions for ranking by label quality score (see `rank.get_label_quality_scores`).
- filter_by (`{'prune_by_class', 'prune_by_noise_rate', 'both', 'confident_learning', 'predicted_neq_given', 'low_normalized_margin', 'low_self_confidence'}`, default `'prune_by_noise_rate'`) – Method to determine which examples are flagged as having a label issue, so you can filter/prune them from the dataset. Options:

  - `'prune_by_noise_rate'`: filters examples with high probability of being mislabeled for every non-diagonal entry in the confident joint (see `prune_counts_matrix` in `filter.py`). These are the examples where (with high confidence) the given label is unlikely to match the predicted label for the example.
  - `'prune_by_class'`: filters the examples with smallest probability of belonging to their given class label, for every class.
  - `'both'`: filters only those examples that would be filtered by both `'prune_by_noise_rate'` and `'prune_by_class'`.
  - `'confident_learning'`: filters the examples counted as part of the off-diagonals of the confident joint. These are the examples that are confidently predicted to be a different label than their given label.
  - `'predicted_neq_given'`: filters examples for which the predicted class (i.e. argmax of the predicted probabilities) does not match the given label.
  - `'low_normalized_margin'`: filters the examples with smallest normalized-margin label quality score. The number of issues returned matches `count.num_label_issues`.
  - `'low_self_confidence'`: filters the examples with smallest self-confidence label quality score. The number of issues returned matches `count.num_label_issues`.

- frac_noise (`float`, default `1.0`) – Used to only return the "top" `frac_noise * num_label_issues` estimated label issues. The choice of which "top" label issues to return depends on the `filter_by` method used. It works by reducing the size of the off-diagonals of the `joint` distribution of given labels and true labels proportionally by `frac_noise` prior to estimating label issues with each method, so each class k returns at most `frac_noise * number_of_mislabeled_examples_in_class_k` issues. This parameter only applies to the `filter_by='both'`, `filter_by='prune_by_class'`, and `filter_by='prune_by_noise_rate'` methods and is currently unused by other methods. When `frac_noise=1.0`, all "confident" estimated noise indices are returned (recommended).
- num_to_remove_per_class (`array_like`) – An iterable of length K, the number of classes. E.g. if K = 3, `num_to_remove_per_class=[5, 0, 1]` would return the indices of the 5 most likely mislabeled examples in class 0, and the most likely mislabeled example in class 2.

  Note: Only set this parameter if `filter_by='prune_by_class'`. You may use it with `filter_by='prune_by_noise_rate'`, but if `num_to_remove_per_class=k`, then either k-1, k, or k+1 examples may be removed for any class due to rounding error. If you need exactly k examples removed from every class, you should use `filter_by='prune_by_class'`.

- min_examples_per_class (`int`, default `1`) – Minimum number of examples per class to avoid flagging as label issues. This is useful to avoid deleting too much data from one class when pruning noisy examples in datasets with rare classes.

- confident_joint (`np.ndarray`, optional) – An array of shape `(K, K)` representing the confident joint, the matrix used for identifying label issues, which estimates a confident subset of the joint distribution of the noisy and true labels, `P_{noisy label, true label}`. Entry `(j, k)` in the matrix is the number of examples confidently counted into the pair of `(noisy label=j, true label=k)` classes. The `confident_joint` can be computed using `count.compute_confident_joint`. If not provided, it is computed from the given (noisy) `labels` and `pred_probs`.

- n_jobs (`int`, optional) – Number of processing threads used by multiprocessing. Default `None` sets it to the number of cores on your CPU (physical cores if you have the `psutil` package installed, otherwise logical cores). Set this to 1 to disable parallel processing (if it's causing issues). Windows users may see a speed-up with `n_jobs=1`.

- verbose (`bool`, optional) – If `True`, prints when multiprocessing happens.
 
- Return type:
- ndarray
- Returns:
- label_issues (`np.ndarray`) – If `return_indices_ranked_by` is left unspecified, returns a boolean mask for the entire dataset where `True` represents a label issue and `False` represents an example that is accurately labeled with high confidence. If `return_indices_ranked_by` is specified, returns a shorter array of the indices of examples identified to have label issues (i.e. those indices where the mask would be `True`), sorted by the likelihood that the corresponding label is correct.

  Note: Obtain the indices of examples with label issues in your dataset by setting `return_indices_ranked_by`.
 
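As an illustration of the `'self_confidence'` quality score described above, here is a minimal numpy-only sketch of how examples get ranked when `return_indices_ranked_by='self_confidence'` is used. The toy `labels` and `pred_probs` values are invented for illustration, and this sketch only shows the ranking step, not how cleanlab decides which examples count as issues:

```python
import numpy as np

# Toy dataset: N = 6 examples, K = 3 classes (values invented for illustration).
labels = np.array([0, 0, 1, 0, 2, 1])
pred_probs = np.array([
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.05, 0.05, 0.9],  # given label 1, but the model is confident in class 2
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.7, 0.1],
])

# Self-confidence score: the model-predicted probability of each example's
# *given* label. Low values suggest the given label may be wrong.
self_confidence = pred_probs[np.arange(len(labels)), labels]

# Rank examples from most to least suspicious (lowest score first).
ranked = np.argsort(self_confidence)
# ranked[0] == 2: the example whose given label the model doubts most.
```

In the real API, passing `return_indices_ranked_by='self_confidence'` to `find_label_issues` applies this ordering to the subset of examples flagged as issues.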
- cleanlab.filter.find_predicted_neq_given(labels, pred_probs, *, multi_label=False)[source]
- A simple baseline approach that considers `argmax(pred_probs) != labels` as the examples with label issues.

  Parameters:
- labels (`np.ndarray` or `list`) – Labels in the same format expected by the `find_label_issues` function.
- pred_probs (`np.ndarray`) – Predicted probabilities in the same format expected by the `find_label_issues` function.
- multi_label (`bool`, optional) – Whether each example may have multiple labels or not (see documentation for the `find_label_issues` function).
 
- Return type:
- ndarray
- Returns:
- label_issues_mask (`np.ndarray`) – A boolean mask for the entire dataset where `True` represents a label issue and `False` represents an example that is accurately labeled with high confidence.
 
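The `argmax(pred_probs) != labels` rule is simple enough to sketch directly in numpy. The toy data below is invented for illustration, and this sketch omits the `multi_label` case that the real function also handles:

```python
import numpy as np

# Toy dataset (values invented for illustration).
labels = np.array([0, 0, 1, 0, 2, 1])
pred_probs = np.array([
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.05, 0.05, 0.9],  # given label 1, but the model predicts class 2
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.7, 0.1],
])

# Flag every example whose predicted class disagrees with its given label.
label_issues_mask = np.argmax(pred_probs, axis=1) != labels
# Only example 2 is flagged for this toy data.
```

Because it ignores how confident the disagreement is, this baseline tends to flag more examples than confident learning does.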
- cleanlab.filter.find_label_issues_using_argmax_confusion_matrix(labels, pred_probs, *, calibrate=True, filter_by='prune_by_noise_rate')[source]
- A baseline approach that uses the confusion matrix of `argmax(pred_probs)` and labels as the confident joint, and then uses cleanlab (confident learning) to find the label issues using this matrix.

  The only difference between this and `find_label_issues` is that it uses the confusion matrix based on the argmax and given label instead of using the confident joint from `count.compute_confident_joint`.

  Parameters:
- labels (`np.ndarray`) – An array of shape `(N,)` of noisy labels, i.e. some labels may be erroneous. Elements must be in the set 0, 1, …, K-1, where K is the number of classes.
- pred_probs (`np.ndarray`) – An array of shape `(N, K)` of model-predicted probabilities, `P(label=k|x)`. Each row of this matrix corresponds to an example `x` and contains the model-predicted probabilities that `x` belongs to each possible class, for each of the K classes. The columns must be ordered such that these probabilities correspond to class 0, 1, …, K-1. `pred_probs` should have been computed using 3 (or higher) fold cross-validation.
- calibrate (`bool`, default `True`) – Set to `True` to calibrate the confusion matrix created by `pred != given labels`. This calibration adjusts the confusion matrix / confident joint so that the prior (given noisy labels) is correct based on the original labels.
- filter_by (`str`, default `'prune_by_noise_rate'`) – See the `filter_by` argument of `find_label_issues`.
 
- Return type:
- ndarray
- Returns:
- label_issues_mask (`np.ndarray`) – A boolean mask for the entire dataset where `True` represents a label issue and `False` represents an example that is accurately labeled with high confidence.
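To clarify how a plain confusion matrix stands in for the confident joint here, the following numpy-only sketch builds that `(K, K)` matrix from invented toy data. It omits the optional calibration step and the subsequent confident-learning filtering that the real function performs:

```python
import numpy as np

K = 3  # number of classes (toy value)
labels = np.array([0, 0, 1, 0, 2, 1])  # given (possibly noisy) labels
preds = np.array([0, 0, 2, 0, 2, 1])   # argmax(pred_probs) for each example

# C[j, k] counts examples with given label j that the model predicts as class k.
# This confusion matrix is used in place of the estimate that
# count.compute_confident_joint would otherwise provide.
C = np.zeros((K, K), dtype=int)
np.add.at(C, (labels, preds), 1)

# Off-diagonal mass corresponds to suspected label issues
# (here, the single example with given label 1 but predicted class 2).
num_suspected = C.sum() - np.trace(C)
```

Passing `calibrate=True` (the default) then rescales this matrix so its row sums match the observed frequency of each given label before the filtering step runs.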