count

Methods for estimating latent structures used for confident learning, including:

  • Latent prior of the unobserved, error-free true labels: py: p(y)

  • Latent noisy channel (noise matrix) characterizing the flipping rates: nm: P(given label | true label)

  • Latent inverse noise matrix characterizing the flipping process: inv: P(true label | given label)

  • Latent confident_joint, an unnormalized matrix that counts the confident subset of label errors under the joint distribution of true and given labels

Functions:

calibrate_confident_joint(confident_joint, ...)

Calibrates any confident joint estimate P(label=i, true_label=j) such that np.sum(cj) == len(labels) and np.sum(cj, axis = 1) == np.bincount(labels).

compute_confident_joint(labels, pred_probs, *)

Estimates the confident counts of latent true vs observed noisy labels for the examples in our dataset.

estimate_confident_joint_and_cv_pred_proba(X, ...)

Estimates P(labels, y), the confident counts of the latent joint distribution of true and noisy labels using observed labels and predicted probabilities pred_probs.

estimate_cv_predicted_probabilities(X, labels)

This function computes the out-of-sample predicted probability [P(label=k|x)] for every example in X using cross validation.

estimate_joint(labels, pred_probs, *[, ...])

Estimates the joint distribution of label noise P(label=i, true_label=j), guaranteed to sum to 1 and to satisfy np.sum(joint_estimate, axis = 1) == p(labels).

estimate_latent(confident_joint, labels, *)

Computes the latent prior p(y), the noise matrix P(labels|y) and the inverse noise matrix P(y|labels) from the confident_joint count(labels, y).

estimate_noise_matrices(X, labels[, clf, ...])

Estimates the noise_matrix of shape (K, K).

estimate_py_and_noise_matrices_from_probabilities(...)

Computes the confident counts estimate of latent variables py and the noise rates using observed labels and predicted probabilities, pred_probs.

estimate_py_noise_matrices_and_cv_pred_proba(X, ...)

This function computes the out-of-sample predicted probability P(label=k|x) for every example x in X using cross validation, while also computing the confident counts noise rates within each cross-validated subset, and returns the average noise rate across all examples.

get_confident_thresholds(labels, pred_probs)

Returns expected (average) "self-confidence" for each class.

num_label_issues(labels, pred_probs[, ...])

Estimates the number of label issues in the labels of a dataset.

cleanlab.count.calibrate_confident_joint(confident_joint, labels, *, multi_label=False)

Calibrates any confident joint estimate P(label=i, true_label=j) such that np.sum(cj) == len(labels) and np.sum(cj, axis = 1) == np.bincount(labels).

In other words, this function forces the confident joint to have the true noisy prior p(labels) (summed over columns for each row) and also forces the confident joint to add up to the total number of examples.

This method makes the confident joint a valid counts estimate of the actual joint of noisy and true labels.

Parameters:
  • confident_joint (np.ndarray) – An array of shape (K, K) representing the confident joint, the matrix used for identifying label issues, which estimates a confident subset of the joint distribution of the noisy and true labels, P_{noisy label, true label}. Entry (j, k) in the matrix is the number of examples confidently counted into the pair of (noisy label=j, true label=k) classes. The confident_joint can be computed using count.compute_confident_joint. If not provided, it is computed from the given (noisy) labels and pred_probs.

  • labels (np.ndarray) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirements: for a dataset with K classes, labels must be in 0, 1, …, K-1. All K classes (0, 1, …, K-1) MUST be present in labels, such that len(set(labels)) == pred_probs.shape[1] for standard multi-class classification with single-labeled data (e.g. labels = [1,0,2,1,1,0...]). For multi-label classification where each example can belong to multiple classes (e.g. labels = [[1,2],[1],[0],...]), your labels should instead satisfy len(set(k for l in labels for k in l)) == pred_probs.shape[1].

  • multi_label (bool, optional) – If True, labels should be an iterable (e.g. list) of iterables, containing a list of labels for each example, instead of just a single label. The multi-label setting supports classification tasks where an example has 1 or more labels. Example of a multi-labeled labels input: [[0,1], [1], [0,2], [0,1,2], [0], [1], ...]. The major difference in how this is calibrated versus single-label is that the total number of errors considered is based on the number of labels, not the number of examples. So, the calibrated confident_joint will sum to the number of total labels.

Return type:

ndarray

Returns:

calibrated_cj (np.ndarray) – An array of shape (K, K) of type float representing a valid estimate of the joint counts of noisy and true labels.
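
For example, here is a minimal sketch of the calibration invariants, using a toy uncalibrated confident joint invented for illustration:

import numpy as np
from cleanlab.count import calibrate_confident_joint

# Toy uncalibrated confident joint for K=2 classes (values invented for illustration):
# only 5 of the 6 examples were confidently counted.
labels = np.array([0, 1, 1, 0, 1, 1])
cj = np.array([[2.0, 1.0],
               [0.0, 2.0]])

calibrated_cj = calibrate_confident_joint(cj, labels)
assert np.isclose(np.sum(calibrated_cj), len(labels))               # sums to N
assert np.allclose(calibrated_cj.sum(axis=1), np.bincount(labels))  # row sums match label counts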

cleanlab.count.compute_confident_joint(labels, pred_probs, *, thresholds=None, calibrate=True, multi_label=False, return_indices_of_off_diagonals=False)

Estimates the confident counts of latent true vs observed noisy labels for the examples in our dataset. This array of shape (K, K) is called the confident joint and contains counts of examples in every class, confidently labeled as every other class. These counts may subsequently be used to estimate the joint distribution of true and noisy labels (by normalizing them to frequencies).

Important: this function assumes that pred_probs are out-of-sample holdout probabilities. This can be done with cross validation. If the probabilities are not computed out-of-sample, overfitting may occur.

Parameters:
  • labels (np.ndarray) – An array of shape (N,) of noisy labels, i.e. some labels may be erroneous. Elements must be in the set 0, 1, …, K-1, where K is the number of classes, such that len(set(labels)) == pred_probs.shape[1] for standard multi-class classification with single-labeled data (e.g. labels = [1,0,2,1,1,0...]). For multi-label classification where each example can belong to multiple classes (e.g. labels = [[1,2],[1],[0],...]), your labels should instead satisfy len(set(k for l in labels for k in l)) == pred_probs.shape[1].

  • pred_probs (np.ndarray) – An array of shape (N, K) of model-predicted probabilities, P(label=k|x). Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class, for each of the K classes. The columns must be ordered such that these probabilities correspond to class 0, 1, …, K-1. pred_probs should have been computed using 3 (or higher) fold cross-validation.

  • thresholds (array_like, optional) –

    An array of shape (K, 1) or (K,) of per-class threshold probabilities, used to determine the cutoff probability necessary to consider an example as a given class label (see Northcutt et al., 2021, Section 3.1, Equation 2).

    This is for advanced users only. If not specified, these are computed for you automatically. If an example has a predicted probability greater than this threshold, it is counted as having true_label = k. This is not used for pruning/filtering, only for estimating the noise rates using confident counts.

  • calibrate (bool, default True) – Calibrates confident joint estimate P(label=i, true_label=j) such that np.sum(cj) == len(labels) and np.sum(cj, axis = 1) == np.bincount(labels). When calibrate=True, this method returns an estimate of the latent true joint counts of noisy and true labels.

  • multi_label (bool, optional) – If True, labels should be an iterable (e.g. list) of iterables, containing a list of labels for each example, instead of just a single label. The multi-label setting supports classification tasks where an example has 1 or more labels. Example of a multi-labeled labels input: [[0,1], [1], [0,2], [0,1,2], [0], [1], ...]. The major difference in how this is calibrated versus single-label is that the total number of errors considered is based on the number of labels, not the number of examples. So, the calibrated confident_joint will sum to the number of total labels.

  • return_indices_of_off_diagonals (bool, optional) – If True, returns indices of examples that were counted in off-diagonals of confident joint as a baseline proxy for the label issues. This sometimes works as well as filter.find_label_issues(confident_joint).

Return type:

Union[ndarray, Tuple[ndarray, list]]

Returns:

confident_joint_counts (np.ndarray) – An array of shape (K, K) representing counts of examples for which we are confident about their given and true label. If return_indices_of_off_diagonals is True, confident_joint_counts is the first element of returned tuple and second element is another array of indices counted in off-diagonals of confident joint.
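
A minimal usage sketch, with toy values invented for illustration (in real use, pred_probs must be out-of-sample probabilities from cross-validation):

import numpy as np
from cleanlab.count import compute_confident_joint

labels = np.array([0, 0, 0, 1, 1, 1])
pred_probs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.2, 0.8],  # given label 0, but the model is confident it belongs to class 1
    [0.2, 0.8],
    [0.1, 0.9],
    [0.4, 0.6],
])

cj = compute_confident_joint(labels, pred_probs)
# cj is a (2, 2) array of confident counts; off-diagonal entries flag likely label errors.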

Note

We provide a for-loop based simplification of the confident joint below. This implementation is not efficient, not used in practice, and not complete, but covers the gist of how the confident joint is computed:

# Confident examples are those that we are confident have true_label = k.
# Estimate the (K, K) matrix of confident examples with noisy label = k_s and true_label = k_y.
# Assumes `labels` (N,), `pred_probs` (N, K), and per-class `thresholds` (K,) are already defined.
import numpy as np

K = pred_probs.shape[1]  # number of classes
cj_ish = np.zeros((K, K))
for k_s in range(K):  # k_s is the class of the given noisy label
    for k_y in range(K):  # k_y is the (guessed) true class
        cj_ish[k_s][k_y] = np.sum((pred_probs[:, k_y] >= (thresholds[k_y] - 1e-8)) & (labels == k_s))

The following is a vectorized (but non-parallelized) implementation of the confident joint, again written with for-loops and simplified for understanding. This implementation is 100% accurate; it is just not optimized for speed.

import numpy as np

# Assumes `labels` (N,), `pred_probs` (N, K), and per-class `thresholds` (K,) are already defined.
K = pred_probs.shape[1]  # number of classes
confident_joint = np.zeros((K, K), dtype=int)
for i, row in enumerate(pred_probs):
    s_label = labels[i]
    # Classes whose per-class threshold this example's predicted probability meets.
    confident_bins = row >= thresholds - 1e-6
    num_confident_bins = sum(confident_bins)
    if num_confident_bins == 1:
        confident_joint[s_label][np.argmax(confident_bins)] += 1
    elif num_confident_bins > 1:
        confident_joint[s_label][np.argmax(row)] += 1

cleanlab.count.estimate_confident_joint_and_cv_pred_proba(X, labels, clf=LogisticRegression(), *, cv_n_folds=5, thresholds=None, seed=None, calibrate=True, clf_kwargs={}, validation_func=None)

Estimates P(labels, y), the confident counts of the latent joint distribution of true and noisy labels using observed labels and predicted probabilities pred_probs.

The output of this function is an array of shape (K, K).

Under certain conditions, estimates are exact, and in many conditions, estimates are within one percent of the actual values.

Notes: There are two ways to compute the confident joint, each with pros and cons. (1) For each holdout set, compute the confident joint, then sum the results. (2) Compute pred_probs for each fold, combine them, then compute the confident joint once. Method (1) is more accurate because it correctly computes thresholds for each fold; method (2) is more accurate when you have only a little data because it computes the confident joint using all the probabilities. For example, if you had 100 examples with 5-fold cross-validation and uniform p(y), you would only have 20 examples to compute each confident joint in (1). Such small amounts of data are bound to result in estimation errors. For this reason, we implement (2), but (1) is included as a commented-out function at the end of this file.

Parameters:
  • X (np.ndarray or pd.DataFrame) – Input feature matrix of shape (N, ...), where N is the number of examples. The classifier clf must be able to fit() and predict() data in this format.

  • labels (np.ndarray or pd.Series) – An array of shape (N,) of noisy labels, i.e. some labels may be erroneous. Elements must be in (0, 1, …, K-1) where K is the number of classes, and all classes must be present at least once.

  • clf (estimator instance, optional) – A classifier implementing the sklearn estimator API.

  • cv_n_folds (int, default 5) – The number of cross-validation folds used to compute out-of-sample probabilities for each example in X.

  • thresholds (array_like, optional) –

    An array of shape (K, 1) or (K,) of per-class threshold probabilities, used to determine the cutoff probability necessary to consider an example as a given class label (see Northcutt et al., 2021, Section 3.1, Equation 2).

    This is for advanced users only. If not specified, these are computed for you automatically. If an example has a predicted probability greater than this threshold, it is counted as having true_label = k. This is not used for pruning/filtering, only for estimating the noise rates using confident counts.

  • seed (int, optional) – Set the default state of the random number generator used to split the cross-validated folds. If None, the current np.random state is used.

  • calibrate (bool, default True) – Calibrates confident joint estimate P(label=i, true_label=j) such that np.sum(cj) == len(labels) and np.sum(cj, axis = 1) == np.bincount(labels).

  • clf_kwargs (dict, optional) – Optional keyword arguments to pass into clf’s fit() method.

  • validation_func (callable, optional) – Specifies how to map the validation data split in cross-validation as input for clf.fit(). For details, see the documentation of CleanLearning.fit

Return type:

Tuple[ndarray, ndarray]

Returns:

estimates (tuple) – Tuple of two numpy arrays in the form: (joint counts matrix, predicted probability matrix)
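
A minimal usage sketch on synthetic data (the dataset below is invented for illustration; LogisticRegression is the documented default classifier):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from cleanlab.count import estimate_confident_joint_and_cv_pred_proba

# Synthetic 3-class dataset; in real use, `labels` would contain some noise.
X, labels = make_classification(n_samples=300, n_classes=3, n_informative=4, random_state=0)

confident_joint, pred_probs = estimate_confident_joint_and_cv_pred_proba(
    X, labels, clf=LogisticRegression(), cv_n_folds=5, seed=0
)
# confident_joint: (3, 3) joint counts; pred_probs: (300, 3) out-of-sample probabilities.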

cleanlab.count.estimate_cv_predicted_probabilities(X, labels, clf=LogisticRegression(), *, cv_n_folds=5, seed=None, clf_kwargs={}, validation_func=None)

This function computes the out-of-sample predicted probability [P(label=k|x)] for every example in X using cross validation. Output is a np.ndarray of shape (N, K) where N is the number of training examples and K is the number of classes.

Parameters:
  • X (np.ndarray) – Input feature matrix of shape (N, ...), where N is the number of examples. The classifier clf must be able to handle data with this shape.

  • labels (np.ndarray) – An array of shape (N,) of noisy labels, i.e. some labels may be erroneous. Elements must be in the set 0, 1, …, K-1, where K is the number of classes.

  • clf (estimator instance, optional) – A classifier implementing the sklearn estimator API.

  • cv_n_folds (int, default 5) – The number of cross-validation folds used to compute out-of-sample probabilities for each example in X.

  • seed (int, optional) – Set the default state of the random number generator used to split the cross-validated folds. If None, the current np.random state is used.

  • clf_kwargs (dict, optional) – Optional keyword arguments to pass into clf’s fit() method.

  • validation_func (callable, optional) – Specifies how to map the validation data split in cross-validation as input for clf.fit(). For details, see the documentation of CleanLearning.fit

Return type:

ndarray

Returns:

pred_probs (np.ndarray) – An array of shape (N, K) representing P(label=k|x), the model-predicted probabilities. Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class.
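
A minimal usage sketch on synthetic data (invented for illustration):

import numpy as np
from sklearn.datasets import make_classification
from cleanlab.count import estimate_cv_predicted_probabilities

X, labels = make_classification(n_samples=200, n_classes=2, random_state=0)
pred_probs = estimate_cv_predicted_probabilities(X, labels, cv_n_folds=5, seed=0)
assert pred_probs.shape == (200, 2)
assert np.allclose(pred_probs.sum(axis=1), 1.0)  # each row is a probability distribution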

cleanlab.count.estimate_joint(labels, pred_probs, *, confident_joint=None, multi_label=False)

Estimates the joint distribution of label noise P(label=i, true_label=j) guaranteed to:

  • Sum to 1

  • Satisfy np.sum(joint_estimate, axis = 1) == p(labels)

Parameters:
  • labels (np.ndarray) – An array of shape (N,) of noisy labels, i.e. some labels may be erroneous. Elements must be in the set 0, 1, …, K-1, where K is the number of classes. All K classes (0, 1, …, K-1) MUST be present in labels, such that len(set(labels)) == pred_probs.shape[1] for standard multi-class classification with single-labeled data (e.g. labels = [1,0,2,1,1,0...]). For multi-label classification where each example can belong to multiple classes (e.g. labels = [[1,2],[1],[0],...]), your labels should instead satisfy len(set(k for l in labels for k in l)) == pred_probs.shape[1].

  • pred_probs (np.ndarray) – An array of shape (N, K) of model-predicted probabilities, P(label=k|x). Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class, for each of the K classes. The columns must be ordered such that these probabilities correspond to class 0, 1, …, K-1. pred_probs should have been computed using 3 (or higher) fold cross-validation.

  • confident_joint (np.ndarray, optional) – An array of shape (K, K) representing the confident joint, the matrix used for identifying label issues, which estimates a confident subset of the joint distribution of the noisy and true labels, P_{noisy label, true label}. Entry (j, k) in the matrix is the number of examples confidently counted into the pair of (noisy label=j, true label=k) classes. The confident_joint can be computed using count.compute_confident_joint. If not provided, it is computed from the given (noisy) labels and pred_probs.

  • multi_label (bool, optional) – If True, labels should be an iterable (e.g. list) of iterables, containing a list of labels for each example, instead of just a single label. The multi-label setting supports classification tasks where an example has 1 or more labels. Example of a multi-labeled labels input: [[0,1], [1], [0,2], [0,1,2], [0], [1], ...].

Return type:

ndarray

Returns:

confident_joint_distribution (np.ndarray) – An array of shape (K, K) representing an estimate of the true joint distribution of noisy and true labels.
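
A minimal sketch of the guarantees above, with toy values invented for illustration:

import numpy as np
from cleanlab.count import estimate_joint

labels = np.array([0, 0, 0, 1, 1, 1])
pred_probs = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.2, 0.8],
    [0.2, 0.8], [0.1, 0.9], [0.4, 0.6],
])

joint = estimate_joint(labels, pred_probs)
assert np.isclose(joint.sum(), 1.0)                                        # sums to 1
assert np.allclose(joint.sum(axis=1), np.bincount(labels) / len(labels))   # rows match p(labels)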

cleanlab.count.estimate_latent(confident_joint, labels, *, py_method='cnt', converge_latent_estimates=False)

Computes the latent prior p(y), the noise matrix P(labels|y) and the inverse noise matrix P(y|labels) from the confident_joint count(labels, y). The confident_joint can be estimated by count.compute_confident_joint by counting confident examples.

Parameters:
  • confident_joint (np.ndarray) – An array of shape (K, K) representing the confident joint, the matrix used for identifying label issues, which estimates a confident subset of the joint distribution of the noisy and true labels, P_{noisy label, true label}. Entry (j, k) in the matrix is the number of examples confidently counted into the pair of (noisy label=j, true label=k) classes. The confident_joint can be computed using count.compute_confident_joint. If not provided, it is computed from the given (noisy) labels and pred_probs.

  • labels (np.ndarray) – An array of shape (N,) of noisy labels, i.e. some labels may be erroneous. Elements must be in the set 0, 1, …, K-1, where K is the number of classes.

  • py_method ({"cnt", "eqn", "marginal", "marginal_ps"}, default "cnt") – py is shorthand for the “class proportions (a.k.a prior) of the true labels”. This method defines how to compute the latent prior p(true_label=k). Default is "cnt", which works well even when the noise matrices are estimated poorly by using the matrix diagonals instead of all the probabilities.

  • converge_latent_estimates (bool, optional) – If True, forces numerical consistency of estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively make them mathematically consistent.

Return type:

Tuple[ndarray, ndarray, ndarray]

Returns:

tuple – A tuple containing (py, noise_matrix, inv_noise_matrix).
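
A minimal usage sketch, with toy values invented for illustration:

import numpy as np
from cleanlab.count import compute_confident_joint, estimate_latent

labels = np.array([0, 0, 0, 1, 1, 1])
pred_probs = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.2, 0.8],
    [0.2, 0.8], [0.1, 0.9], [0.4, 0.6],
])

cj = compute_confident_joint(labels, pred_probs)
py, noise_matrix, inv_noise_matrix = estimate_latent(cj, labels)
# py: (K,) prior over true labels; noise_matrix: P(label|true_label);
# inv_noise_matrix: P(true_label|label).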

cleanlab.count.estimate_noise_matrices(X, labels, clf=LogisticRegression(), *, cv_n_folds=5, thresholds=None, converge_latent_estimates=True, seed=None, clf_kwargs={}, validation_func=None)

Estimates the noise_matrix of shape (K, K). This is the fraction of examples in every class, labeled as every other class. The noise_matrix is a conditional probability matrix for P(label=k_s|true_label=k_y).

Under certain conditions, estimates are exact, and in most conditions, estimates are within one percent of the actual noise rates.

Parameters:
  • X (np.ndarray) – Input feature matrix of shape (N, ...), where N is the number of examples. The classifier clf must be able to handle data with this shape.

  • labels (np.ndarray) – An array of shape (N,) of noisy labels, i.e. some labels may be erroneous. Elements must be in the set 0, 1, …, K-1, where K is the number of classes.

  • clf (estimator instance, optional) – A classifier implementing the sklearn estimator API.

  • cv_n_folds (int, default 5) – The number of cross-validation folds used to compute out-of-sample probabilities for each example in X.

  • thresholds (array_like, optional) –

    An array of shape (K, 1) or (K,) of per-class threshold probabilities, used to determine the cutoff probability necessary to consider an example as a given class label (see Northcutt et al., 2021, Section 3.1, Equation 2).

    This is for advanced users only. If not specified, these are computed for you automatically. If an example has a predicted probability greater than this threshold, it is counted as having true_label = k. This is not used for pruning/filtering, only for estimating the noise rates using confident counts.

  • converge_latent_estimates (bool, optional) – If True, forces numerical consistency of estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively make them mathematically consistent.

  • seed (int, optional) – Set the default state of the random number generator used to split the cross-validated folds. If None, the current np.random state is used.

  • clf_kwargs (dict, optional) – Optional keyword arguments to pass into clf’s fit() method.

  • validation_func (callable, optional) – Specifies how to map the validation data split in cross-validation as input for clf.fit(). For details, see the documentation of CleanLearning.fit

Return type:

Tuple[ndarray, ndarray]

Returns:

estimates (tuple) – A tuple containing arrays (noise_matrix, inv_noise_matrix).
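
A minimal usage sketch on synthetic data (invented for illustration):

import numpy as np
from sklearn.datasets import make_classification
from cleanlab.count import estimate_noise_matrices

X, labels = make_classification(n_samples=300, n_classes=2, random_state=0)
noise_matrix, inv_noise_matrix = estimate_noise_matrices(X, labels, cv_n_folds=5, seed=0)
# noise_matrix[k_s][k_y] estimates P(label=k_s | true_label=k_y), so each column sums to 1.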

cleanlab.count.estimate_py_and_noise_matrices_from_probabilities(labels, pred_probs, *, thresholds=None, converge_latent_estimates=True, py_method='cnt', calibrate=True)

Computes the confident counts estimate of latent variables py and the noise rates using observed labels and predicted probabilities, pred_probs.

Important: this function assumes that pred_probs are out-of-sample holdout probabilities. This can be done with cross validation. If the probabilities are not computed out-of-sample, overfitting may occur.

This function estimates the noise_matrix of shape (K, K). This is the fraction of examples in every class, labeled as every other class. The noise_matrix is a conditional probability matrix for P(label=k_s|true_label=k_y).

Under certain conditions, estimates are exact, and in most conditions, estimates are within one percent of the actual noise rates.

Parameters:
  • labels (np.ndarray) – An array of shape (N,) of noisy labels, i.e. some labels may be erroneous. Elements must be in the set 0, 1, …, K-1, where K is the number of classes.

  • pred_probs (np.ndarray) – An array of shape (N, K) of model-predicted probabilities, P(label=k|x). Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class, for each of the K classes. The columns must be ordered such that these probabilities correspond to class 0, 1, …, K-1. pred_probs should have been computed using 3 (or higher) fold cross-validation.

  • thresholds (array_like, optional) –

    An array of shape (K, 1) or (K,) of per-class threshold probabilities, used to determine the cutoff probability necessary to consider an example as a given class label (see Northcutt et al., 2021, Section 3.1, Equation 2).

    This is for advanced users only. If not specified, these are computed for you automatically. If an example has a predicted probability greater than this threshold, it is counted as having true_label = k. This is not used for pruning/filtering, only for estimating the noise rates using confident counts.

  • converge_latent_estimates (bool, optional) – If True, forces numerical consistency of estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively make them mathematically consistent.

  • py_method ({"cnt", "eqn", "marginal", "marginal_ps"}, default "cnt") – How to compute the latent prior p(true_label=k). Default is "cnt" as it often works well even when the noise matrices are estimated poorly by using the matrix diagonals instead of all the probabilities.

  • calibrate (bool, default True) – Calibrates confident joint estimate P(label=i, true_label=j) such that np.sum(cj) == len(labels) and np.sum(cj, axis = 1) == np.bincount(labels).

Return type:

Tuple[ndarray, ndarray, ndarray, ndarray]

Returns:

estimates (tuple) – A tuple of arrays: (py, noise_matrix, inverse_noise_matrix, confident_joint).
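
A minimal usage sketch, with toy values invented for illustration (in real use, pred_probs must be out-of-sample probabilities from cross-validation):

import numpy as np
from cleanlab.count import estimate_py_and_noise_matrices_from_probabilities

labels = np.array([0, 0, 0, 1, 1, 1])
pred_probs = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.2, 0.8],
    [0.2, 0.8], [0.1, 0.9], [0.4, 0.6],
])

py, noise_matrix, inv_noise_matrix, confident_joint = (
    estimate_py_and_noise_matrices_from_probabilities(labels, pred_probs)
)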

cleanlab.count.estimate_py_noise_matrices_and_cv_pred_proba(X, labels, clf=LogisticRegression(), *, cv_n_folds=5, thresholds=None, converge_latent_estimates=False, py_method='cnt', seed=None, clf_kwargs={}, validation_func=None)

This function computes the out-of-sample predicted probability P(label=k|x) for every example x in X using cross validation, while also computing the confident counts noise rates within each cross-validated subset, and returns the average noise rate across all examples.

This function estimates the noise_matrix of shape (K, K). This is the fraction of examples in every class, labeled as every other class. The noise_matrix is a conditional probability matrix for P(label=k_s|true_label=k_y).

Under certain conditions, estimates are exact, and in most conditions, estimates are within one percent of the actual noise rates.

Parameters:
  • X (np.ndarray) – Input feature matrix of shape (N, ...), where N is the number of examples. The classifier clf must be able to handle data with this shape.

  • labels (np.ndarray) – An array of shape (N,) of noisy labels, i.e. some labels may be erroneous. Elements must be in the set 0, 1, …, K-1, where K is the number of classes.

  • clf (estimator instance, optional) – A classifier implementing the sklearn estimator API.

  • cv_n_folds (int, default 5) – The number of cross-validation folds used to compute out-of-sample probabilities for each example in X.

  • thresholds (array_like, optional) –

    An array of shape (K, 1) or (K,) of per-class threshold probabilities, used to determine the cutoff probability necessary to consider an example as a given class label (see Northcutt et al., 2021, Section 3.1, Equation 2).

    This is for advanced users only. If not specified, these are computed for you automatically. If an example has a predicted probability greater than this threshold, it is counted as having true_label = k. This is not used for pruning/filtering, only for estimating the noise rates using confident counts.

  • converge_latent_estimates (bool, optional) – If True, forces numerical consistency of estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively make them mathematically consistent.

  • py_method ({"cnt", "eqn", "marginal", "marginal_ps"}, default "cnt") – How to compute the latent prior p(true_label=k). Default is "cnt" as it often works well even when the noise matrices are estimated poorly by using the matrix diagonals instead of all the probabilities.

  • seed (int, optional) – Set the default state of the random number generator used to split the cross-validated folds. If None, the current np.random state is used.

  • clf_kwargs (dict, optional) – Optional keyword arguments to pass into clf’s fit() method.

  • validation_func (callable, optional) – Specifies how to map the validation data split in cross-validation as input for clf.fit(). For details, see the documentation of CleanLearning.fit

Return type:

Tuple[ndarray, ndarray, ndarray, ndarray, ndarray]

Returns:

estimates (tuple) – A tuple of five arrays (py, noise matrix, inverse noise matrix, confident joint, predicted probability matrix).
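
A minimal usage sketch on synthetic data (invented for illustration):

import numpy as np
from sklearn.datasets import make_classification
from cleanlab.count import estimate_py_noise_matrices_and_cv_pred_proba

X, labels = make_classification(n_samples=300, n_classes=2, random_state=0)
py, noise_matrix, inv_noise_matrix, confident_joint, pred_probs = (
    estimate_py_noise_matrices_and_cv_pred_proba(X, labels, cv_n_folds=5, seed=0)
)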

cleanlab.count.get_confident_thresholds(labels, pred_probs, multi_label=False)

Returns the expected (average) “self-confidence” for each class.

The confident class threshold for a class j is the expected (average) “self-confidence” for class j, i.e. the mean predicted probability of class j over the examples whose given label is j (see Northcutt et al., 2021, Equation 2).

Parameters:
  • labels (np.ndarray) – An array of shape (N,) of noisy labels, i.e. some labels may be erroneous. Elements must be in the set 0, 1, …, K-1, where K is the number of classes. All K classes (0, 1, …, K-1) MUST be present in labels, such that len(set(labels)) == pred_probs.shape[1] for standard multi-class classification with single-labeled data (e.g. labels = [1,0,2,1,1,0...]). For multi-label classification where each example can belong to multiple classes (e.g. labels = [[1,2],[1],[0],...]), your labels should instead satisfy len(set(k for l in labels for k in l)) == pred_probs.shape[1].

  • pred_probs (np.ndarray) – An array of shape (N, K) of model-predicted probabilities, P(label=k|x). Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class, for each of the K classes. The columns must be ordered such that these probabilities correspond to class 0, 1, …, K-1. pred_probs should have been computed using 3 (or higher) fold cross-validation.

  • multi_label (bool, optional) – If True, labels should be an iterable (e.g. list) of iterables, containing a list of labels for each example, instead of just a single label. Assumes all classes in pred_probs.shape[1] are represented in labels. The multi-label setting supports classification tasks where an example has 1 or more labels. Example of a multi-labeled labels input: [[0,1], [1], [0,2], [0,1,2], [0], [1], ...].

Return type:

ndarray

Returns:

confident_thresholds (np.ndarray) – An array of shape (K, ) where K is the number of classes.
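
A minimal sketch, with toy values invented for illustration; per Equation 2 of Northcutt et al., 2021, the returned thresholds should equal the per-class mean self-confidence:

import numpy as np
from cleanlab.count import get_confident_thresholds

labels = np.array([0, 0, 1, 1, 1])
pred_probs = np.array([
    [0.9, 0.1],
    [0.7, 0.3],
    [0.2, 0.8],
    [0.4, 0.6],
    [0.3, 0.7],
])

thresholds = get_confident_thresholds(labels, pred_probs)
# Expected: mean of pred_probs[i, k] over the examples whose given label is k.
expected = np.array([pred_probs[labels == k, k].mean() for k in range(2)])
assert np.allclose(thresholds, expected)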

cleanlab.count.num_label_issues(labels, pred_probs, confident_joint=None)

Estimates the number of label issues in the labels of a dataset.

This method is more accurate than sum(find_label_issues()) because it is computed using only the trace of the confident joint, ignoring all off-diagonals (which are used by find_label_issues and are harder to estimate). Here, we sum over only the diagonal elements of the joint, which have more data, are more constrained, and are therefore easier to compute.

TL;DR: use this method to get the most accurate estimate of the number of label issues when you don’t need the indices of the label issues.

You can identify the label issues themselves by using num_label_issues as the cutoff threshold with the ranking/scoring functions from cleanlab.rank, as shown in the sketch at the end of this section. There are two cases when you should use this approach instead of filter.find_label_issues:

  1. As we add more label and data quality scoring functions in cleanlab.rank, this approach will always work.

  2. If you have a custom score to rank your data by label quality and you just need to know the cut-off of likely label issues.

Parameters:
  • labels (np.ndarray) – An array of shape (N,) of noisy labels, i.e. some labels may be erroneous. Elements must be in the set 0, 1, …, K-1, where K is the number of classes.

  • pred_probs (np.ndarray) – An array of shape (N, K) of model-predicted probabilities, P(label=k|x). Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class, for each of the K classes. The columns must be ordered such that these probabilities correspond to class 0, 1, …, K-1. pred_probs should have been computed using 3 (or higher) fold cross-validation.

  • confident_joint (np.ndarray, optional) – An array of shape (K, K) representing the confident joint, the matrix used for identifying label issues, which estimates a confident subset of the joint distribution of the noisy and true labels, P_{noisy label, true label}. Entry (j, k) in the matrix is the number of examples confidently counted into the pair of (noisy label=j, true label=k) classes. The confident_joint can be computed using count.compute_confident_joint. If not provided, it is computed from the given (noisy) labels and pred_probs.

Return type:

int

Returns:

num_issues (int) – The estimated number of examples with label issues in the dataset.
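
A minimal sketch of this approach, with toy values invented for illustration; get_label_quality_scores is one of the scoring functions in cleanlab.rank:

import numpy as np
from cleanlab.count import num_label_issues
from cleanlab.rank import get_label_quality_scores

labels = np.array([0, 0, 0, 1, 1, 1])
pred_probs = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.2, 0.8],
    [0.2, 0.8], [0.1, 0.9], [0.4, 0.6],
])

n = num_label_issues(labels, pred_probs)
scores = get_label_quality_scores(labels, pred_probs)  # lower score = lower label quality
issue_indices = np.argsort(scores)[:n]  # the n examples most likely to be mislabeled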