count
Methods to estimate latent structures used for confident learning, including:

- Latent prior of the unobserved, error-less labels: py : p(y)
- Latent noisy channel (noise matrix) characterizing the flipping rates: nm : P(given label | true label)
- Latent inverse noise matrix characterizing the flipping process: inv : P(true label | given label)
- Latent confident_joint, an un-normalized matrix that counts the confident subset of label errors under the joint distribution for true/given label
These are estimated from a classification dataset. This module considers two types of datasets:

- standard (multi-class) classification, where each example is labeled as belonging to exactly one of K classes (e.g. labels = np.array([0,0,1,0,2,1]))
- multi-label classification, where each example can be labeled as belonging to multiple classes (e.g. labels = [[1,2],[1],[0],[],...])
Functions:

- num_label_issues – Estimates the number of label issues in a classification dataset.
- calibrate_confident_joint – Calibrates any confident joint estimate.
- estimate_joint – Estimates the joint distribution of label noise.
- compute_confident_joint – Estimates the confident counts of latent true vs observed noisy labels for the examples in our dataset.
- estimate_latent – Computes the latent prior p(y), noise matrix, and inverse noise matrix from the confident joint.
- estimate_py_and_noise_matrices_from_probabilities – Computes the confident counts estimate of latent variables py and the noise rates.
- estimate_confident_joint_and_cv_pred_proba – Estimates the confident joint and out-of-sample predicted probabilities via cross-validation.
- estimate_py_noise_matrices_and_cv_pred_proba – Computes the out-of-sample predicted probability P(label=k|x) for every example via cross-validation, along with all latent estimates.
- estimate_cv_predicted_probabilities – Computes the out-of-sample predicted probability P(label=k|x) for every example in X using cross-validation.
- estimate_noise_matrices – Estimates the noise_matrix of conditional flipping rates P(label=k_s|true_label=k_y).
- get_confident_thresholds – Returns the expected (average) "self-confidence" for each class.
- cleanlab.count.num_label_issues(labels, pred_probs, *, confident_joint=None, estimation_method='off_diagonal', multi_label=False)
Estimates the number of label issues in a classification dataset. Use this method to get the most accurate estimate of number of label issues when you don’t need the indices of the examples with label issues.
- Parameters:
  - labels (np.ndarray or list) – Given class labels for each example in the dataset, some of which may be erroneous, in the same format expected by the filter.find_label_issues function.
  - pred_probs (np.ndarray) – Model-predicted class probabilities for each example in the dataset, in the same format expected by the filter.find_label_issues function.
  - confident_joint (Optional[np.ndarray]) – Array of estimated class label error statistics used for identifying label issues, in the same format expected by the filter.find_label_issues function. The confident_joint can be computed using count.compute_confident_joint. If not provided, it is internally computed from the given (noisy) labels and pred_probs.
  - estimation_method (str) – Method for estimating the number of label issues in the dataset by counting the examples in the off-diagonal of the confident_joint P(label=i, true_label=j):
    - 'off_diagonal': Counts the number of examples in the off-diagonal of the confident_joint. Returns the same value as sum(find_label_issues(filter_by='confident_learning')).
    - 'off_diagonal_calibrated': Calibrates the confident joint estimate P(label=i, true_label=j) such that np.sum(cj) == len(labels) and np.sum(cj, axis=1) == np.bincount(labels) before counting the number of examples in the off-diagonal. This number will always be equal to or greater than the one from estimation_method='off_diagonal'. Prefer this value over the 'off_diagonal' count as the cutoff threshold used with ranking/scoring functions from cleanlab.rank in two cases:
      1. As we add more label and data quality scoring functions in cleanlab.rank, this approach will always work.
      2. If you have a custom score to rank your data by label quality and you just need to know the cut-off of likely label issues.
    - 'off_diagonal_custom': Counts the number of examples in the off-diagonal of a provided confident_joint matrix.
    TL;DR: Use this method to get the most accurate estimate of the number of label issues when you don’t need the indices of the label issues.
    Note: 'off_diagonal' may sometimes underestimate issues for data with few classes, so consider using 'off_diagonal_calibrated' instead if your data has < 4 classes.
  - multi_label (bool, optional) – Set False if your dataset is for regular (multi-class) classification, where each example belongs to exactly one class. Set True if your dataset is for multi-label classification, where each example can belong to multiple classes. See the documentation of compute_confident_joint for details.
- Return type: int
- Returns: num_issues – The estimated number of examples with label issues in the dataset.
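For illustration, a minimal usage sketch; the labels and pred_probs below are invented toy values (real pred_probs should be out-of-sample, e.g. obtained via cross-validation):

```python
import numpy as np
from cleanlab.count import num_label_issues

labels = np.array([0, 0, 1, 0, 2, 1])
pred_probs = np.array([  # hypothetical out-of-sample predicted probabilities
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],  # labeled class 0, but the model confidently predicts class 1
    [0.1, 0.1, 0.8],
    [0.3, 0.6, 0.1],
])

n = num_label_issues(labels, pred_probs)  # default: 'off_diagonal'
n_cal = num_label_issues(labels, pred_probs, estimation_method="off_diagonal_calibrated")
print(n, n_cal)  # per the docs, the calibrated count is always >= the default count
```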
- cleanlab.count.calibrate_confident_joint(confident_joint, labels, *, multi_label=False)
Calibrates any confident joint estimate P(label=i, true_label=j) such that np.sum(cj) == len(labels) and np.sum(cj, axis=1) == np.bincount(labels).
In other words, this function forces the confident joint to have the true noisy prior p(labels) (summed over columns for each row) and also forces the confident joint to add up to the total number of examples.
This method makes the confident joint a valid counts estimate of the actual joint of noisy and true labels.
- Parameters:
  - confident_joint (np.ndarray) – An array of shape (K, K) representing the confident joint, the matrix used for identifying label issues, which estimates a confident subset of the joint distribution of the noisy and true labels, P_{noisy label, true label}. Entry (j, k) in the matrix is the number of examples confidently counted into the pair of (noisy label=j, true label=k) classes. The confident_joint can be computed using count.compute_confident_joint. If multi_label is True, then the confident_joint should be a one-vs-rest array of shape (K, 2, 2), and an array of the same shape will be returned.
  - labels (np.ndarray or list) – Given class labels for each example in the dataset, some of which may be erroneous, in the same format expected by the filter.find_label_issues function.
  - multi_label (bool, optional) – If False, dataset is for regular (multi-class) classification, where each example belongs to exactly one class. If True, dataset is for multi-label classification, where each example can belong to multiple classes. See the documentation of compute_confident_joint for details. In multi-label classification, the confident/calibrated joint arrays have shape (K, 2, 2), formatted in a one-vs-rest fashion such that they contain a 2x2 matrix for each class that counts examples which are correctly/incorrectly labeled as belonging to that class. After calibration, the entries in each class-specific 2x2 matrix will sum to the number of examples.
- Return type: np.ndarray
- Returns: calibrated_cj (np.ndarray) – An array of shape (K, K) representing a valid estimate of the joint counts of noisy and true labels (if multi_label is False). If multi_label is True, the returned calibrated_cj is instead a one-vs-rest array of shape (K, 2, 2), where for class c: entry (c, 0, 0) is the number of examples whose noisy label contains c that are confidently identified as truly belonging to class c as well; entry (c, 1, 0) is the number of examples whose noisy label contains c that are confidently identified as not actually belonging to class c; entry (c, 0, 1) is the number of examples whose noisy label does not contain c that are confidently identified as truly belonging to class c; and entry (c, 1, 1) is the number of examples whose noisy label does not contain c that are confidently identified as actually not belonging to class c.
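A minimal sketch checking the calibration invariants stated above; the confident joint and labels are toy values invented for illustration:

```python
import numpy as np
from cleanlab.count import calibrate_confident_joint

labels = np.array([0, 0, 0, 1, 1, 2])
cj = np.array([  # hypothetical raw confident counts (rows = noisy label, cols = true label)
    [2, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
])

calibrated = calibrate_confident_joint(cj, labels)
# Calibration forces totals and row sums to match the observed labels:
assert np.isclose(calibrated.sum(), len(labels))
assert np.allclose(calibrated.sum(axis=1), np.bincount(labels))
```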
- cleanlab.count.estimate_joint(labels, pred_probs, *, confident_joint=None, multi_label=False)
Estimates the joint distribution of label noise P(label=i, true_label=j), guaranteed to:

- sum to 1
- satisfy np.sum(joint_estimate, axis=1) == p(labels)
- Parameters:
  - labels (np.ndarray or list) – Given class labels for each example in the dataset, some of which may be erroneous, in the same format expected by the filter.find_label_issues function.
  - pred_probs (np.ndarray) – Model-predicted class probabilities for each example in the dataset, in the same format expected by the filter.find_label_issues function.
  - confident_joint (np.ndarray, optional) – Array of estimated class label error statistics used for identifying label issues, in the same format expected by the filter.find_label_issues function. The confident_joint can be computed using count.compute_confident_joint. If not provided, it is internally computed from the given (noisy) labels and pred_probs.
  - multi_label (bool, optional) – If False, dataset is for regular (multi-class) classification, where each example belongs to exactly one class. If True, dataset is for multi-label classification, where each example can belong to multiple classes. See the documentation of compute_confident_joint for details.
- Return type: np.ndarray
- Returns: confident_joint_distribution (np.ndarray) – An array of shape (K, K) representing an estimate of the true joint distribution of noisy and true labels (if multi_label is False). If multi_label is True, an array of shape (K, 2, 2) representing an estimate of the true joint distribution of noisy and true labels for each class in a one-vs-rest fashion. Entry (c, i, j) in this array is the number of examples confidently counted into a (class c, noisy label=i, true label=j) bin, where i, j are either 0 or 1 to denote whether this example belongs to class c or not (recall examples can belong to multiple classes in multi-label classification).
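A minimal sketch verifying the two guarantees above on invented toy arrays:

```python
import numpy as np
from cleanlab.count import estimate_joint

labels = np.array([0, 0, 1, 0, 2, 1])
pred_probs = np.array([  # hypothetical out-of-sample predicted probabilities
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.6, 0.1],
])

joint = estimate_joint(labels, pred_probs)
assert np.isclose(joint.sum(), 1.0)  # the estimate sums to 1
# Row sums match the empirical prior of the observed labels, p(labels):
assert np.allclose(joint.sum(axis=1), np.bincount(labels) / len(labels))
```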
- cleanlab.count.compute_confident_joint(labels, pred_probs, *, thresholds=None, calibrate=True, multi_label=False, return_indices_of_off_diagonals=False)
Estimates the confident counts of latent true vs observed noisy labels for the examples in our dataset. This array of shape (K, K) is called the confident joint and contains counts of examples in every class, confidently labeled as every other class. These counts may subsequently be used to estimate the joint distribution of true and noisy labels (by normalizing them to frequencies).
Important: this function assumes that pred_probs are out-of-sample holdout probabilities. This can be done with cross validation. If the probabilities are not computed out-of-sample, overfitting may occur.
- Parameters:
  - labels (np.ndarray or list) – Given class labels for each example in the dataset, some of which may be erroneous, in the same format expected by the filter.find_label_issues function.
  - pred_probs (np.ndarray) – Model-predicted class probabilities for each example in the dataset, in the same format expected by the filter.find_label_issues function.
  - thresholds (array_like, optional) – An array of shape (K, 1) or (K,) of per-class threshold probabilities, used to determine the cutoff probability necessary to consider an example as a given class label (see Northcutt et al., 2021, Section 3.1, Equation 2). This is for advanced users only. If not specified, these are computed for you automatically. If an example has a predicted probability greater than this threshold, it is counted as having true_label = k. This is not used for pruning/filtering, only for estimating the noise rates using confident counts.
  - calibrate (bool, default True) – Calibrates the confident joint estimate P(label=i, true_label=j) such that np.sum(cj) == len(labels) and np.sum(cj, axis=1) == np.bincount(labels). When calibrate=True, this method returns an estimate of the latent true joint counts of noisy and true labels.
  - multi_label (bool, optional) – If True, this is a multi-label classification dataset (where each example can belong to more than one class) rather than a regular (multi-class) classification dataset. In this case, labels should be an iterable (e.g. list) of iterables (e.g. List[List[int]]) containing the list of classes to which each example belongs, instead of just a single class. Example of labels for a multi-label classification dataset: [[0,1], [1], [0,2], [0,1,2], [0], [1], [], ...].
  - return_indices_of_off_diagonals (bool, optional) – If True, returns indices of examples that were counted in off-diagonals of the confident joint as a baseline proxy for the label issues. This sometimes works as well as filter.find_label_issues(confident_joint).
- Return type: Union[np.ndarray, Tuple[np.ndarray, list]]
- Returns: confident_joint_counts (np.ndarray) – An array of shape (K, K) representing counts of examples for which we are confident about their given and true label (if multi_label is False). If multi_label is True, this array instead has shape (K, 2, 2), representing a one-vs-rest format for the confident joint, where for each class c: entry (c, 0, 0) is the number of examples whose noisy label contains c that are confidently identified as truly belonging to class c as well; entry (c, 1, 0) is the number of examples whose noisy label contains c that are confidently identified as not actually belonging to class c; entry (c, 0, 1) is the number of examples whose noisy label does not contain c that are confidently identified as truly belonging to class c; and entry (c, 1, 1) is the number of examples whose noisy label does not contain c that are confidently identified as actually not belonging to class c.
If return_indices_of_off_diagonals is set as True, this function instead returns a tuple (confident_joint, indices_off_diagonal) where indices_off_diagonal is a list of arrays and each array contains the indices of examples counted in off-diagonals of the confident joint.
Note
We provide a for-loop based simplification of the confident joint below. This implementation is not efficient, not used in practice, and not complete, but covers the gist of how the confident joint is computed:
```python
# Confident examples are those that we are confident have true_label = k
# Estimate (K, K) matrix of confident examples with label = k_s and true_label = k_y
cj_ish = np.zeros((K, K))
for k_s in range(K):  # k_s is the class value k of noisy labels `s`
    for k_y in range(K):  # k_y is the (guessed) class k of true_label k_y
        cj_ish[k_s][k_y] = sum(
            (pred_probs[:, k_y] >= (thresholds[k_y] - 1e-8)) & (labels == k_s)
        )
```
The following is a vectorized (but non-parallelized) implementation of the confident joint, again slow and simplified with for-loops for understanding. This implementation is 100% accurate; it’s just not optimized for speed.
```python
# Each example is counted once: in the single class whose threshold it exceeds,
# or (if several classes exceed their thresholds) in the most likely class.
confident_joint = np.zeros((K, K), dtype=int)
for i, row in enumerate(pred_probs):
    s_label = labels[i]
    confident_bins = row >= thresholds - 1e-6
    num_confident_bins = sum(confident_bins)
    if num_confident_bins == 1:
        confident_joint[s_label][np.argmax(confident_bins)] += 1
    elif num_confident_bins > 1:
        confident_joint[s_label][np.argmax(row)] += 1
```
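A minimal usage sketch with invented toy arrays (real pred_probs should be out-of-sample holdout probabilities):

```python
import numpy as np
from cleanlab.count import compute_confident_joint

labels = np.array([0, 0, 1, 0, 2, 1])
pred_probs = np.array([  # hypothetical out-of-sample predicted probabilities
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.6, 0.1],
])

cj, indices = compute_confident_joint(
    labels, pred_probs, return_indices_of_off_diagonals=True
)
print(cj)       # (K, K) counts; off-diagonal mass indicates likely label errors
print(indices)  # indices of examples counted in the off-diagonals
```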
- cleanlab.count.estimate_latent(confident_joint, labels, *, py_method='cnt', converge_latent_estimates=False)
Computes the latent prior p(y), the noise matrix P(labels|y), and the inverse noise matrix P(y|labels) from the confident_joint count(labels, y). The confident_joint can be estimated by compute_confident_joint, which counts confident examples.
- Parameters:
  - confident_joint (np.ndarray) – An array of shape (K, K) representing the confident joint, the matrix used for identifying label issues, which estimates a confident subset of the joint distribution of the noisy and true labels, P_{noisy label, true label}. Entry (j, k) in the matrix is the number of examples confidently counted into the pair of (noisy label=j, true label=k) classes. The confident_joint can be computed using count.compute_confident_joint.
  - labels (np.ndarray) – A 1D array of shape (N,) containing class labels for a standard (multi-class) classification dataset. Some given labels may be erroneous. Elements must be integers in the set 0, 1, …, K-1, where K is the number of classes.
  - py_method ({"cnt", "eqn", "marginal", "marginal_ps"}, default "cnt") – py is shorthand for the "class proportions (a.k.a. prior) of the true labels". This method defines how to compute the latent prior p(true_label=k). Default is "cnt", which works well even when the noise matrices are estimated poorly, by using the matrix diagonals instead of all the probabilities.
  - converge_latent_estimates (bool, optional) – If True, forces numerical consistency of estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively make them mathematically consistent.
- Return type: Tuple[np.ndarray, np.ndarray, np.ndarray]
- Returns: tuple – A tuple containing (py, noise_matrix, inv_noise_matrix).
Note
Multi-label classification is not supported in this method.
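A minimal sketch deriving the latent estimates from a confident joint, using invented toy arrays:

```python
import numpy as np
from cleanlab.count import compute_confident_joint, estimate_latent

labels = np.array([0, 0, 1, 0, 2, 1])
pred_probs = np.array([  # hypothetical out-of-sample predicted probabilities
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.6, 0.1],
])

cj = compute_confident_joint(labels, pred_probs)
py, noise_matrix, inv_noise_matrix = estimate_latent(cj, labels)
# noise_matrix[s, y] estimates P(label=s | true_label=y),
# so each column should sum to (approximately) 1:
print(noise_matrix.sum(axis=0))
```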
- cleanlab.count.estimate_py_and_noise_matrices_from_probabilities(labels, pred_probs, *, thresholds=None, converge_latent_estimates=True, py_method='cnt', calibrate=True)
Computes the confident counts estimate of latent variables py and the noise rates using observed labels and predicted probabilities, pred_probs.
Important: this function assumes that pred_probs are out-of-sample holdout probabilities. This can be done with cross validation. If the probabilities are not computed out-of-sample, overfitting may occur.
This function estimates the noise_matrix of shape (K, K). This is the fraction of examples in every class, labeled as every other class. The noise_matrix is a conditional probability matrix for P(label=k_s|true_label=k_y).
Under certain conditions, estimates are exact, and in most conditions, estimates are within one percent of the actual noise rates.
- Parameters:
  - labels (np.ndarray) – A 1D array of shape (N,) containing class labels for a standard (multi-class) classification dataset. Some given labels may be erroneous. Elements must be integers in the set 0, 1, …, K-1, where K is the number of classes.
  - pred_probs (np.ndarray) – Model-predicted class probabilities for each example in the dataset, in the same format expected by the filter.find_label_issues function.
  - thresholds (array_like, optional) – An array of shape (K, 1) or (K,) of per-class threshold probabilities, used to determine the cutoff probability necessary to consider an example as a given class label (see Northcutt et al., 2021, Section 3.1, Equation 2). This is for advanced users only. If not specified, these are computed for you automatically. If an example has a predicted probability greater than this threshold, it is counted as having true_label = k. This is not used for pruning/filtering, only for estimating the noise rates using confident counts.
  - converge_latent_estimates (bool, optional) – If True, forces numerical consistency of estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively make them mathematically consistent.
  - py_method ({"cnt", "eqn", "marginal", "marginal_ps"}, default "cnt") – How to compute the latent prior p(true_label=k). Default is "cnt", as it often works well even when the noise matrices are estimated poorly, by using the matrix diagonals instead of all the probabilities.
  - calibrate (bool, default True) – Calibrates the confident joint estimate P(label=i, true_label=j) such that np.sum(cj) == len(labels) and np.sum(cj, axis=1) == np.bincount(labels).
- Return type: Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]
- Returns: estimates (tuple) – A tuple of arrays: (py, noise_matrix, inverse_noise_matrix, confident_joint).
Note
Multi-label classification is not supported in this method.
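A minimal sketch going straight from (labels, pred_probs) to all four estimates, again on invented toy arrays:

```python
import numpy as np
from cleanlab.count import estimate_py_and_noise_matrices_from_probabilities

labels = np.array([0, 0, 1, 0, 2, 1])
pred_probs = np.array([  # hypothetical out-of-sample predicted probabilities
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.6, 0.1],
])

py, noise_matrix, inv_noise_matrix, confident_joint = (
    estimate_py_and_noise_matrices_from_probabilities(labels, pred_probs)
)
print(py)  # estimated prior p(true_label=k) for each class k
```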
- cleanlab.count.estimate_confident_joint_and_cv_pred_proba(X, labels, clf=LogisticRegression(), *, cv_n_folds=5, thresholds=None, seed=None, calibrate=True, clf_kwargs={}, validation_func=None)
Estimates P(labels, y), the confident counts of the latent joint distribution of true and noisy labels, using observed labels and predicted probabilities pred_probs.
The output of this function is an array of shape (K, K).
Under certain conditions, estimates are exact, and in many conditions, estimates are within one percent of actual.
Notes: there are two ways to compute the confident joint, each with pros and cons. (1) For each holdout set, compute the confident joint, then sum them up. (2) Compute pred_proba for each fold, combine them, then compute the confident joint. (1) is more accurate because it correctly computes thresholds for each fold; (2) is more accurate when you have only a little data because it computes the confident joint using all of the probabilities. For example, if you had 100 examples with 5-fold cross validation and uniform p(y), you would only have 20 examples to compute each confident joint from in (1). Such small amounts of data are bound to result in estimation errors. For this reason, we implement (2), but we implement (1) as a commented-out function at the end of this file.
- Parameters:
  - X (np.ndarray or pd.DataFrame) – Input feature matrix of shape (N, ...), where N is the number of examples. The classifier that this instance was initialized with, clf, must be able to fit() and predict() data with this format.
  - labels (np.ndarray or pd.Series) – A 1D array of shape (N,) containing class labels for a standard (multi-class) classification dataset. Some given labels may be erroneous. Elements must be integers in the set 0, 1, …, K-1, where K is the number of classes. All classes must be present in the dataset.
  - clf (estimator instance, optional) – A classifier implementing the sklearn estimator API.
  - cv_n_folds (int, default 5) – The number of cross-validation folds used to compute out-of-sample predicted probabilities for each example in X.
  - thresholds (array_like, optional) – An array of shape (K, 1) or (K,) of per-class threshold probabilities, used to determine the cutoff probability necessary to consider an example as a given class label (see Northcutt et al., 2021, Section 3.1, Equation 2). This is for advanced users only. If not specified, these are computed for you automatically. If an example has a predicted probability greater than this threshold, it is counted as having true_label = k. This is not used for pruning/filtering, only for estimating the noise rates using confident counts.
  - seed (int, optional) – Set the default state of the random number generator used to split the cross-validated folds. If None, uses the current random state of np.random.
  - calibrate (bool, default True) – Calibrates the confident joint estimate P(label=i, true_label=j) such that np.sum(cj) == len(labels) and np.sum(cj, axis=1) == np.bincount(labels).
  - clf_kwargs (dict, optional) – Optional keyword arguments to pass into clf’s fit() method.
  - validation_func (callable, optional) – Specifies how to map the validation data split in cross-validation as input for clf.fit(). For details, see the documentation of CleanLearning.fit.
- Return type: Tuple[np.ndarray, np.ndarray]
- Returns: estimates (tuple) – A tuple of two numpy arrays in the form: (joint counts matrix, predicted probability matrix).
Note
Multi-label classification is not supported in this method.
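A minimal end-to-end sketch on synthetic data (the make_classification parameters are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from cleanlab.count import estimate_confident_joint_and_cv_pred_proba

# Synthetic 3-class dataset:
X, labels = make_classification(
    n_samples=300, n_classes=3, n_informative=4, random_state=0
)
cj, pred_probs = estimate_confident_joint_and_cv_pred_proba(
    X, labels, clf=LogisticRegression(max_iter=1000), cv_n_folds=5, seed=0
)
print(cj.shape, pred_probs.shape)  # (3, 3) and (300, 3)
```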
- cleanlab.count.estimate_py_noise_matrices_and_cv_pred_proba(X, labels, clf=LogisticRegression(), *, cv_n_folds=5, thresholds=None, converge_latent_estimates=False, py_method='cnt', seed=None, clf_kwargs={}, validation_func=None)
This function computes the out-of-sample predicted probability P(label=k|x) for every example x in X using cross validation, while also computing the confident counts noise rates within each cross-validated subset and returning the average noise rate across all examples.
This function estimates the noise_matrix of shape (K, K). This is the fraction of examples in every class, labeled as every other class. The noise_matrix is a conditional probability matrix for P(label=k_s|true_label=k_y).
Under certain conditions, estimates are exact, and in most conditions, estimates are within one percent of the actual noise rates.
- Parameters:
  - X (np.ndarray) – Input feature matrix of shape (N, ...), where N is the number of examples. The classifier that this instance was initialized with, clf, must be able to handle data with this shape.
  - labels (np.ndarray) – A 1D array of shape (N,) containing class labels for a standard (multi-class) classification dataset. Some given labels may be erroneous. Elements must be integers in the set 0, 1, …, K-1, where K is the number of classes. All classes must be present in the dataset.
  - clf (estimator instance, optional) – A classifier implementing the sklearn estimator API.
  - cv_n_folds (int, default 5) – The number of cross-validation folds used to compute out-of-sample probabilities for each example in X.
  - thresholds (array_like, optional) – An array of shape (K, 1) or (K,) of per-class threshold probabilities, used to determine the cutoff probability necessary to consider an example as a given class label (see Northcutt et al., 2021, Section 3.1, Equation 2). This is for advanced users only. If not specified, these are computed for you automatically. If an example has a predicted probability greater than this threshold, it is counted as having true_label = k. This is not used for pruning/filtering, only for estimating the noise rates using confident counts.
  - converge_latent_estimates (bool, optional) – If True, forces numerical consistency of estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively make them mathematically consistent.
  - py_method ({"cnt", "eqn", "marginal", "marginal_ps"}, default "cnt") – How to compute the latent prior p(true_label=k). Default is "cnt", as it often works well even when the noise matrices are estimated poorly, by using the matrix diagonals instead of all the probabilities.
  - seed (int, optional) – Set the default state of the random number generator used to split the cross-validated folds. If None, uses the current random state of np.random.
  - clf_kwargs (dict, optional) – Optional keyword arguments to pass into clf’s fit() method.
  - validation_func (callable, optional) – Specifies how to map the validation data split in cross-validation as input for clf.fit(). For details, see the documentation of CleanLearning.fit.
- Return type: Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray]
- Returns: estimates (tuple) – A tuple of five arrays: (py, noise matrix, inverse noise matrix, confident joint, predicted probability matrix).
Note
Multi-label classification is not supported in this method.
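A minimal sketch unpacking all five returned estimates (synthetic data with arbitrary parameters, for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from cleanlab.count import estimate_py_noise_matrices_and_cv_pred_proba

X, labels = make_classification(
    n_samples=300, n_classes=3, n_informative=4, random_state=0
)
py, noise_matrix, inv_noise_matrix, confident_joint, pred_probs = (
    estimate_py_noise_matrices_and_cv_pred_proba(
        X, labels, clf=LogisticRegression(max_iter=1000), seed=0
    )
)
print(noise_matrix.shape, pred_probs.shape)  # (3, 3) and (300, 3)
```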
- cleanlab.count.estimate_cv_predicted_probabilities(X, labels, clf=LogisticRegression(), *, cv_n_folds=5, seed=None, clf_kwargs={}, validation_func=None)
This function computes the out-of-sample predicted probability P(label=k|x) for every example in X using cross validation. Output is an np.ndarray of shape (N, K), where N is the number of training examples and K is the number of classes.
- Parameters:
  - X (np.ndarray) – Input feature matrix of shape (N, ...), where N is the number of examples. The classifier that this instance was initialized with, clf, must be able to handle data with this shape.
  - labels (np.ndarray) – A 1D array of shape (N,) containing class labels for a standard (multi-class) classification dataset. Some given labels may be erroneous. Elements must be integers in the set 0, 1, …, K-1, where K is the number of classes. All classes must be present in the dataset.
  - clf (estimator instance, optional) – A classifier implementing the sklearn estimator API.
  - cv_n_folds (int, default 5) – The number of cross-validation folds used to compute out-of-sample probabilities for each example in X.
  - seed (int, optional) – Set the default state of the random number generator used to split the cross-validated folds. If None, uses the current random state of np.random.
  - clf_kwargs (dict, optional) – Optional keyword arguments to pass into clf’s fit() method.
  - validation_func (callable, optional) – Specifies how to map the validation data split in cross-validation as input for clf.fit(). For details, see the documentation of CleanLearning.fit.
- Return type: np.ndarray
- Returns: pred_probs (np.ndarray) – An array of shape (N, K) representing P(label=k|x), the model-predicted probabilities. Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class.
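A minimal sketch computing out-of-sample probabilities with a different classifier (synthetic data, arbitrary parameters):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from cleanlab.count import estimate_cv_predicted_probabilities

X, labels = make_classification(
    n_samples=300, n_classes=3, n_informative=4, random_state=0
)
pred_probs = estimate_cv_predicted_probabilities(
    X, labels, clf=RandomForestClassifier(random_state=0), cv_n_folds=5, seed=0
)
print(pred_probs.shape)  # (300, 3); each row sums to 1
```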
- cleanlab.count.estimate_noise_matrices(X, labels, clf=LogisticRegression(), *, cv_n_folds=5, thresholds=None, converge_latent_estimates=True, seed=None, clf_kwargs={}, validation_func=None)
Estimates the noise_matrix of shape (K, K). This is the fraction of examples in every class, labeled as every other class. The noise_matrix is a conditional probability matrix for P(label=k_s|true_label=k_y).
Under certain conditions, estimates are exact, and in most conditions, estimates are within one percent of the actual noise rates.
- Parameters:
  - X (np.ndarray) – Input feature matrix of shape (N, ...), where N is the number of examples. The classifier that this instance was initialized with, clf, must be able to handle data with this shape.
  - labels (np.ndarray) – An array of shape (N,) of noisy labels, i.e. some labels may be erroneous. Elements must be integers in the set 0, 1, …, K-1, where K is the number of classes.
  - clf (estimator instance, optional) – A classifier implementing the sklearn estimator API.
  - cv_n_folds (int, default 5) – The number of cross-validation folds used to compute out-of-sample probabilities for each example in X.
  - thresholds (array_like, optional) – An array of shape (K, 1) or (K,) of per-class threshold probabilities, used to determine the cutoff probability necessary to consider an example as a given class label (see Northcutt et al., 2021, Section 3.1, Equation 2). This is for advanced users only. If not specified, these are computed for you automatically. If an example has a predicted probability greater than this threshold, it is counted as having true_label = k. This is not used for pruning/filtering, only for estimating the noise rates using confident counts.
  - converge_latent_estimates (bool, optional) – If True, forces numerical consistency of estimates. Each is estimated independently, but they are related mathematically with closed form equivalences. This will iteratively make them mathematically consistent.
  - seed (int, optional) – Set the default state of the random number generator used to split the cross-validated folds. If None, uses the current random state of np.random.
  - clf_kwargs (dict, optional) – Optional keyword arguments to pass into clf’s fit() method.
  - validation_func (callable, optional) – Specifies how to map the validation data split in cross-validation as input for clf.fit(). For details, see the documentation of CleanLearning.fit.
- Return type: Tuple[np.ndarray, np.ndarray]
- Returns: estimates (tuple) – A tuple containing arrays (noise_matrix, inv_noise_matrix).
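A minimal usage sketch (synthetic data, arbitrary parameters):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from cleanlab.count import estimate_noise_matrices

X, labels = make_classification(
    n_samples=300, n_classes=3, n_informative=4, random_state=0
)
noise_matrix, inv_noise_matrix = estimate_noise_matrices(
    X, labels, clf=LogisticRegression(max_iter=1000), seed=0
)
print(noise_matrix)  # noise_matrix[s, y] estimates P(label=s | true_label=y)
```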
- cleanlab.count.get_confident_thresholds(labels, pred_probs, multi_label=False)
Returns expected (average) “self-confidence” for each class.
The confident class threshold for a class j is the expected (average) “self-confidence” for class j, i.e. the model-predicted probability of this class averaged amongst all examples labeled as class j.
- Parameters:
  - labels (np.ndarray or list) – Given class labels for each example in the dataset, some of which may be erroneous, in the same format expected by the filter.find_label_issues function.
  - pred_probs (np.ndarray) – Model-predicted class probabilities for each example in the dataset, in the same format expected by the filter.find_label_issues function.
  - multi_label (bool, default False) – Set False if your dataset is for regular (multi-class) classification, where each example belongs to exactly one class. Set True if your dataset is for multi-label classification, where each example can belong to multiple classes. See the documentation of compute_confident_joint for details.
- Return type: np.ndarray
- Returns: confident_thresholds (np.ndarray) – An array of shape (K,) where K is the number of classes.
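A minimal sketch comparing the returned thresholds against a plain-numpy rendering of the definition above (toy arrays, invented for illustration; exact agreement may depend on implementation details):

```python
import numpy as np
from cleanlab.count import get_confident_thresholds

labels = np.array([0, 0, 1, 0, 2, 1])
pred_probs = np.array([  # hypothetical out-of-sample predicted probabilities
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.6, 0.1],
])

thresholds = get_confident_thresholds(labels, pred_probs)
# Per the definition above: mean self-confidence among examples labeled as class k.
manual = np.array(
    [pred_probs[labels == k, k].mean() for k in range(pred_probs.shape[1])]
)
print(thresholds, manual)
```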