latent_algebra
Contains mathematical functions relating the latent terms P(given_label), P(given_label | true_label), P(true_label | given_label), P(true_label), etc. together.
For every function here, if the inputs are exact, the output is guaranteed to be exact.
Every function herein is the computational equivalent of a mathematical equation having a closed, exact form.
If the inputs are inexact, the error will of course propagate.
Throughout, K denotes the number of classes in the classification task.
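Functions:

compute_inv_noise_matrix(py, noise_matrix, *, ps=None) – Compute the inverse noise matrix if py := P(true_label=k) is given.

compute_noise_matrix_from_inverse(ps, inverse_noise_matrix, *, py=None) – Compute the noise matrix P(label=k_s|true_label=k_y).

compute_ps_py_inv_noise_matrix(labels, noise_matrix) – Compute ps := P(labels=k), py := P(true_labels=k), and the inverse noise matrix.

compute_py(ps, noise_matrix, inverse_noise_matrix, *, py_method='cnt', true_labels_class_counts=None) – Compute py := P(true_labels=k) from ps := P(labels=k), noise_matrix, and inverse_noise_matrix.

compute_py_inv_noise_matrix(ps, noise_matrix) – Compute py := P(true_label=k), and the inverse noise matrix.

compute_pyx(pred_probs, noise_matrix, inverse_noise_matrix) – Compute pyx := P(true_label=k|x) from pred_probs := P(label=k|x), noise_matrix and inverse_noise_matrix.
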
cleanlab.internal.latent_algebra.compute_inv_noise_matrix(py, noise_matrix, *, ps=None)
Compute the inverse noise matrix if py := P(true_label=k) is given.
 Parameters:
py (np.ndarray (shape (K, 1))) – The fraction (prior probability) of each TRUE class label, P(true_label = k).
noise_matrix (np.ndarray) – A conditional probability matrix (of shape (K, K)) of the form P(label=k_s|true_label=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.
ps (np.ndarray) – Array of shape (K, 1) containing the fraction (prior probability) of each NOISY given label, P(labels = k). ps is easily computable from py and should only be provided if it has already been precomputed, to increase code efficiency.
Examples
For loop based implementation:
    import numpy as np

    # Number of classes
    K = len(py)

    # 'ps' is p(labels=k) = noise_matrix * p(true_labels=k)
    # because in *vector computation*: P(label=k|true_label=k) * p(true_label=k) = P(label=k)
    if ps is None:
        ps = noise_matrix.dot(py)

    # Estimate the (K, K) inverse noise matrix P(true_label = k_y | label = k_s)
    inverse_noise_matrix = np.empty(shape=(K, K))
    # k_s is the class value k of noisy label `label == k`
    for k_s in range(K):
        # k_y is the (guessed) class value k of true label y
        for k_y in range(K):
            # P(true_label|label) = P(label|true_label) * P(true_label) / P(label)
            inverse_noise_matrix[k_y][k_s] = noise_matrix[k_s][k_y] * py[k_y] / ps[k_s]
 Return type:
ndarray
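The for-loop example applies Bayes' rule one entry at a time. As a minimal sketch (not necessarily how cleanlab implements it), the same quantity can be computed with NumPy broadcasting. The helper name `inv_noise_matrix_sketch` and the toy numbers are hypothetical, and py is flattened to a 1-D array of length K for simplicity.

```python
import numpy as np

def inv_noise_matrix_sketch(py, noise_matrix, ps=None):
    """Hypothetical helper: vectorized Bayes' rule giving P(true_label=k_y | label=k_s)."""
    py = np.asarray(py, dtype=float).ravel()
    noise_matrix = np.asarray(noise_matrix, dtype=float)
    # joint[k_s, k_y] = P(label=k_s | true_label=k_y) * P(true_label=k_y)
    joint = noise_matrix * py
    if ps is None:
        # P(label=k_s) is the row sum of the joint, i.e. noise_matrix.dot(py)
        ps = joint.sum(axis=1)
    ps = np.asarray(ps, dtype=float).ravel()
    # Transpose so rows index k_y, then divide column k_s by P(label=k_s)
    return joint.T / ps

# Toy 2-class problem (made-up numbers); columns of noise_matrix sum to 1
noise_matrix = np.array([[0.9, 0.2],
                         [0.1, 0.8]])
py = np.array([0.4, 0.6])
inv = inv_noise_matrix_sketch(py, noise_matrix)
print(inv)
```

With exact inputs, each column of the result sums to 1, as expected of a conditional distribution over true labels.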
 cleanlab.internal.latent_algebra.compute_noise_matrix_from_inverse(ps, inverse_noise_matrix, *, py=None)
Compute the noise matrix P(label=k_s|true_label=k_y).
 Parameters:
py (np.ndarray) – Array of shape (K, 1) containing the fraction (prior probability) of each TRUE class label, P(true_label = k). py is easily computable from ps and inverse_noise_matrix and should only be provided if it has already been precomputed, to increase code efficiency.
inverse_noise_matrix (np.ndarray) – A conditional probability matrix (of shape (K, K)) of the form P(true_label=k_y|label=k_s) representing the estimated fraction of observed examples in each class k_s that are mislabeled examples from every other class k_y. If None, the inverse_noise_matrix will be computed from pred_probs and labels. Assumes columns of inverse_noise_matrix sum to 1.
ps (np.ndarray) – Array of shape (K, 1) containing the fraction (prior probability) of each observed NOISY label, P(labels = k).
 Return type:
ndarray
 Returns:
noise_matrix (np.ndarray) – Array of shape (K, K), where K = number of classes, whose columns sum to 1. A conditional probability matrix of the form P(label=k_s|true_label=k_y) containing the fraction of examples in every class, labeled as every other class.
Examples
For loop based implementation:
    import numpy as np

    # Number of classes
    K = len(ps)

    # 'py' is p(true_label=k) = inverse_noise_matrix * p(label=k)
    # because in *vector computation*: P(true_label=k|label=k) * p(label=k) = P(true_label=k)
    if py is None:
        py = inverse_noise_matrix.dot(ps)

    # Estimate the (K, K) noise matrix P(labels = k_s | true_labels = k_y)
    noise_matrix = np.empty(shape=(K, K))
    # k_s is the class value k of noisy label `labels == k`
    for k_s in range(K):
        # k_y is the (guessed) class value k of true label y
        for k_y in range(K):
            # P(labels|true_label) = P(true_label|labels) * P(labels) / P(true_label)
            noise_matrix[k_s][k_y] = inverse_noise_matrix[k_y][k_s] * ps[k_s] / py[k_y]
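The reverse direction uses the same Bayes' rule with the roles of ps and py swapped. The following is a hedged, vectorized sketch rather than the library's own code; `noise_matrix_from_inverse_sketch` is a hypothetical name, the inputs are made-up toy values, and ps is flattened to 1-D.

```python
import numpy as np

def noise_matrix_from_inverse_sketch(ps, inverse_noise_matrix, py=None):
    """Hypothetical helper: recover P(label=k_s | true_label=k_y) via Bayes' rule."""
    ps = np.asarray(ps, dtype=float).ravel()
    inverse_noise_matrix = np.asarray(inverse_noise_matrix, dtype=float)
    # joint[k_y, k_s] = P(true_label=k_y | label=k_s) * P(label=k_s)
    joint = inverse_noise_matrix * ps
    if py is None:
        # P(true_label=k_y) is the row sum, i.e. inverse_noise_matrix.dot(ps)
        py = joint.sum(axis=1)
    py = np.asarray(py, dtype=float).ravel()
    # Transpose so rows index k_s, then divide column k_y by P(true_label=k_y)
    return joint.T / py

# Toy inputs (made up); columns of inverse_noise_matrix sum to 1
inverse_noise_matrix = np.array([[0.75, 1 / 13],
                                 [0.25, 12 / 13]])
ps = np.array([0.48, 0.52])
noise_matrix = noise_matrix_from_inverse_sketch(ps, inverse_noise_matrix)
print(noise_matrix)
```

Because every step is a closed-form identity, exact inputs yield a noise matrix whose columns sum to 1.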
 cleanlab.internal.latent_algebra.compute_ps_py_inv_noise_matrix(labels, noise_matrix)
Compute ps := P(labels=k), py := P(true_labels=k), and the inverse noise matrix.
 Parameters:
labels (np.ndarray) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirements: for a dataset with K classes, labels must be in {0, 1, ..., K-1}.
noise_matrix (np.ndarray) – A conditional probability matrix (of shape (K, K)) of the form P(label=k_s|true_label=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.
 Return type:
Tuple[ndarray, ndarray, ndarray]
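Of the three outputs, only ps depends directly on the data; py and the inverse noise matrix then follow from ps and noise_matrix. As a small sketch of that first step, assuming ps is simply the empirical frequency of each class among the given labels (the use of `np.bincount` and the toy `labels` array are choices made here, not taken from the library):

```python
import numpy as np

# Toy noisy labels for a K = 3 class problem (made-up data)
labels = np.array([0, 1, 1, 2, 0, 1, 2, 2, 2, 1])
K = 3

# Empirical P(labels = k): count of each class divided by the number of examples.
# minlength=K keeps a zero entry for any class that never appears.
ps = np.bincount(labels, minlength=K) / len(labels)
print(ps)
```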
 cleanlab.internal.latent_algebra.compute_py(ps, noise_matrix, inverse_noise_matrix, *, py_method='cnt', true_labels_class_counts=None)
Compute py := P(true_labels=k) from ps := P(labels=k), noise_matrix, and inverse_noise_matrix.
This method is ROBUST when py_method = 'cnt'. It may work well even when the noise matrices are estimated poorly, because it uses the diagonals of the matrices instead of all the probabilities in the entire matrix.
 Parameters:
ps (np.ndarray) – Array of shape (K, ) or (1, K) containing the fraction (prior probability) of each observed, noisy label, P(labels = k).
noise_matrix (np.ndarray) – A conditional probability matrix (of shape (K, K)) of the form P(label=k_s|true_label=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.
inverse_noise_matrix (np.ndarray of shape (K, K), K = number of classes) – A conditional probability matrix (of shape (K, K)) of the form P(true_label=k_y|label=k_s) representing the estimated fraction of observed examples in each class k_s that are mislabeled examples from every other class k_y. If None, the inverse_noise_matrix will be computed from pred_probs and labels. Assumes columns of inverse_noise_matrix sum to 1.
py_method (str (Options: ["cnt", "eqn", "marginal", "marginal_ps"])) – How to compute the latent prior p(true_label=k). Default is "cnt" as it often works well even when the noise matrices are estimated poorly, because it uses the matrix diagonals instead of all the probabilities.
true_labels_class_counts (np.ndarray) – Array of shape (K, ) or (1, K) containing the marginal counts of the confident joint (like cj.sum(axis = 0)).
 Return type:
ndarray
 Returns:
py (np.ndarray) – Array of shape (K, ) or (1, K). The fraction (prior probability) of each TRUE class label, P(true_label = k).
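To illustrate how the diagonals alone can determine py: Bayes' rule on the diagonal entries gives inverse_noise_matrix[k][k] = noise_matrix[k][k] * py[k] / ps[k], which rearranges to py[k] = inverse_noise_matrix[k][k] * ps[k] / noise_matrix[k][k]. The sketch below implements that rearrangement. Whether it matches the "cnt" option exactly is an assumption, and `py_from_diagonals_sketch` is a hypothetical name.

```python
import numpy as np

def py_from_diagonals_sketch(ps, noise_matrix, inverse_noise_matrix):
    """Hypothetical helper: estimate P(true_label=k) from the matrix diagonals only."""
    ps = np.asarray(ps, dtype=float).ravel()
    nm_diag = np.asarray(noise_matrix, dtype=float).diagonal()
    inv_diag = np.asarray(inverse_noise_matrix, dtype=float).diagonal()
    # py[k] = P(true_label=k | label=k) * P(label=k) / P(label=k | true_label=k)
    return inv_diag * ps / nm_diag

# Toy, mutually consistent inputs (made-up numbers)
noise_matrix = np.array([[0.9, 0.2],
                         [0.1, 0.8]])
inverse_noise_matrix = np.array([[0.75, 1 / 13],
                                 [0.25, 12 / 13]])
ps = np.array([0.48, 0.52])
py = py_from_diagonals_sketch(ps, noise_matrix, inverse_noise_matrix)
print(py)
```

Off-diagonal entries never enter the computation, so errors in them cannot affect the estimate.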
 cleanlab.internal.latent_algebra.compute_py_inv_noise_matrix(ps, noise_matrix)
Compute py := P(true_label=k), and the inverse noise matrix.
 Parameters:
ps (np.ndarray) – Array of shape (K, ) or (1, K). The fraction (prior probability) of each observed, NOISY class, P(labels = k).
noise_matrix (np.ndarray) – A conditional probability matrix (of shape (K, K)) of the form P(label=k_s|true_label=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.
 Return type:
Tuple[ndarray, ndarray]
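Since ps = noise_matrix · py, one way to recover py from ps is to solve that linear system, after which the inverse noise matrix follows from Bayes' rule. The sketch below assumes noise_matrix is invertible and uses `np.linalg.solve`; whether the library solves the system this way is not stated here, and the toy values are made up.

```python
import numpy as np

# Toy inputs (made-up numbers); columns of noise_matrix sum to 1
noise_matrix = np.array([[0.9, 0.2],
                         [0.1, 0.8]])
ps = np.array([0.48, 0.52])

# Solve noise_matrix @ py = ps for the latent prior py
py = np.linalg.solve(noise_matrix, ps)

# Bayes' rule: P(true_label=k_y | label=k_s)
#   = P(label=k_s | true_label=k_y) * P(true_label=k_y) / P(label=k_s)
inverse_noise_matrix = (noise_matrix * py).T / ps

print(py)
print(inverse_noise_matrix)
```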
 cleanlab.internal.latent_algebra.compute_pyx(pred_probs, noise_matrix, inverse_noise_matrix)
Compute pyx := P(true_label=k|x) from pred_probs := P(label=k|x), noise_matrix and inverse_noise_matrix.
This method is ROBUST, meaning it works well even when the noise matrices are estimated poorly, because it only uses the diagonals of the matrices, which tend to be easy to estimate correctly.
 Parameters:
pred_probs (np.ndarray) – P(label=k|x) is a (N x K) matrix with K model-predicted probabilities. Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class. The columns must be ordered such that these probabilities correspond to classes 0, 1, 2, and so on. pred_probs should have been computed using 3 (or higher) fold cross-validation.
noise_matrix (np.ndarray) – A conditional probability matrix (of shape (K, K)) of the form P(label=k_s|true_label=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.
inverse_noise_matrix (np.ndarray) – A conditional probability matrix (of shape (K, K)) of the form P(true_label=k_y|label=k_s) representing the estimated fraction of observed examples in each class k_s that are mislabeled examples from every other class k_y. If None, the inverse_noise_matrix will be computed from pred_probs and labels. Assumes columns of inverse_noise_matrix sum to 1.
 Returns:
pyx (np.ndarray) – P(true_label=k|x) is a (N, K) matrix of model-predicted probabilities. Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class. The columns must be ordered such that these probabilities correspond to classes 0, 1, 2, and so on. pred_probs should have been computed using 3 (or higher) fold cross-validation.
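One plausible diagonal-only conversion, offered as a sketch under stated assumptions rather than as the library's implementation, rescales column k of pred_probs by inverse_noise_matrix[k][k] / noise_matrix[k][k] and then renormalizes each row. The renormalization step, the name `pyx_sketch`, and the toy inputs are all assumptions made for illustration.

```python
import numpy as np

def pyx_sketch(pred_probs, noise_matrix, inverse_noise_matrix):
    """Hypothetical helper: approximate P(true_label=k | x) using matrix diagonals only."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    nm_diag = np.asarray(noise_matrix, dtype=float).diagonal()
    inv_diag = np.asarray(inverse_noise_matrix, dtype=float).diagonal()
    # Scale column k by P(true_label=k | label=k) / P(label=k | true_label=k)
    pyx = pred_probs * (inv_diag / nm_diag)
    # Renormalize so each row is a probability distribution (assumption of this sketch)
    return pyx / pyx.sum(axis=1, keepdims=True)

# Toy inputs (made-up numbers): N = 2 examples, K = 2 classes
pred_probs = np.array([[0.7, 0.3],
                       [0.2, 0.8]])
noise_matrix = np.array([[0.9, 0.2],
                         [0.1, 0.8]])
inverse_noise_matrix = np.array([[0.75, 1 / 13],
                                 [0.25, 12 / 13]])
pyx = pyx_sketch(pred_probs, noise_matrix, inverse_noise_matrix)
print(pyx)
```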