noise_generation
Helper methods that are useful for benchmarking cleanlab’s core algorithms. These methods introduce synthetic noise into the labels of a classification dataset. Specifically, this module provides methods for generating valid noise matrices (for which learning with noise is possible), generating noisy labels given a noise matrix, generating valid noise matrices with a specific trace value, and more.
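Functions:
generate_n_rand_probabilities_that_sum_to_m – Generates n random probabilities that sum to m.
generate_noise_matrix_from_trace – Generates a K x K noise matrix P(label=k_s|true_label=k_y) with np.sum(np.diagonal(noise_matrix)) equal to the given trace.
generate_noisy_labels – Generates noisy labels from perfect labels true_labels, "exactly" yielding the provided noise_matrix between labels and true_labels.
noise_matrix_is_valid – Given a prior py representing p(true_label=k), checks if the given noise_matrix is a learnable matrix.
randomly_distribute_N_balls_into_K_bins – Returns a uniformly random numpy integer array of length K that sums to N.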
cleanlab.benchmarking.noise_generation.generate_n_rand_probabilities_that_sum_to_m(n, m, *, max_prob=1.0, min_prob=0.0)
Generates n random probabilities that sum to m.
When min_prob=0 and max_prob=1.0, use np.random.dirichlet(np.ones(n))*m instead.
Parameters:
n (int) – Length of array of random probabilities to be returned.
m (float) – Sum of array of random probabilities that is returned.
max_prob (float, default 1.0) – Maximum probability of any entry in the returned array. Must be between 0 and 1.
min_prob (float, default 0.0) – Minimum probability of any entry in the returned array. Must be between 0 and 1.
Return type:
ndarray
Returns:
probabilities (np.ndarray) – An array of probabilities.
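The Dirichlet shortcut mentioned above can be sketched with numpy alone. This is a minimal illustration, not the function's own implementation; the values of n and m and the seed are arbitrary assumptions, and m is kept at or below 1 so that no entry can exceed max_prob=1.0.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n, m = 5, 0.6                   # illustrative values (assumptions)

# Dirichlet(1, ..., 1) draws a point uniformly from the probability simplex:
# n non-negative entries that sum to 1. Scaling by m makes them sum to m.
probs = rng.dirichlet(np.ones(n)) * m

print(probs.shape)                  # one entry per requested probability
print(np.isclose(probs.sum(), m))   # the entries sum to m
```

The unconstrained case needs no rejection or repair step, which is presumably why the documentation recommends the Dirichlet call there.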
cleanlab.benchmarking.noise_generation.generate_noise_matrix_from_trace(K, trace, *, max_trace_prob=1.0, min_trace_prob=1e-05, max_noise_rate=0.99999, min_noise_rate=0.0, valid_noise_matrix=True, py=None, frac_zero_noise_rates=0.0, seed=0, max_iter=10000)
Generates a K x K noise matrix P(label=k_s|true_label=k_y) with np.sum(np.diagonal(noise_matrix)) equal to the given trace.
Parameters:
K (int) – Creates a noise matrix of shape (K, K). Implies there are K classes for learning with noisy labels.
trace (float) – Sum of the diagonal entries of the returned noise matrix.
max_trace_prob (float) – Maximum probability of any diagonal entry of the returned matrix.
min_trace_prob (float) – Minimum probability of any diagonal entry of the returned matrix.
max_noise_rate (float) – Maximum noise rate (non-diagonal entry) in the returned np.ndarray.
min_noise_rate (float) – Minimum noise rate (non-diagonal entry) in the returned np.ndarray.
valid_noise_matrix (bool, default True) – If True, returns a matrix having all necessary conditions for learning with noisy labels. In particular, p(true_label=k)p(label=k) < p(true_label=k,label=k) is satisfied. This requires that trace > 1.
py (np.ndarray) – An array of shape (K,) representing the fraction (prior probability) of each true class label, P(true_label = k). This argument is required when valid_noise_matrix=True.
frac_zero_noise_rates (float) – The fraction of the K*(K-1) noise rates that will be set to 0. Note that if you set a high trace, it may be impossible to also have a low fraction of zero noise rates without forcing all non-1 diagonal values. When this happens, the function only guarantees a noise matrix whose fraction of zero noise rates is frac_zero_noise_rates or higher. The opposite occurs with a small trace.
seed (int) – Seeds the random number generator for numpy.
max_iter (int, default 10000) – The maximum number of tries to produce a valid matrix before returning None.
Return type:
Optional[ndarray]
Returns:
noise_matrix (np.ndarray or None) – An array of shape (K, K) representing the noise matrix P(label=k_s|true_label=k_y), with np.sum(np.diagonal(noise_matrix)) equal to the given trace. This is a conditional probability matrix and a left stochastic matrix (each column sums to 1). Returns None if max_iter is exceeded.
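The quantities named in the parameter list can be computed directly with numpy. The sketch below uses a hand-written 3-class matrix (an assumption for illustration, not output of generate_noise_matrix_from_trace) to show what "columns sum to 1", the trace, and the fraction of zero noise rates refer to.

```python
import numpy as np

# Hand-written example: entry [k_s, k_y] is P(label=k_s | true_label=k_y).
noise_matrix = np.array([
    [0.8, 0.1, 0.0],
    [0.1, 0.7, 0.1],
    [0.1, 0.2, 0.9],
])
K = noise_matrix.shape[0]

# Each column is a distribution over noisy labels for one true class.
column_sums = noise_matrix.sum(axis=0)

# The diagonal holds the probabilities of keeping the correct label.
trace = np.sum(np.diagonal(noise_matrix))

# The K*(K-1) off-diagonal entries are the noise rates.
noise_rates = noise_matrix[~np.eye(K, dtype=bool)]
frac_zero = np.mean(noise_rates == 0)

print(column_sums, trace, noise_rates.size, frac_zero)
```

Here the trace is 2.4 and one of the six noise rates is zero.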
cleanlab.benchmarking.noise_generation.generate_noisy_labels(true_labels, noise_matrix)
Generates noisy labels from perfect labels true_labels, “exactly” yielding the provided noise_matrix between labels and true_labels.
Below we provide a for loop implementation of what this function does. We do not use this implementation as it is not a fast algorithm, but it explains as Python pseudocode what is happening in this function.
Parameters:
true_labels (np.ndarray) – An array of shape (N,) representing perfect labels, without any noise. Contains K distinct natural number classes, 0, 1, …, K-1.
noise_matrix (np.ndarray) – An array of shape (K, K) representing the conditional probability matrix P(label=k_s|true_label=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.
Return type:
ndarray
Returns:
labels (np.ndarray) – An array of shape (N,) of noisy labels.
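Examples
    import numpy as np

    # Pseudocode for what this function does (slow; not the actual implementation).
    K = len(noise_matrix)
    # Prior of each true class, P(true_label=k_y).
    py = np.bincount(true_labels, minlength=K) / len(true_labels)
    # Target number of examples with noisy label k_s and true label k_y.
    count_joint = (noise_matrix * py * len(true_labels)).round().astype(int)
    labels = np.array(true_labels)
    for k_s in range(K):
        for k_y in range(K):
            if k_s != k_y:
                # Examples of true class k_y that have not been flipped yet.
                idx_flip = np.where((labels == k_y) & (true_labels == k_y))[0]
                if len(idx_flip):
                    labels[np.random.choice(
                        idx_flip,
                        count_joint[k_s][k_y],
                        replace=False,
                    )] = k_s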
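To see what "exactly yielding the provided noise_matrix" means, one can estimate the noise matrix empirically from a pair of label arrays and compare it with the matrix that was requested. The following is a minimal sketch; the two arrays are hand-written for illustration rather than produced by generate_noisy_labels.

```python
import numpy as np

true_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
labels = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 1])  # hand-flipped noisy labels
K = 2

# counts[k_s, k_y] = number of examples with noisy label k_s and true label k_y.
counts = np.zeros((K, K), dtype=int)
np.add.at(counts, (labels, true_labels), 1)

# Normalizing each column by the size of its true class gives
# the empirical P(label=k_s | true_label=k_y).
empirical_noise_matrix = counts / counts.sum(axis=0)

print(empirical_noise_matrix)
```

Class 0 keeps 3 of its 4 labels and class 1 keeps 4 of its 6, so the columns are (0.75, 0.25) and (1/3, 2/3).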
cleanlab.benchmarking.noise_generation.noise_matrix_is_valid(noise_matrix, py, *, verbose=False)
Given a prior py representing p(true_label=k), checks if the given noise_matrix is a learnable matrix. Learnability means that it is possible to achieve better than random performance, on average, for the amount of noise in noise_matrix.
Parameters:
noise_matrix (np.ndarray) – An array of shape (K, K) representing the conditional probability matrix P(label=k_s|true_label=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.
py (np.ndarray) – An array of shape (K,) representing the fraction (prior probability) of each true class label, P(true_label = k).
Return type:
bool
Returns:
is_valid (bool) – Whether the noise matrix is a learnable matrix.
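The condition quoted for valid_noise_matrix, p(true_label=k)p(label=k) < p(true_label=k,label=k), can be evaluated with a few lines of numpy. The helper below is hypothetical and only checks that one inequality; whether noise_matrix_is_valid performs exactly this test, or additional ones, is an assumption not confirmed here.

```python
import numpy as np

def satisfies_diagonal_condition(noise_matrix, py):
    """Hypothetical helper: test p(true_label=k) * p(label=k) < p(true_label=k, label=k) for all k."""
    joint = noise_matrix * py   # joint[k_s, k_y] = P(label=k_s, true_label=k_y)
    ps = joint.sum(axis=1)      # P(label=k), marginalized over true labels
    return bool(np.all(py * ps < np.diagonal(joint)))

py = np.array([0.5, 0.5])
mostly_correct = np.array([[0.9, 0.2], [0.1, 0.8]])  # trace 1.7
mostly_flipped = np.array([[0.2, 0.8], [0.8, 0.2]])  # trace 0.4

print(satisfies_diagonal_condition(mostly_correct, py))  # True
print(satisfies_diagonal_condition(mostly_flipped, py))  # False
```

The second matrix flips most labels, so the given label agrees with the true label less often than chance would predict and the inequality fails.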
cleanlab.benchmarking.noise_generation.randomly_distribute_N_balls_into_K_bins(N, K, *, max_balls_per_bin=None, min_balls_per_bin=None)
Returns a uniformly random numpy integer array of length K that sums to N: one entry per bin, giving the number of balls in that bin.
Parameters:
N (int) – Number of balls.
K (int) – Number of bins.
max_balls_per_bin (int) – Ensure that each bin contains at most max_balls_per_bin balls.
min_balls_per_bin (int) – Ensure that each bin contains at least min_balls_per_bin balls.
Return type:
ndarray
Returns:
int_array (np.array) – Length K array that sums to N.
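Reading N as the number of balls and K as the number of bins, as the parameter descriptions do, a result holds one count per bin. The sketch below is a simple stand-in under that reading, not the library's algorithm: it ignores max_balls_per_bin and min_balls_per_bin, and dropping balls independently does not give a uniform distribution over all possible count vectors.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
N, K = 10, 4                    # 10 balls, 4 bins (illustrative values)

# Drop each ball into a bin chosen uniformly at random,
# then count how many balls landed in each bin.
bins = rng.integers(0, K, size=N)
counts = np.bincount(bins, minlength=K)

print(counts, counts.sum())
```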