noise_generation
Helper methods that are useful for benchmarking cleanlab’s core algorithms. These methods introduce synthetic noise into the labels of a classification dataset. Specifically, this module provides methods for generating valid noise matrices (for which learning with noise is possible), generating noisy labels given a noise matrix, generating valid noise matrices with a specific trace value, and more.
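Functions:
- noise_matrix_is_valid: Given a prior py representing p(true_label=k), checks if the given noise_matrix is a learnable matrix.
- generate_noisy_labels: Generates noisy labels from perfect labels true_labels, "exactly" yielding the provided noise_matrix between labels and true_labels.
- generate_noise_matrix_from_trace: Generates a K x K noise matrix P(label=k_s|true_label=k_y) with np.sum(np.diagonal(noise_matrix)) equal to the given trace.
- generate_n_rand_probabilities_that_sum_to_m: Generates n random probabilities that sum to m.
- randomly_distribute_N_balls_into_K_bins: Returns a uniformly random numpy integer array of bin counts for N balls placed into K bins.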
- cleanlab.benchmarking.noise_generation.noise_matrix_is_valid(noise_matrix, py, *, verbose=False)
Given a prior py representing p(true_label=k), checks if the given noise_matrix is a learnable matrix. Learnability means that it is possible to achieve better than random performance, on average, for the amount of noise in noise_matrix.
- Parameters:
  noise_matrix (np.ndarray) – An array of shape (K, K) representing the conditional probability matrix P(label=k_s|true_label=k_y): entry (k_s, k_y) is the fraction of examples with true class k_y that are labeled k_s. Assumes columns of noise_matrix sum to 1.
  py (np.ndarray) – An array of shape (K,) representing the fraction (prior probability) of each true class label, P(true_label = k).
- Return type:
  bool
- Returns:
  is_valid (bool) – Whether the noise matrix is a learnable matrix.
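To make the idea of learnability concrete, the sketch below checks the inequality p(true_label=k)p(label=k) < p(true_label=k,label=k), which this page lists under valid_noise_matrix as one condition for learning with noisy labels. It is a minimal illustration using only the standard library. The helper name satisfies_learnability_condition is hypothetical, and the code is not presented as cleanlab's implementation of noise_matrix_is_valid, which may perform additional checks.

```python
def satisfies_learnability_condition(noise_matrix, py):
    """Check p(true_label=k) * p(label=k) < p(true_label=k, label=k) for every class k.

    noise_matrix[k_s][k_y] is P(label=k_s | true_label=k_y); columns sum to 1.
    Hypothetical helper for illustration only, not cleanlab's implementation.
    """
    K = len(py)
    # Joint distribution: joint[k_s][k_y] = P(label=k_s, true_label=k_y)
    joint = [[noise_matrix[k_s][k_y] * py[k_y] for k_y in range(K)]
             for k_s in range(K)]
    # Marginal distribution of the noisy labels: ps[k_s] = P(label=k_s)
    ps = [sum(joint[k_s]) for k_s in range(K)]
    return all(py[k] * ps[k] < joint[k][k] for k in range(K))


py = [0.5, 0.5]
mild_noise = [[0.8, 0.3],
              [0.2, 0.7]]   # most labels stay correct
coin_flip = [[0.5, 0.5],
             [0.5, 0.5]]    # labels carry no information about the true class

mild_ok = satisfies_learnability_condition(mild_noise, py)
coin_ok = satisfies_learnability_condition(coin_flip, py)
print(mild_ok, coin_ok)
```

In the coin-flip case the noisy label is independent of the true label, so the joint probability equals the product of the marginals and the strict inequality fails.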
- cleanlab.benchmarking.noise_generation.generate_noisy_labels(true_labels, noise_matrix)
Generates noisy labels from perfect labels true_labels, "exactly" yielding the provided noise_matrix between labels and true_labels.
The Examples section below gives a for-loop implementation of what this function does. The library does not use that implementation because it is slow, but as Python pseudocode it explains what happens in this function.
- Parameters:
  true_labels (np.ndarray) – An array of shape (N,) representing perfect labels, without any noise. Contains K distinct classes encoded as the integers 0, 1, …, K-1.
  noise_matrix (np.ndarray) – An array of shape (K, K) representing the conditional probability matrix P(label=k_s|true_label=k_y): entry (k_s, k_y) is the fraction of examples with true class k_y that are labeled k_s. Assumes columns of noise_matrix sum to 1.
- Return type:
  ndarray
- Returns:
  labels (np.ndarray) – An array of shape (N,) of noisy labels.
Examples
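    import numpy as np

    # Pseudocode for what generate_noisy_labels does (not the fast implementation).
    # Assumes true_labels (shape (N,)) and noise_matrix (shape (K, K)) are given.
    K = len(noise_matrix)
    py = np.bincount(true_labels, minlength=K) / len(true_labels)  # p(true_label=k)

    # Target number of examples for each (noisy label k_s, true label k_y) pair
    count_joint = (noise_matrix * py * len(true_labels)).round().astype(int)
    labels = np.array(true_labels)  # start from a copy of the perfect labels
    for k_s in range(K):
        for k_y in range(K):
            if k_s != k_y:
                # Examples of true class k_y that have not been flipped yet
                idx_flip = np.where((labels == k_y) & (true_labels == k_y))[0]
                if len(idx_flip):
                    flipped = np.random.choice(
                        idx_flip, count_joint[k_s][k_y], replace=False
                    )
                    labels[flipped] = k_s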
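One way to see what "exactly yielding" the noise matrix means is to estimate P(label=k_s|true_label=k_y) back from a pair of label arrays by counting. The sketch below does that with plain Python lists. The helper name empirical_noise_matrix and the toy label arrays are assumptions made for illustration and are not part of cleanlab.

```python
def empirical_noise_matrix(labels, true_labels, K):
    """Estimate P(label=k_s | true_label=k_y) by counting label pairs.

    Hypothetical helper for illustration only. Returns a K x K nested list
    indexed as [k_s][k_y], so each column (fixed k_y) sums to 1 whenever
    class k_y appears in true_labels.
    """
    counts = [[0] * K for _ in range(K)]
    for k_s, k_y in zip(labels, true_labels):
        counts[k_s][k_y] += 1
    # Number of examples whose true class is k_y
    col_totals = [sum(counts[k_s][k_y] for k_s in range(K)) for k_y in range(K)]
    return [
        [counts[k_s][k_y] / col_totals[k_y] if col_totals[k_y] else 0.0
         for k_y in range(K)]
        for k_s in range(K)
    ]


# Toy data: 4 examples of class 0 (one flipped to 1), 6 of class 1 (two flipped to 0)
true_labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
labels = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]
estimated = empirical_noise_matrix(labels, true_labels, K=2)
for row in estimated:
    print(row)
```

Comparing such an estimate against the noise_matrix passed to generate_noisy_labels is a simple sanity check when benchmarking.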
- cleanlab.benchmarking.noise_generation.generate_noise_matrix_from_trace(K, trace, *, max_trace_prob=1.0, min_trace_prob=1e-05, max_noise_rate=0.99999, min_noise_rate=0.0, valid_noise_matrix=True, py=None, frac_zero_noise_rates=0.0, seed=0, max_iter=10000)
Generates a K x K noise matrix P(label=k_s|true_label=k_y) with np.sum(np.diagonal(noise_matrix)) equal to the given trace.
- Parameters:
  K (int) – Creates a noise matrix of shape (K, K). Implies there are K classes for learning with noisy labels.
  trace (float) – Sum of the diagonal entries of the returned matrix.
  max_trace_prob (float) – Maximum value of any diagonal entry of the returned matrix.
  min_trace_prob (float) – Minimum value of any diagonal entry of the returned matrix.
  max_noise_rate (float) – Maximum noise rate (off-diagonal entry) in the returned np.ndarray.
  min_noise_rate (float) – Minimum noise rate (off-diagonal entry) in the returned np.ndarray.
  valid_noise_matrix (bool, default True) – If True, returns a matrix satisfying all necessary conditions for learning with noisy labels. In particular, p(true_label=k)p(label=k) < p(true_label=k,label=k) is satisfied. This requires that trace > 1.
  py (np.ndarray) – An array of shape (K,) representing the fraction (prior probability) of each true class label, P(true_label = k). This argument is required when valid_noise_matrix=True.
  frac_zero_noise_rates (float) – The fraction of the n*(n-1) off-diagonal noise rates (with n = K) that will be set to 0. Note that if you set a high trace, it may be impossible to also have a low fraction of zero noise rates without forcing all non-1 diagonal values. When this happens, the function only guarantees a noise matrix whose fraction of zero noise rates is frac_zero_noise_rates or higher. The opposite occurs with a small trace.
  seed (int) – Seeds the random number generator for numpy.
  max_iter (int, default 10000) – The maximum number of tries to produce a valid matrix before returning None.
- Return type:
  Optional[ndarray]
- Returns:
  noise_matrix (np.ndarray or None) – An array of shape (K, K) representing the noise matrix P(label=k_s|true_label=k_y) with trace equal to np.sum(np.diagonal(noise_matrix)). This is a conditional probability matrix and a left stochastic matrix. Returns None if max_iter is exceeded.
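As a rough illustration of what a left stochastic matrix with a prescribed trace looks like, the sketch below draws a diagonal that sums to trace and then spreads the remaining probability mass of each column over its off-diagonal entries. It is a simplified sketch under stated assumptions, not cleanlab's algorithm: the function name sketch_noise_matrix_with_trace is hypothetical, and it ignores the min/max bounds, frac_zero_noise_rates, and the learnability check. It mirrors only the convention of returning None after max_iter failed attempts.

```python
import random


def sketch_noise_matrix_with_trace(K, trace, seed=0, max_iter=10000):
    """Build a K x K column-stochastic matrix whose diagonal sums to `trace`.

    Hypothetical illustration only. Returns a nested list indexed as
    [k_s][k_y], or None if no valid diagonal is found within max_iter tries.
    """
    rng = random.Random(seed)
    for _ in range(max_iter):
        # Random positive weights in (0, 1], rescaled so they sum to `trace`
        weights = [1.0 - rng.random() for _ in range(K)]
        total = sum(weights)
        diagonal = [w / total * trace for w in weights]
        if max(diagonal) <= 1.0:  # every diagonal entry must be a probability
            break
    else:
        return None

    matrix = [[0.0] * K for _ in range(K)]
    for k_y in range(K):
        matrix[k_y][k_y] = diagonal[k_y]
        # Split the remaining mass 1 - diagonal[k_y] over the other rows
        other_rows = [k_s for k_s in range(K) if k_s != k_y]
        off = [1.0 - rng.random() for _ in other_rows]
        off_total = sum(off)
        for k_s, w in zip(other_rows, off):
            matrix[k_s][k_y] = w / off_total * (1.0 - diagonal[k_y])
    return matrix


noise_matrix = sketch_noise_matrix_with_trace(K=3, trace=2.0)
for row in noise_matrix:
    print([round(value, 3) for value in row])
```

A trace near K means almost no label noise, while a trace near 1 means the labels are close to uninformative, which is why the valid_noise_matrix option requires trace > 1.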
- cleanlab.benchmarking.noise_generation.generate_n_rand_probabilities_that_sum_to_m(n, m, *, max_prob=1.0, min_prob=0.0)
Generates n random probabilities that sum to m.
When min_prob=0 and max_prob=1.0, use np.random.dirichlet(np.ones(n))*m instead.
- Parameters:
  n (int) – Length of array of random probabilities to be returned.
  m (float) – Sum of array of random probabilities that is returned.
  max_prob (float, default 1.0) – Maximum probability of any entry in the returned array. Must be between 0 and 1.
  min_prob (float, default 0.0) – Minimum probability of any entry in the returned array. Must be between 0 and 1.
- Return type:
  ndarray
- Returns:
  probabilities (np.ndarray) – An array of probabilities.
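The Dirichlet shortcut mentioned above can be reproduced without NumPy: normalizing independent Exponential(1) draws gives a sample from the flat Dirichlet distribution, which is the distribution np.random.dirichlet(np.ones(n)) samples from. The sketch below scales such a sample by m. The function name flat_dirichlet_scaled is hypothetical, and the sketch covers only the unconstrained case (min_prob=0, max_prob=1.0) with m at most 1.

```python
import random


def flat_dirichlet_scaled(n, m, seed=None):
    """Return n non-negative numbers that sum to m.

    Hypothetical illustration only: normalized Exponential(1) draws follow
    a flat Dirichlet distribution, and scaling by m sets the total.
    """
    rng = random.Random(seed)
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [m * d / total for d in draws]


probs = flat_dirichlet_scaled(n=4, m=1.0, seed=0)
print(probs, sum(probs))
```

When tighter bounds on each entry are needed, a plain Dirichlet draw can violate them, which is the case the min_prob and max_prob parameters exist to handle.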
- cleanlab.benchmarking.noise_generation.randomly_distribute_N_balls_into_K_bins(N, K, *, max_balls_per_bin=None, min_balls_per_bin=None)
Returns a uniformly random numpy integer array of length K that sums to N: one entry per bin, giving the number of balls placed in that bin.
- Parameters:
  N (int) – Number of balls.
  K (int) – Number of bins.
  max_balls_per_bin (int) – Ensure that each bin contains at most max_balls_per_bin balls.
  min_balls_per_bin (int) – Ensure that each bin contains at least min_balls_per_bin balls.
- Return type:
  ndarray
- Returns:
  int_array (np.array) – Length K array that sums to N.
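Following the parameter descriptions (N balls, K bins), the sketch below shows one standard way to draw bin counts uniformly at random: the "stars and bars" construction, which picks K-1 divider positions among N+K-1 slots. It is an illustration under stated assumptions rather than cleanlab's method. The function name balls_into_bins is hypothetical, and the max_balls_per_bin and min_balls_per_bin constraints are not handled.

```python
import random


def balls_into_bins(N, K, seed=None):
    """Return K non-negative integer bin counts that sum to N.

    Hypothetical illustration only. Every possible vector of counts is
    equally likely, because each one corresponds to exactly one choice
    of divider positions.
    """
    rng = random.Random(seed)
    # Choose K-1 distinct divider ("bar") positions among N+K-1 slots
    bars = sorted(rng.sample(range(N + K - 1), K - 1))
    edges = [-1] + bars + [N + K - 1]
    # Balls in bin i = number of non-divider slots between consecutive edges
    return [edges[i + 1] - edges[i] - 1 for i in range(K)]


counts = balls_into_bins(N=10, K=4, seed=0)
print(counts, sum(counts))
```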