noise_generation

Helper methods that are useful for benchmarking cleanlab’s core algorithms. These methods introduce synthetic noise into the labels of a classification dataset. Specifically, this module provides methods for generating valid noise matrices (for which learning with noise is possible), generating noisy labels given a noise matrix, generating valid noise matrices with a specific trace value, and more.

Functions:

noise_matrix_is_valid(noise_matrix, py, *[, ...])

Given a prior py representing p(true_label=k), checks if the given noise_matrix is a learnable matrix.

generate_noisy_labels(true_labels, noise_matrix)

Generates noisy labels from perfect labels true_labels, "exactly" yielding the provided noise_matrix between labels and true_labels.

generate_noise_matrix_from_trace(K, trace, *)

Generates a K x K noise matrix P(label=k_s|true_label=k_y) with np.sum(np.diagonal(noise_matrix)) equal to the given trace.

generate_n_rand_probabilities_that_sum_to_m(n, m, *)

Generates n random probabilities that sum to m.

randomly_distribute_N_balls_into_K_bins(N, K, *)

Returns a uniformly random numpy integer array of length K that sums to N.

cleanlab.benchmarking.noise_generation.noise_matrix_is_valid(noise_matrix, py, *, verbose=False)

Given a prior py representing p(true_label=k), checks if the given noise_matrix is a learnable matrix. Learnability means that it is possible to achieve better than random performance, on average, for the amount of noise in noise_matrix.

Parameters:
  • noise_matrix (np.ndarray) – An array of shape (K, K) representing the conditional probability matrix P(label=k_s|true_label=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.

  • py (np.ndarray) – An array of shape (K,) representing the fraction (prior probability) of each true class label, P(true_label = k).

Return type:

bool

Returns:

is_valid (bool) – Whether the noise matrix is a learnable matrix.
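
As an illustration of the learnability idea, the sketch below checks the condition p(true_label=k)p(label=k) < p(true_label=k,label=k) for every class k using NumPy only. It is a minimal sketch under that single assumption, not the library's implementation; the helper name is_learnable_sketch is hypothetical.

```python
import numpy as np


def is_learnable_sketch(noise_matrix, py):
    """Hypothetical helper: checks p(true=k) * p(label=k) < p(true=k, label=k) for all k."""
    noise_matrix = np.asarray(noise_matrix, dtype=float)
    py = np.asarray(py, dtype=float)
    # joint[s, y] = P(label=s | true_label=y) * P(true_label=y)
    joint = noise_matrix * py
    # Marginal over true labels gives P(label=k)
    ps = joint.sum(axis=1)
    return bool(np.all(py * ps < np.diagonal(joint)))


py = np.array([0.5, 0.5])
# Columns sum to 1: mostly correct labels
mild = np.array([[0.9, 0.2],
                 [0.1, 0.8]])
# Columns sum to 1: labels are mostly swapped
swapped = np.array([[0.2, 0.9],
                    [0.8, 0.1]])

mild_ok = is_learnable_sketch(mild, py)
swapped_ok = is_learnable_sketch(swapped, py)
```

Here the mildly noisy matrix passes the check while the mostly swapped one fails it, since its diagonal mass falls below what independent labels would give.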

cleanlab.benchmarking.noise_generation.generate_noisy_labels(true_labels, noise_matrix)

Generates noisy labels from perfect labels true_labels, “exactly” yielding the provided noise_matrix between labels and true_labels.

The Examples section below gives a for-loop implementation of what this function does. The function itself does not use this implementation because it is slow, but as Python pseudocode it shows what the function is doing.

Parameters:
  • true_labels (np.ndarray) – An array of shape (N,) representing perfect labels, without any noise. Contains K distinct natural number classes, 0, 1, …, K-1.

  • noise_matrix (np.ndarray) – An array of shape (K, K) representing the conditional probability matrix P(label=k_s|true_label=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.

Return type:

ndarray

Returns:

labels (np.ndarray) – An array of shape (N,) of noisy labels.

Examples

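# Generate labels (pseudocode; assumes true_labels is an integer np.ndarray)
import numpy as np

py = np.bincount(true_labels) / len(true_labels)  # assumed prior P(true_label=k)
K = len(py)
# Target number of examples for each (label, true_label) pair
count_joint = (noise_matrix * py * len(true_labels)).round().astype(int)
labels = np.array(true_labels)  # copy, so true_labels stays untouched
for k_s in range(K):
    for k_y in range(K):
        if k_s != k_y:
            # Examples of true class k_y that have not been flipped yet
            idx_flip = np.where((labels == k_y) & (true_labels == k_y))[0]
            if len(idx_flip):
                labels[np.random.choice(
                    idx_flip,
                    count_joint[k_s][k_y],
                    replace=False,
                )] = k_s
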
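
To see how noisy labels relate back to a noise matrix, the sketch below draws each noisy label independently from the column of noise_matrix that belongs to its true class, then estimates the noise matrix from the resulting label pairs. Independent sampling is a simpler technique than the count-matching described above, so it only approximates the given matrix. The helper names sample_noisy_labels and empirical_noise_matrix are hypothetical and not part of cleanlab.

```python
import numpy as np


def sample_noisy_labels(true_labels, noise_matrix, rng):
    """Hypothetical helper: samples each noisy label from P(label | true_label)."""
    true_labels = np.asarray(true_labels)
    K = noise_matrix.shape[0]
    labels = np.empty_like(true_labels)
    for k_y in range(K):
        idx = np.where(true_labels == k_y)[0]
        # Column k_y of the noise matrix is the distribution P(label | true_label=k_y)
        labels[idx] = rng.choice(K, size=len(idx), p=noise_matrix[:, k_y])
    return labels


def empirical_noise_matrix(true_labels, labels, K):
    """Hypothetical helper: estimates P(label=k_s | true_label=k_y) from label pairs."""
    counts = np.zeros((K, K), dtype=float)
    np.add.at(counts, (labels, true_labels), 1)
    # Normalize each column (assumes every true class appears at least once)
    return counts / counts.sum(axis=0, keepdims=True)


rng = np.random.default_rng(0)
noise_matrix = np.array([[0.8, 0.1, 0.0],
                         [0.1, 0.9, 0.3],
                         [0.1, 0.0, 0.7]])  # columns sum to 1
true_labels = np.repeat(np.arange(3), 10000)
labels = sample_noisy_labels(true_labels, noise_matrix, rng)
estimated = empirical_noise_matrix(true_labels, labels, K=3)
```

With enough examples per class, estimated should be close to noise_matrix, and each of its columns sums to 1.
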
cleanlab.benchmarking.noise_generation.generate_noise_matrix_from_trace(K, trace, *, max_trace_prob=1.0, min_trace_prob=1e-05, max_noise_rate=0.99999, min_noise_rate=0.0, valid_noise_matrix=True, py=None, frac_zero_noise_rates=0.0, seed=0, max_iter=10000)

Generates a K x K noise matrix P(label=k_s|true_label=k_y) with np.sum(np.diagonal(noise_matrix)) equal to the given trace.

Parameters:
  • K (int) – Creates a noise matrix of shape (K, K). Implies there are K classes for learning with noisy labels.

  • trace (float) – Sum of the diagonal entries of the returned noise matrix.

  • max_trace_prob (float) – Maximum probability of any diagonal entry of the returned matrix.

  • min_trace_prob (float) – Minimum probability of any diagonal entry of the returned matrix.

  • max_noise_rate (float) – Maximum noise_rate (non-diagonal entry) in the returned np.ndarray.

  • min_noise_rate (float) – Minimum noise_rate (non-diagonal entry) in the returned np.ndarray.

  • valid_noise_matrix (bool, default True) – If True, returns a matrix having all necessary conditions for learning with noisy labels. In particular, p(true_label=k)p(label=k) < p(true_label=k,label=k) is satisfied. This requires that trace > 1.

  • py (np.ndarray) – An array of shape (K,) representing the fraction (prior probability) of each true class label, P(true_label = k). This argument is required when valid_noise_matrix=True.

  • frac_zero_noise_rates (float) – The fraction of the K*(K-1) noise rates (off-diagonal entries) that will be set to 0. Note that if you set a high trace, it may be impossible to also have a low fraction of zero noise rates without forcing all non-1 diagonal values. When this happens, the only guarantee is a noise matrix whose fraction of zero noise rates is frac_zero_noise_rates or higher. The opposite occurs with a small trace.

  • seed (int) – Seeds the random number generator for numpy.

  • max_iter (int, default 10000) – The max number of tries to produce a valid matrix before returning None.

Return type:

Optional[ndarray]

Returns:

noise_matrix (np.ndarray or None) – An array of shape (K, K) representing the noise matrix P(label=k_s|true_label=k_y), with np.sum(np.diagonal(noise_matrix)) equal to trace. This is a conditional probability matrix and a left stochastic matrix (each column sums to 1). Returns None if max_iter is exceeded.
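
To make the two properties of the return value concrete (a fixed trace and columns that sum to 1), the sketch below builds a matrix with both properties in the simplest way possible: every diagonal entry is trace / K, and the remaining mass in each column is split at random. This is not the library's algorithm and it ignores the other parameters; the function name simple_noise_matrix_from_trace is hypothetical.

```python
import numpy as np


def simple_noise_matrix_from_trace(K, trace, seed=0):
    """Hypothetical helper: left stochastic K x K matrix with a constant diagonal."""
    if K < 2 or not 0 <= trace <= K:
        raise ValueError("need K >= 2 and 0 <= trace <= K")
    rng = np.random.default_rng(seed)
    diag = trace / K  # assumption: same value on every diagonal entry
    noise_matrix = np.zeros((K, K))
    for k_y in range(K):
        # Split the leftover column mass (1 - diag) across the K - 1 off-diagonal entries
        off_diagonal = rng.dirichlet(np.ones(K - 1)) * (1.0 - diag)
        noise_matrix[np.arange(K) != k_y, k_y] = off_diagonal
        noise_matrix[k_y, k_y] = diag
    return noise_matrix


nm = simple_noise_matrix_from_trace(K=4, trace=3.0, seed=0)
trace_value = np.sum(np.diagonal(nm))
column_sums = nm.sum(axis=0)
```

The same two checks (trace_value against the requested trace, column_sums against 1) can be applied to any matrix returned by the function documented above.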

cleanlab.benchmarking.noise_generation.generate_n_rand_probabilities_that_sum_to_m(n, m, *, max_prob=1.0, min_prob=0.0)

Generates n random probabilities that sum to m.

When min_prob=0 and max_prob = 1.0, use np.random.dirichlet(np.ones(n))*m instead.

Parameters:
  • n (int) – Length of array of random probabilities to be returned.

  • m (float) – Sum of array of random probabilities that is returned.

  • max_prob (float, default 1.0) – Maximum probability of any entry in the returned array. Must be between 0 and 1.

  • min_prob (float, default 0.0) – Minimum probability of any entry in the returned array. Must be between 0 and 1.

Return type:

ndarray

Returns:

probabilities (np.ndarray) – An array of probabilities.
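
The Dirichlet shortcut mentioned above can be sketched as follows. The example uses NumPy's Generator API (np.random.default_rng) rather than the legacy np.random.dirichlet call; the variable names and the chosen values of n and m are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 0.7

# A flat Dirichlet sample sums to 1, so scaling it by m gives n values that sum to m
probabilities = rng.dirichlet(np.ones(n)) * m
total = probabilities.sum()
```

For m <= 1 every entry stays within [0, 1], because each Dirichlet component is itself between 0 and 1.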

cleanlab.benchmarking.noise_generation.randomly_distribute_N_balls_into_K_bins(N, K, *, max_balls_per_bin=None, min_balls_per_bin=None)

Returns a uniformly random numpy integer array of length K that sums to N.

Parameters:
  • N (int) – Number of balls.

  • K (int) – Number of bins.

  • max_balls_per_bin (int) – Ensure that each bin contains at most max_balls_per_bin balls.

  • min_balls_per_bin (int) – Ensure that each bin contains at least min_balls_per_bin balls.

Return type:

ndarray

Returns:

int_array (np.ndarray) – Length K array that sums to N.
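
Distributing N balls into K bins yields one count per bin, so the counts form an integer array with K entries whose sum is N. The sketch below shows that shape using multinomial sampling with equal bin probabilities. This is a different technique from the function above: it does not enforce max_balls_per_bin or min_balls_per_bin, and its distribution over count vectors is not claimed to match the library's.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 20, 4  # 20 balls, 4 bins (illustrative values)

# Each ball lands in one of the K bins with equal probability
counts = rng.multinomial(N, np.full(K, 1.0 / K))
total_balls = counts.sum()
```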