util

Ancillary helper methods used internally throughout this package; mostly related to Confident Learning algorithms.

Functions:

append_extra_datapoint(to_data, from_data, index)

Appends an extra datapoint to the data object to_data.

clip_noise_rates(noise_matrix)

Clip all noise rates to proper range [0,1), but do not modify the diagonal terms because they are not noise rates.

clip_values(x[, low, high, new_sum])

Clip all values in x to range [low,high].

compress_int_array(int_array, ...)

Compresses dtype of np.ndarray<int> if num_possible_values is small enough.

confusion_matrix(true, pred)

Implements a confusion matrix for true labels and predicted labels.

csr_vstack(a, b)

Takes in 2 csr_matrices and appends the second one to the bottom of the first one.

estimate_pu_f1(s, prob_s_eq_1)

Computes Claesen's estimate of f1 in the pulearning setting.

extract_indices_tf(X, idx, allow_shuffle)

Extracts subset of tensorflow dataset corresponding to examples at particular indices.

get_num_classes([labels, pred_probs, ...])

Determines the number of classes based on information considered in a canonical ordering.

int2onehot(labels)

Convert list of lists to a onehot matrix for multi-labels

is_tensorflow_dataset(X)

Checks whether X is a TensorFlow dataset (returns bool).

is_torch_dataset(X)

Checks whether X is a PyTorch dataset (returns bool).

num_unique_classes(labels[, multi_label])

Finds the number of unique classes for both single-labeled and multi-labeled labels.

onehot2int(onehot_matrix)

Convert a onehot matrix for multi-labels to a list of lists of ints

print_inverse_noise_matrix(inverse_noise_matrix)

Pretty prints the inverse noise matrix.

print_joint_matrix(joint_matrix[, round_places])

Pretty prints the joint label noise matrix.

print_noise_matrix(noise_matrix[, round_places])

Pretty prints the noise matrix.

print_square_matrix(matrix[, left_name, ...])

Pretty prints a matrix.

remove_noise_from_class(noise_matrix, ...)

A helper function in the setting of PU learning.

round_preserving_row_totals(confident_joint)

Rounds confident_joint cj to type int while preserving the totals of each row.

round_preserving_sum(iterable)

Rounds an iterable of floats while retaining the original summed value.

smart_display_dataframe(df)

Display a pandas dataframe if in a jupyter notebook, otherwise print it to console.

subset_X_y(X, labels, mask)

Extracts subset of features/labels where mask is True

subset_data(X, mask)

Extracts subset of data examples where mask (np.ndarray) is True

subset_labels(labels, mask)

Extracts subset of labels where mask is True

train_val_split(X, labels, train_idx, ...)

Splits data into training/validation sets based on given indices

unshuffle_tensorflow_dataset(X)

Applies iterative inverse transformations to dataset to get version before ShuffleDataset was created.

value_counts(x)

Returns an np.ndarray of shape (K, 1), with the value counts for every unique item in the labels list/array, where K is the number of unique entries in labels.

cleanlab.internal.util.append_extra_datapoint(to_data, from_data, index)

Appends an extra datapoint to the data object to_data. This datapoint is taken from the data object from_data at the corresponding index. One place this could be useful is ensuring no missing classes after train/validation split.

Return type:

Any

cleanlab.internal.util.clip_noise_rates(noise_matrix)

Clip all noise rates to proper range [0,1), but do not modify the diagonal terms because they are not noise rates.

ASSUMES noise_matrix columns sum to 1.

Parameters:

noise_matrix (np.ndarray of shape (K, K), K = number of classes) – A conditional probability matrix containing the fraction of examples in every class, labeled as every other class. Diagonal terms are not noise rates, but are consistency P(label=k|true_label=k). Assumes columns of noise_matrix sum to 1.

Return type:

ndarray
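
As a rough illustration of the clipping described above, the following sketch clips only the off-diagonal entries and then rescales each column so that it sums to 1 again. The name clip_noise_rates_sketch, the upper bound 0.9999 standing in for the open end of [0,1), and the column renormalization step are assumptions made for this example, not a statement of how the library implements it.

```python
import numpy as np

def clip_noise_rates_sketch(noise_matrix, upper=0.9999):
    """Hypothetical helper: clip off-diagonal entries, keep the diagonal."""
    noise_matrix = np.asarray(noise_matrix, dtype=float)
    diagonal = np.diagonal(noise_matrix).copy()
    # Clip every entry, then restore the diagonal, which holds
    # P(label=k|true_label=k) rather than a noise rate.
    clipped = np.clip(noise_matrix, 0.0, upper)
    np.fill_diagonal(clipped, diagonal)
    # Assumed post-processing: rescale columns so each sums to 1 again.
    return clipped / clipped.sum(axis=0)

noisy = np.array([[0.9, -0.1],
                  [0.1, 1.1]])
result = clip_noise_rates_sketch(noisy)
```

In this example the negative off-diagonal entry becomes 0 and the second column is rescaled so that it sums to 1.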

cleanlab.internal.util.clip_values(x, low=0.0, high=1.0, new_sum=None)

Clip all values in x to range [low,high]. Preserves sum of x.

Parameters:
  • x (np.ndarray) – An array / list of values to be clipped.

  • low (float) – values in x less than ‘low’ are clipped to this value

  • high (float) – values in x greater than ‘high’ are clipped to this value

  • new_sum (float) – normalizes x after clipping to sum to new_sum

Return type:

ndarray

Returns:

x (np.ndarray) – A list of clipped values, summing to the same sum as x.
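
The clip-then-rescale behavior described above can be sketched as follows. This is a minimal illustration under the assumption that the sum is restored by multiplying the clipped values by a constant factor; clip_values_sketch is a hypothetical name, not the library function.

```python
import numpy as np

def clip_values_sketch(x, low=0.0, high=1.0, new_sum=None):
    """Hypothetical helper: clip x to [low, high], then rescale its sum."""
    x = np.asarray(x, dtype=float)
    # Keep the original sum unless the caller asks for a different one.
    target_sum = x.sum() if new_sum is None else new_sum
    clipped = np.clip(x, low, high)
    # Assumed rescaling step: a constant factor restores the target sum.
    return clipped * target_sum / clipped.sum()

values = clip_values_sketch([-0.5, 0.5, 1.0])
```

Note that under this assumption the rescaling can move values outside [low,high] again when the clipped sum differs greatly from the target sum.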

cleanlab.internal.util.compress_int_array(int_array, num_possible_values)

Compresses dtype of np.ndarray<int> if num_possible_values is small enough.

Return type:

ndarray
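
One way to picture this kind of dtype compression is shown below. The thresholds (the maxima of int16 and int32) and the name compress_int_array_sketch are assumptions for illustration; the source does not state which dtypes or cutoffs are used.

```python
import numpy as np

def compress_int_array_sketch(int_array, num_possible_values):
    """Hypothetical helper: pick the smallest integer dtype that fits."""
    int_array = np.asarray(int_array)
    # Assumed candidate dtypes, tried from smallest to largest.
    for candidate in (np.int16, np.int32):
        if num_possible_values < np.iinfo(candidate).max:
            return int_array.astype(candidate)
    # Too many possible values: leave the array unchanged.
    return int_array

labels = np.array([0, 3, 2, 1], dtype=np.int64)
small = compress_int_array_sketch(labels, num_possible_values=4)
unchanged = compress_int_array_sketch(labels, num_possible_values=2**40)
```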

cleanlab.internal.util.confusion_matrix(true, pred)

Implements a confusion matrix for true labels and predicted labels. true and pred MUST BE the same length and have the same distinct set of class labels represented.

Results are identical to those of sklearn.metrics.confusion_matrix (with similar computation time). However, this function avoids the dependency on sklearn.

Parameters:
  • true (np.ndarray 1d) – Contains labels. Assumes true and pred contain the same set of distinct labels.

  • pred (np.ndarray 1d) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirements: for dataset with K classes, labels must be in {0,1,…,K-1}.

Return type:

ndarray

Returns:

confusion_matrix (np.ndarray (2D)) – matrix of confusion counts with true on rows and pred on columns.
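
A confusion matrix with true labels on rows and predicted labels on columns can be computed without sklearn along these lines. This is a sketch, not the library's code: confusion_matrix_sketch is a hypothetical name, and taking the class set from the union of both inputs is an assumption.

```python
import numpy as np

def confusion_matrix_sketch(true, pred):
    """Hypothetical helper: count (true, pred) label pairs."""
    true = np.asarray(true)
    pred = np.asarray(pred)
    assert len(true) == len(pred), "true and pred must be the same length"
    # Sorted distinct labels define the row/column order.
    classes = np.unique(np.concatenate([true, pred]))
    true_idx = np.searchsorted(classes, true)
    pred_idx = np.searchsorted(classes, pred)
    counts = np.zeros((len(classes), len(classes)), dtype=int)
    # Unbuffered add so repeated (row, column) pairs are all counted.
    np.add.at(counts, (true_idx, pred_idx), 1)
    return counts

cm = confusion_matrix_sketch([0, 0, 1, 1, 2], [0, 1, 1, 1, 2])
```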

cleanlab.internal.util.csr_vstack(a, b)

Takes in 2 csr_matrices and appends the second one to the bottom of the first one. Alternative to scipy.sparse.vstack. Returns a sparse matrix.

Return type:

Any

cleanlab.internal.util.estimate_pu_f1(s, prob_s_eq_1)

Computes Claesen’s estimate of f1 in the pulearning setting.

Parameters:
  • s (iterable (list or np.ndarray)) – Binary label (whether each element is labeled or not) in pu learning.

  • prob_s_eq_1 (iterable (list or np.ndarray)) – The probability, for each example, that it has label=1, i.e. P(label=1|x).

Return type:

float

Returns:

Claesen’s estimate for f1 in the pulearning setting.

cleanlab.internal.util.extract_indices_tf(X, idx, allow_shuffle)

Extracts subset of tensorflow dataset corresponding to examples at particular indices.

Parameters:
  • X (tensorflow.data.Dataset) – The dataset to extract examples from.

  • idx (array_like) – Integer indices corresponding to examples to keep in the dataset.

  • allow_shuffle (bool) – Whether or not shuffling of this data is allowed (e.g. must turn off shuffling for validation data).

Note: this code only works on Datasets in which:
  • shuffle() has been called before batch(),

  • no other order-destroying operation (e.g. repeat()) has been applied.

Indices are extracted from the original version of Dataset (before shuffle was called rather than in shuffled order).

Return type:

Any

Returns:

Subset of examples in the dataset X that correspond to these indices.

cleanlab.internal.util.get_num_classes(labels=None, pred_probs=None, label_matrix=None, multi_label=None)

Determines the number of classes based on information considered in a canonical ordering. label_matrix can be: noise_matrix, inverse_noise_matrix, confident_joint, or any other K x K matrix where K = number of classes.

Return type:

int

cleanlab.internal.util.int2onehot(labels)

Convert list of lists to a onehot matrix for multi-labels.

Parameters:

labels (list of lists of integers) – e.g. [[0,1], [3], [1,2,3], [1], [2]]. All integers from 0,1,…,K-1 must be represented.

Return type:

ndarray
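
The conversion can be pictured with the example labels from the parameter description. The sketch below is an illustration only: int2onehot_sketch is a hypothetical name, and inferring K from the largest class index is an assumption.

```python
import numpy as np

def int2onehot_sketch(labels):
    """Hypothetical helper: list of lists of class ints -> 0/1 matrix."""
    # Assumes every class 0..K-1 appears, so K is the largest index plus one.
    num_classes = max(k for row in labels for k in row) + 1
    onehot = np.zeros((len(labels), num_classes), dtype=int)
    for i, row in enumerate(labels):
        onehot[i, row] = 1  # set every class listed for example i
    return onehot

onehot = int2onehot_sketch([[0, 1], [3], [1, 2, 3], [1], [2]])
```

Each row of the result corresponds to one example and each column to one class.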

cleanlab.internal.util.is_tensorflow_dataset(X)

Return type:

bool

cleanlab.internal.util.is_torch_dataset(X)

Return type:

bool

cleanlab.internal.util.num_unique_classes(labels, multi_label=None)

Finds the number of unique classes for both single-labeled and multi-labeled labels. If multi_label is set to None (default) this method will infer if multi_label is True or False based on the format of labels. This allows for a more general form of multiclass labels that looks like this: [1, [1,2], [0], [0, 1], 2, 1]

Return type:

int
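
Counting classes in the mixed format shown above might look like the following sketch. It covers only the inference case where each entry is either a single class or a list of classes; num_unique_classes_sketch is a hypothetical name and the explicit multi_label argument is not modeled.

```python
def num_unique_classes_sketch(labels):
    """Hypothetical helper: count distinct classes in mixed-format labels."""
    unique = set()
    for entry in labels:
        if isinstance(entry, (list, tuple)):
            unique.update(entry)  # multi-label entry: several classes
        else:
            unique.add(entry)     # single-label entry: one class
    return len(unique)

count = num_unique_classes_sketch([1, [1, 2], [0], [0, 1], 2, 1])
```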

cleanlab.internal.util.onehot2int(onehot_matrix)

Convert a onehot matrix for multi-labels to a list of lists of ints.

Parameters:

onehot_matrix (2D np.ndarray of 0s and 1s) – A one hot encoded matrix representation of multi-labels.

Return type:

list

Returns:

labels (list of lists of integers) – e.g. [[0,1], [3], [1,2,3], [1], [2]]. All integers from 0,1,…,K-1 must be represented.
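
The reverse conversion can be sketched by collecting the column indices of the nonzero entries in each row. The name onehot2int_sketch is hypothetical and this is not necessarily how the library does it.

```python
import numpy as np

def onehot2int_sketch(onehot_matrix):
    """Hypothetical helper: 0/1 matrix -> list of lists of class ints."""
    onehot_matrix = np.asarray(onehot_matrix)
    # For each example, keep the indices of the classes marked with 1.
    return [np.flatnonzero(row).tolist() for row in onehot_matrix]

labels = onehot2int_sketch([[1, 1, 0, 0],
                            [0, 0, 0, 1],
                            [0, 1, 1, 1],
                            [0, 1, 0, 0],
                            [0, 0, 1, 0]])
```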

cleanlab.internal.util.print_inverse_noise_matrix(inverse_noise_matrix, round_places=2)

Pretty prints the inverse noise matrix.

cleanlab.internal.util.print_joint_matrix(joint_matrix, round_places=2)

Pretty prints the joint label noise matrix.

cleanlab.internal.util.print_noise_matrix(noise_matrix, round_places=2)

Pretty prints the noise matrix.

cleanlab.internal.util.print_square_matrix(matrix, left_name='s', top_name='y', title=' A square matrix', short_title='s,y', round_places=2)

Pretty prints a matrix.

Parameters:
  • matrix (np.ndarray) – the matrix to be printed

  • left_name (str) – the name of the variable on the left of the matrix

  • top_name (str) – the name of the variable on the top of the matrix

  • title (str) – Prints this string above the printed square matrix.

  • short_title (str) – A short title (6 characters or fewer) like P(labels|y) or P(labels,y).

  • round_places (int) – Number of decimals to show for each matrix value.

cleanlab.internal.util.remove_noise_from_class(noise_matrix, class_without_noise)

A helper function in the setting of PU learning. Sets all P(label=class_without_noise|true_label=any_other_class) = 0 in noise_matrix for the pulearning setting, where we have generalized the positive class in PU learning to be any class of one's choosing, denoted by class_without_noise.

Parameters:
  • noise_matrix (np.ndarray of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(label=k_s|true_label=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.

  • class_without_noise (int) – Integer value of the class that has no noise. Traditionally, this is 1 (positive) for PU learning.

Return type:

ndarray
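
To make the indexing concrete, the sketch below zeroes the row of class_without_noise everywhere except on the diagonal, treating rows as given labels and columns as true labels. The name remove_noise_from_class_sketch is hypothetical, and the final step, which moves the removed probability mass onto the diagonal so that columns still sum to 1, is an assumption of this example.

```python
import numpy as np

def remove_noise_from_class_sketch(noise_matrix, class_without_noise):
    """Hypothetical helper: zero P(label=cwn|true_label=k) for all k != cwn."""
    cleaned = np.array(noise_matrix, dtype=float)  # work on a copy
    cwn = class_without_noise
    other = np.arange(len(cleaned)) != cwn
    # Rows index the given label, columns index the true label.
    cleaned[cwn, other] = 0.0
    # Assumed renormalization: give the removed mass to the diagonal term.
    for k in np.flatnonzero(other):
        off_diagonal = cleaned[:, k].sum() - cleaned[k, k]
        cleaned[k, k] = 1.0 - off_diagonal
    return cleaned

noise_matrix = np.array([[0.8, 0.3],
                         [0.2, 0.7]])
cleaned = remove_noise_from_class_sketch(noise_matrix, class_without_noise=1)
```

Here the entry P(label=1|true_label=0) = 0.2 is set to 0 and the first column's diagonal absorbs it.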

cleanlab.internal.util.round_preserving_row_totals(confident_joint)

Rounds confident_joint cj to type int while preserving the totals of each row. Assumes that cj is a 2D np.ndarray of type float.

Parameters:

confident_joint (2D np.ndarray of shape (K, K)) – See compute_confident_joint docstring for details.

Return type:

ndarray

Returns:

confident_joint (2D np.ndarray of shape (K,K)) – Rounded to int while preserving row totals.

cleanlab.internal.util.round_preserving_sum(iterable)

Rounds an iterable of floats while retaining the original summed value.

The while loop in this code was adapted from: https://github.com/cgdeboer/iteround

Parameters:

iterable (list or np.ndarray) – An iterable of floats

Return type:

ndarray

Returns:

list or np.ndarray – The iterable rounded to int, preserving sum.
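
Sum-preserving rounding can be illustrated with largest-remainder rounding: floor every value, then hand the leftover units to the entries with the largest fractional parts. This is a swapped-in technique for illustration; the library's loop was adapted from iteround and may break ties differently. Both function names below are hypothetical, and the second one simply applies the same idea to each row of a matrix.

```python
import math

def round_preserving_sum_sketch(values):
    """Hypothetical helper: largest-remainder rounding of a list of floats."""
    floors = [math.floor(v) for v in values]
    # Units still missing after flooring (assumes the total is near an integer).
    shortfall = int(round(sum(values))) - sum(floors)
    # Indices ordered by descending fractional part; ties keep input order.
    order = sorted(range(len(values)),
                   key=lambda i: values[i] - floors[i],
                   reverse=True)
    for i in order[:shortfall]:
        floors[i] += 1
    return floors

def round_preserving_row_totals_sketch(matrix):
    """Hypothetical helper: apply the rounding above to each row."""
    return [round_preserving_sum_sketch(row) for row in matrix]

rounded = round_preserving_sum_sketch([0.5, 1.25, 2.25])
rows = round_preserving_row_totals_sketch([[0.5, 1.25, 2.25],
                                           [1.75, 0.25, 3.0]])
```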

cleanlab.internal.util.smart_display_dataframe(df)

Display a pandas dataframe if in a jupyter notebook, otherwise print it to console.

cleanlab.internal.util.subset_X_y(X, labels, mask)

Extracts subset of features/labels where mask is True

Return type:

Tuple[Any, Union[list, ndarray, Series, DataFrame]]

cleanlab.internal.util.subset_data(X, mask)

Extracts subset of data examples where mask (np.ndarray) is True

Return type:

Any

cleanlab.internal.util.subset_labels(labels, mask)

Extracts subset of labels where mask is True

Return type:

Union[list, ndarray, Series]
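
Mask-based subsetting for the list and ndarray cases can be sketched as below. The name subset_labels_sketch is hypothetical, and the pandas Series case listed in the return type is deliberately left out of this example.

```python
import numpy as np

def subset_labels_sketch(labels, mask):
    """Hypothetical helper: keep the labels where mask is True."""
    if isinstance(labels, list):
        # Plain lists do not support boolean indexing, so filter manually.
        return [label for label, keep in zip(labels, mask) if keep]
    return np.asarray(labels)[np.asarray(mask, dtype=bool)]

mask = np.array([True, False, True, True])
kept_list = subset_labels_sketch([0, 1, 2, 1], mask)
kept_array = subset_labels_sketch(np.array([0, 1, 2, 1]), mask)
```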

cleanlab.internal.util.train_val_split(X, labels, train_idx, holdout_idx)

Splits data into training/validation sets based on given indices

Return type:

Tuple[Any, Any, Union[list, ndarray, Series, DataFrame], Union[list, ndarray, Series, DataFrame]]

cleanlab.internal.util.unshuffle_tensorflow_dataset(X)

Applies iterative inverse transformations to dataset to get version before ShuffleDataset was created. If no ShuffleDataset is in the transformation-history of this dataset, returns None.

Parameters:

X (tensorflow Dataset) – A dataset that may have been created via a series of transformations, one being shuffle.

Return type:

tuple

Returns:

Tuple (pre_X, buffer_size) where:
  • pre_X – Dataset that was previously transformed to get ShuffleDataset (or None).

  • buffer_size (int) – buffer_size previously used in ShuffleDataset, or len(pre_X) if buffer_size cannot be determined, or None if no ShuffleDataset found.

cleanlab.internal.util.value_counts(x)

Returns an np.ndarray of shape (K, 1), with the value counts for every unique item in the labels list/array, where K is the number of unique entries in labels.

Why does this matter? Here is an example:

x = [np.random.randint(0,100) for i in range(100000)]
%timeit np.bincount(x)
# Result: 100 loops, best of 3: 3.9 ms per loop
%timeit np.unique(x, return_counts=True)[1]
# Result: 100 loops, best of 3: 7.47 ms per loop

Parameters:

x (list or np.ndarray (one dimensional)) – A list of discrete objects, like lists or strings, for example, class labels ‘y’ when training a classifier, e.g. ["dog","dog","cat"] or [1,2,0,1,1,0,2]

Return type:

Any
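
The (K, 1) output shape described above can be reproduced with np.unique, as in this sketch. The name value_counts_sketch is hypothetical, and ordering the counts by sorted unique value is a property of np.unique rather than something the source states.

```python
import numpy as np

def value_counts_sketch(x):
    """Hypothetical helper: counts per unique item as a (K, 1) column."""
    # np.unique returns the sorted unique items and their counts.
    _, counts = np.unique(x, return_counts=True)
    return counts.reshape(-1, 1)

counts = value_counts_sketch(["dog", "dog", "cat"])
```

With the example input, the sorted unique items are "cat" and "dog", so the column holds their counts in that order.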