# util#

Functions:

- `assert_inputs_are_valid(X, s[, pred_probs, ...])` – Checks that X, labels, and pred_probs are correctly formatted.
- `clip_noise_rates(noise_matrix)` – Clip all noise rates to proper range [0,1), but do not modify the diagonal terms because they are not noise rates.
- `clip_values(x[, low, high, new_sum])` – Clip all values in x to range [low,high].
- `compress_int_array(int_array, ...)` – Compresses dtype of np.array if num_possible_values is small enough.
- `confusion_matrix(true, pred)` – Implements a confusion matrix for true labels and predicted labels.
- `estimate_pu_f1(s, prob_s_eq_1)` – Computes Claesen's estimate of f1 in the PU-learning setting.
- `int2onehot(labels)` – Convert a list of lists to a onehot matrix for multi-labels.
- `onehot2int(onehot_matrix)` – Convert a onehot matrix for multi-labels to a list of lists of ints.
- `print_inverse_noise_matrix(inverse_noise_matrix)` – Pretty prints the inverse noise matrix.
- `print_joint_matrix(joint_matrix[, round_places])` – Pretty prints the joint label noise matrix.
- `print_noise_matrix(noise_matrix[, round_places])` – Pretty prints the noise matrix.
- `print_square_matrix(matrix[, left_name, ...])` – Pretty prints a matrix.
- `remove_noise_from_class(noise_matrix, ...)` – A helper function in the setting of PU learning.
- `round_preserving_row_totals(confident_joint)` – Rounds confident_joint to type int while preserving the total of each row.
- `round_preserving_sum(iterable)` – Rounds an iterable of floats while retaining the original summed value.
- `smart_display_dataframe(df)` – Display a pandas dataframe if in a jupyter notebook, otherwise print it to console.
- `value_counts(x)` – Returns an np.array of shape (K, 1), with the value counts for every unique item in x, where K is the number of unique entries.
cleanlab.internal.util.assert_inputs_are_valid(X, s, pred_probs=None, allow_empty_X=False)[source]#

Checks that X, labels, and pred_probs are correctly formatted.

cleanlab.internal.util.clip_noise_rates(noise_matrix)[source]#

Clip all noise rates to proper range [0,1), but do not modify the diagonal terms because they are not noise rates.

ASSUMES noise_matrix columns sum to 1.

Parameters

noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix containing the fraction of examples in every class, labeled as every other class. Diagonal terms are not noise rates; they are the consistency rates P(label=k|true_label=k). Assumes columns of noise_matrix sum to 1.
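As an illustration (a sketch, not the library's actual implementation), clipping only the off-diagonal entries can be done with a boolean mask; the upper clip bound of 0.9999 is an assumption to stay inside [0,1):

```python
import numpy as np

def clip_noise_rates_sketch(noise_matrix, high=0.9999):
    # Clip only the off-diagonal entries (the noise rates) into [0, 1);
    # diagonal consistency terms P(label=k|true_label=k) are left alone.
    out = noise_matrix.astype(float).copy()
    mask = ~np.eye(out.shape[0], dtype=bool)
    out[mask] = np.clip(out[mask], 0.0, high)
    return out
```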

cleanlab.internal.util.clip_values(x, low=0.0, high=1.0, new_sum=None)[source]#

Clip all values in x to range [low,high]. Preserves sum of x.

Parameters
• x (np.array) – An array / list of values to be clipped.

• low (float) – values in x less than ‘low’ are clipped to this value

• high (float) – values in x greater than ‘high’ are clipped to this value

• new_sum (float) – normalizes x after clipping to sum to new_sum

Returns

x – A list of clipped values, summing to the same sum as x (or to new_sum if it was provided).

Return type

np.array
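A minimal sketch of this behavior, assuming a single clip-and-rescale pass (the library's version may restore the target sum differently, and a single rescale can nudge values slightly past the bounds again):

```python
import numpy as np

def clip_values_sketch(x, low=0.0, high=1.0, new_sum=None):
    # Clip into [low, high], then rescale so the result sums to
    # new_sum (or to the original sum of x when new_sum is None).
    x = np.asarray(x, dtype=float)
    target = x.sum() if new_sum is None else new_sum
    clipped = np.clip(x, low, high)
    return clipped * (target / clipped.sum())
```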

cleanlab.internal.util.compress_int_array(int_array, num_possible_values)[source]#

Compresses dtype of np.array<int> if num_possible_values is small enough.
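The idea can be sketched as follows; the specific dtypes chosen here (unsigned integers) are an assumption, and the library's actual dtype ladder may differ:

```python
import numpy as np

def compress_int_array_sketch(int_array, num_possible_values):
    # Downcast to the smallest unsigned integer dtype that can
    # represent values 0 .. num_possible_values - 1.
    for dtype in (np.uint8, np.uint16, np.uint32):
        if num_possible_values <= np.iinfo(dtype).max + 1:
            return int_array.astype(dtype)
    return int_array  # nothing smaller fits; keep the original dtype
```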

cleanlab.internal.util.confusion_matrix(true, pred)[source]#

Implements a confusion matrix for true labels and predicted labels. true and pred MUST BE the same length and have the same distinct set of class labels represented.

Results are identical (with similar computation time) to `sklearn.metrics.confusion_matrix`. However, this function avoids the dependency on sklearn.

Parameters
• true (np.array 1d) – Contains labels. Assumes true and pred contain the same set of distinct labels.

• pred (np.array 1d) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirements: for dataset with K classes, labels must be in {0,1,…,K-1}.

Returns

confusion_matrix – matrix of confusion counts with true on rows and pred on columns.

Return type

np.array (2D)
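A self-contained sketch of such a confusion matrix, with true labels on rows and predicted labels on columns (illustration only, not the library's code):

```python
import numpy as np

def confusion_matrix_sketch(true, pred):
    # Count co-occurrences: true label selects the row, predicted
    # label selects the column. Both vectors must cover the same
    # set of classes for the rows/columns to line up.
    labels = np.unique(np.concatenate([true, pred]))
    index = {label: i for i, label in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(true, pred):
        cm[index[t], index[p]] += 1
    return cm
```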

cleanlab.internal.util.estimate_pu_f1(s, prob_s_eq_1)[source]#

Computes Claesen’s estimate of f1 in the PU-learning setting.

Parameters
• s (iterable (list or np.array)) – Binary label (whether each element is labeled or not) in pu learning.

• prob_s_eq_1 (iterable (list or np.array)) – The probability, for each example, whether it has label=1 P(label=1|x)

Returns

estimate – Claesen’s estimate for f1 in the PU-learning setting.

Return type

float
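One common form of this estimate is recall² / P(prediction = 1), with recall measured only on the labeled positives. The sketch below assumes that form and a 0.5 decision threshold; both are assumptions, not necessarily what the library does:

```python
import numpy as np

def estimate_pu_f1_sketch(s, prob_s_eq_1, threshold=0.5):
    # PU-learning F1 proxy: recall**2 / P(prediction == 1), where
    # recall is computed only over the labeled (s == 1) examples.
    s = np.asarray(s)
    pred = np.asarray(prob_s_eq_1) >= threshold  # assumed threshold
    recall = pred[s == 1].mean()
    frac_predicted_positive = pred.mean()
    if frac_predicted_positive == 0:
        return float("nan")
    return recall ** 2 / frac_predicted_positive
```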

cleanlab.internal.util.int2onehot(labels)[source]#

Convert list of lists to a onehot matrix for multi-labels

Parameters

labels (list of lists of integers) – e.g. [[0,1], [3], [1,2,3], [1], [2]] All integers from 0,1,…,K-1 must be represented.

cleanlab.internal.util.onehot2int(onehot_matrix)[source]#

Convert a onehot matrix for multi-labels to a list of lists of ints

Parameters

onehot_matrix (2D np.array of 0s and 1s) – A one hot encoded matrix representation of multi-labels.

Returns

labels – e.g. [[0,1], [3], [1,2,3], [1], [2]] All integers from 0,1,…,K-1 must be represented.

Return type

list of lists of integers
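A hypothetical round trip between the two representations, sketched for illustration (the `_sketch` names are mine, not the library's):

```python
import numpy as np

def int2onehot_sketch(labels):
    # e.g. [[0, 1], [1]] -> one row of 0/1 flags per example.
    K = max(max(row) for row in labels) + 1
    onehot = np.zeros((len(labels), K), dtype=int)
    for i, row in enumerate(labels):
        onehot[i, row] = 1
    return onehot

def onehot2int_sketch(onehot_matrix):
    # Inverse: collect the column indices that are set in each row.
    return [[j for j, v in enumerate(row) if v == 1] for row in onehot_matrix]
```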

cleanlab.internal.util.print_inverse_noise_matrix(inverse_noise_matrix, round_places=2)[source]#

Pretty prints the inverse noise matrix.

cleanlab.internal.util.print_joint_matrix(joint_matrix, round_places=2)[source]#

Pretty prints the joint label noise matrix.

cleanlab.internal.util.print_noise_matrix(noise_matrix, round_places=2)[source]#

Pretty prints the noise matrix.

cleanlab.internal.util.print_square_matrix(matrix, left_name='s', top_name='y', title=' A square matrix', short_title='s,y', round_places=2)[source]#

Pretty prints a matrix.

Parameters
• matrix (np.array) – the matrix to be printed

• left_name (str) – the name of the variable on the left of the matrix

• top_name (str) – the name of the variable on the top of the matrix

• title (str) – Prints this string above the printed square matrix.

• short_title (str) – A short title (6 characters or fewer) like P(labels|y) or P(labels,y).

• round_places (int) – Number of decimals to show for each matrix value.

cleanlab.internal.util.remove_noise_from_class(noise_matrix, class_without_noise)[source]#

A helper function in the setting of PU learning. Sets all P(label=class_without_noise|true_label=any_other_class) = 0 in noise_matrix, where the positive class of PU learning is generalized to any class of one’s choosing, denoted by class_without_noise.

Parameters
• noise_matrix (np.array of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(label=k_s|true_label=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.

• class_without_noise (int) – Integer value of the class that has no noise. Traditionally, this is 1 (positive) for PU learning.
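A minimal sketch, assuming the removed probability mass is restored by renormalizing each column back to 1 (the library may redistribute mass differently):

```python
import numpy as np

def remove_noise_from_class_sketch(noise_matrix, class_without_noise):
    # Zero out P(label=class_without_noise | true_label=k) for every
    # other class k, then renormalize columns to sum to 1 again.
    out = noise_matrix.astype(float).copy()
    cwn = class_without_noise
    diag_entry = out[cwn, cwn]
    out[cwn, :] = 0.0
    out[cwn, cwn] = diag_entry
    return out / out.sum(axis=0)
```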

cleanlab.internal.util.round_preserving_row_totals(confident_joint)[source]#

Rounds confident_joint cj to type int while preserving the total of each row. Assumes that cj is a 2D np.array of type float.

Parameters

confident_joint (2D np.array of shape (K, K)) – See compute_confident_joint docstring for details.

Returns

confident_joint – Rounded to int while preserving row totals.

Return type

2D np.array of shape (K,K)

cleanlab.internal.util.round_preserving_sum(iterable)[source]#

Rounds an iterable of floats while retaining the original summed value.

The while loop in this code was adapted from: https://github.com/cgdeboer/iteround

Parameters

iterable (list or np.array) – An iterable of floats

Returns

The iterable rounded to int, preserving sum.

Return type

list or np.array
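The idea can be sketched as: round everything, then nudge the entries whose rounding moved them furthest, one unit at a time, until the total matches. This is an illustration in the spirit of the iteround approach linked above, not the library's exact code:

```python
import numpy as np

def round_preserving_sum_sketch(iterable):
    # Round floats to ints, then repair the total by adjusting the
    # entries with the largest rounding error in the needed direction.
    floats = np.asarray(iterable, dtype=float)
    ints = np.round(floats).astype(int)
    deficit = int(round(floats.sum())) - ints.sum()
    while deficit != 0:
        step = 1 if deficit > 0 else -1
        errors = (floats - ints) * step  # largest error = best candidate
        i = int(np.argmax(errors))
        ints[i] += step
        deficit -= step
    return ints
```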

cleanlab.internal.util.smart_display_dataframe(df)[source]#

Display a pandas dataframe if in a jupyter notebook, otherwise print it to console.

cleanlab.internal.util.value_counts(x)[source]#

Returns an np.array of shape (K, 1), with the value counts for every unique item in the input list/array x, where K is the number of unique entries in x.

Why does this matter? Here is a timing comparison:

```python
x = [np.random.randint(0, 100) for i in range(100000)]

%timeit np.bincount(x)
# Result: 100 loops, best of 3: 3.9 ms per loop

%timeit np.unique(x, return_counts=True)[1]
# Result: 100 loops, best of 3: 7.47 ms per loop
```

Parameters

x (list or np.array (one dimensional)) – A list of discrete objects, like lists or strings, for example, class labels 'y' when training a classifier. e.g. ["dog", "dog", "cat"] or [1,2,0,1,1,0,2]
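For example, using `np.unique` directly as a stand-in sketch for the library function (the `_sketch` name is hypothetical):

```python
import numpy as np

def value_counts_sketch(x):
    # Count occurrences of each unique value; the (K, 1) shape
    # matches the docstring's description.
    return np.unique(x, return_counts=True)[1].reshape(-1, 1)
```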