util
Ancillary helper methods used internally throughout this package; mostly related to Confident Learning algorithms.
Functions:
| remove_noise_from_class | A helper function in the setting of PU learning. |
| clip_noise_rates | Clip all noise rates to proper range [0,1), but do not modify the diagonal terms because they are not noise rates. |
| clip_values | Clip all values in x to range [low,high]. |
| value_counts | Returns an np.ndarray of shape (K, 1), with the value counts for every unique item in the labels list/array, where K is the number of unique entries in labels. |
| value_counts_fill_missing_classes | Same as internal.util.value_counts but requires that num_classes is provided and always fills missing classes with zero counts. |
| get_missing_classes | Find which classes are present in pred_probs but not present in labels. |
| round_preserving_sum | Rounds an iterable of floats while retaining the original summed value. |
| round_preserving_row_totals | Rounds confident_joint cj to type int while preserving the totals of each row. |
| estimate_pu_f1 | Computes Claesen's estimate of f1 in the PU learning setting. |
| confusion_matrix | Implements a confusion matrix for true labels and predicted labels. |
| print_square_matrix | Pretty prints a matrix. |
| print_noise_matrix | Pretty prints the noise matrix. |
| print_inverse_noise_matrix | Pretty prints the inverse noise matrix. |
| print_joint_matrix | Pretty prints the joint label noise matrix. |
| compress_int_array | Compresses dtype of np.ndarray<int> if num_possible_values is small enough. |
| train_val_split | Splits data into training/validation sets based on given indices. |
| subset_X_y | Extracts subset of features/labels where mask is True. |
| subset_labels | Extracts subset of labels where mask is True. |
| subset_data | Extracts subset of data examples where mask (np.ndarray) is True. |
| extract_indices_tf | Extracts subset of tensorflow dataset corresponding to examples at particular indices. |
| unshuffle_tensorflow_dataset | Applies iterative inverse transformations to dataset to get version before ShuffleDataset was created. |
| csr_vstack | Takes in 2 csr_matrices and appends the second one to the bottom of the first one. |
| append_extra_datapoint | Appends an extra datapoint to the data object to_data. |
| get_num_classes | Determines the number of classes based on information considered in a canonical ordering. |
| num_unique_classes | Finds the number of unique classes for both single-labeled and multi-labeled labels. |
| get_unique_classes | Returns the set of unique classes for both single-labeled and multi-labeled labels. |
| format_labels | Takes an array of labels and formats it such that labels are in the set 0, 1, ..., K-1, where K is the number of classes. |
| smart_display_dataframe | Display a pandas dataframe if in a jupyter notebook, otherwise print it to console. |
| force_two_dimensions | Enforce the dimensionality of a dataset to two dimensions for the use of CleanLearning default classifier, which is sklearn.linear_model.LogisticRegression. |
- cleanlab.internal.util.remove_noise_from_class(noise_matrix, class_without_noise)
A helper function in the setting of PU learning. Sets all P(label=class_without_noise|true_label=any_other_class) = 0 in noise_matrix for the PU learning setting, where the positive class in PU learning is generalized to be any chosen class, denoted by class_without_noise.
- Parameters:
noise_matrix (np.ndarray of shape (K, K), K = number of classes) – A conditional probability matrix of the form P(label=k_s|true_label=k_y) containing the fraction of examples in every class, labeled as every other class. Assumes columns of noise_matrix sum to 1.
class_without_noise (int) – Integer value of the class that has no noise. Traditionally, this is 1 (positive) for PU learning.
- Return type:
ndarray
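A minimal sketch of the operation described above, for illustration only. The name remove_noise_sketch is hypothetical, and the step that returns the removed probability mass to the diagonal (so that each column still sums to 1) is an assumption not stated on this page.

```python
import numpy as np

def remove_noise_sketch(noise_matrix, class_without_noise):
    """Hypothetical re-implementation for illustration only."""
    x = np.array(noise_matrix, dtype=float)  # work on a copy
    k = x.shape[0]
    c = class_without_noise
    # Zero out P(label=c | true_label=j) for every j != c.
    others = [j for j in range(k) if j != c]
    x[c, others] = 0.0
    # Assumption: restore column sums to 1 by adjusting the diagonal terms.
    for j in range(k):
        off_diagonal = x[:, j].sum() - x[j, j]
        x[j, j] = 1.0 - off_diagonal
    return x

noise_matrix = np.array([[0.8, 0.3],
                         [0.2, 0.7]])
cleaned = remove_noise_sketch(noise_matrix, class_without_noise=1)
print(cleaned)
```

In this example the entry P(label=1|true_label=0) = 0.2 is set to zero and the first column's diagonal term rises from 0.8 to 1.0.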
- cleanlab.internal.util.clip_noise_rates(noise_matrix)
Clip all noise rates to proper range [0,1), but do not modify the diagonal terms because they are not noise rates.
Assumes noise_matrix columns sum to 1.
- Parameters:
noise_matrix (np.ndarray of shape (K, K), K = number of classes) – A conditional probability matrix containing the fraction of examples in every class, labeled as every other class. Diagonal terms are not noise rates, but are the consistency terms P(label=k|true_label=k). Assumes columns of noise_matrix sum to 1.
- Return type:
ndarray
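The clipping rule can be sketched as follows. This is not the library's code: the function name, the upper bound max_rate=0.9999 (standing in for the open end of [0,1)), and the final column re-normalization are all assumptions made for the illustration.

```python
import numpy as np

def clip_noise_rates_sketch(noise_matrix, max_rate=0.9999):
    """Hypothetical illustration: clip off-diagonal entries, keep the diagonal."""
    nm = np.array(noise_matrix, dtype=float)
    diagonal = np.diagonal(nm).copy()   # remember the diagonal terms
    nm = np.clip(nm, 0.0, max_rate)     # clip every entry into [0, max_rate]
    np.fill_diagonal(nm, diagonal)      # put the untouched diagonal back
    return nm / nm.sum(axis=0)          # assumption: re-normalize each column

noisy = np.array([[0.9, -0.2],
                  [0.1,  1.2]])
clipped = clip_noise_rates_sketch(noisy)
print(clipped)
```

Here the invalid noise rate -0.2 is clipped to 0, while the diagonal term 1.2 is only changed by the column re-normalization.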
- cleanlab.internal.util.clip_values(x, low=0.0, high=1.0, new_sum=None)
Clip all values in x to range [low,high]. Preserves sum of x.
- Parameters:
x (np.ndarray) – An array / list of values to be clipped.
low (float) – Values in x less than low are clipped to this value.
high (float) – Values in x greater than high are clipped to this value.
new_sum (float) – Normalizes x after clipping to sum to new_sum.
- Return type:
ndarray
- Returns:
x (np.ndarray) – A list of clipped values, summing to the same sum as x.
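One way to read "clip, then preserve the sum" is to clip first and rescale afterwards. The sketch below takes that reading as an assumption; clip_values_sketch is a hypothetical name, not the library function.

```python
import numpy as np

def clip_values_sketch(x, low=0.0, high=1.0, new_sum=None):
    """Hypothetical illustration: clip into [low, high], then rescale the sum."""
    x = np.asarray(x, dtype=float)
    target = x.sum() if new_sum is None else new_sum
    clipped = np.clip(x, low, high)
    # Rescale so the result sums to the target (fails if clipped sums to 0).
    return clipped * target / clipped.sum()

result = clip_values_sketch([0.2, 1.4, 0.4])
unit = clip_values_sketch([0.2, 1.4, 0.4], new_sum=1.0)
print(result, unit)
```

Under this reading the rescaling step can move values back outside [low, high]: in the example, 1.4 is clipped to 1.0 and then rescaled to 1.25 so that the total stays at 2.0.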
- cleanlab.internal.util.value_counts(x, *, num_classes=None, multi_label=False)
Returns an np.ndarray of shape (K, 1), with the value counts for every unique item in the labels list/array, where K is the number of unique entries in labels.
Works for both single-labeled and multi-labeled data.
- Parameters:
x (list or np.ndarray (one dimensional)) – A list of discrete objects, like lists or strings, for example, class labels 'y' when training a classifier, e.g. ["dog","dog","cat"] or [1,2,0,1,1,0,2].
num_classes (int (default: None)) – Setting this fills the value counts for missing classes with zeros. For example, if x = [0, 0, 1, 1, 3] then setting num_classes=5 returns [2, 2, 0, 1, 0] whereas setting num_classes=None would return [2, 2, 1]. This assumes your labels come from the set [0, 1, …, num_classes-1] even if some classes are missing.
multi_label (bool, optional) – If True, labels should be an iterable (e.g. list) of iterables, containing a list of labels for each example, instead of just a single label. Assumes all classes in pred_probs.shape[1] are represented in labels. The multi-label setting supports classification tasks where an example has 1 or more labels. Example of a multi-labeled labels input: [[0,1], [1], [0,2], [0,1,2], [0], [1], ...]. The major difference in how this is calibrated versus single-label is that the total number of errors considered is based on the number of labels, not the number of examples. So, the calibrated confident_joint will sum to the number of total labels.
- Return type:
ndarray
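The num_classes example above can be reproduced with a short NumPy sketch. It covers only the single-label case, and value_counts_sketch is a hypothetical stand-in rather than the library function.

```python
import numpy as np

def value_counts_sketch(x, num_classes=None):
    """Hypothetical single-label illustration returning shape (K, 1)."""
    x = np.asarray(x)
    if num_classes is None:
        # Count only the values that actually occur (sorted order).
        _, counts = np.unique(x, return_counts=True)
    else:
        # Assumes integer labels in 0..num_classes-1; absent classes get 0.
        counts = np.bincount(x, minlength=num_classes)
    return counts.reshape(-1, 1)

filled = value_counts_sketch([0, 0, 1, 1, 3], num_classes=5)
sparse = value_counts_sketch([0, 0, 1, 1, 3])
animals = value_counts_sketch(["dog", "dog", "cat"])
print(filled.ravel(), sparse.ravel(), animals.ravel())
```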
- cleanlab.internal.util.value_counts_fill_missing_classes(x, num_classes, *, multi_label=False)
Same as internal.util.value_counts but requires that num_classes is provided and always fills missing classes with zero counts.
See internal.util.value_counts for parameter docstrings.
- Return type:
ndarray
- cleanlab.internal.util.get_missing_classes(labels, *, pred_probs=None, num_classes=None, multi_label=False)
Find which classes are present in pred_probs but not present in labels.
See count.compute_confident_joint for parameter docstrings.
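As a rough illustration of the idea, the sketch below treats the classes in pred_probs as 0..num_classes-1 (for example num_classes = pred_probs.shape[1]) and reports those that never occur in single-label labels. The function name and the sorted-list return value are assumptions.

```python
def missing_classes_sketch(labels, num_classes):
    """Hypothetical illustration: classes in 0..num_classes-1 absent from labels."""
    present = set(labels)
    return sorted(set(range(num_classes)) - present)

missing = missing_classes_sketch([0, 0, 1, 1, 3], num_classes=5)
print(missing)  # [2, 4]
```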
- cleanlab.internal.util.round_preserving_sum(iterable)
Rounds an iterable of floats while retaining the original summed value.
The while loop in this code was adapted from: https://github.com/cgdeboer/iteround
- Parameters:
iterable (list or np.ndarray) – An iterable of floats.
- Return type:
ndarray
- Returns:
list or np.ndarray – The iterable rounded to int, preserving sum.
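A sum-preserving rounding can be sketched as: round every entry, then nudge the entries with the largest rounding error until the total matches the rounded original sum. The code below is one such sketch under that assumption; it is not the library's implementation, and the name is hypothetical.

```python
import numpy as np

def round_preserving_sum_sketch(values):
    """Hypothetical illustration of sum-preserving rounding."""
    floats = np.asarray(values, dtype=float)
    ints = np.round(floats)
    target = np.round(floats.sum())
    diff = int(target - ints.sum())
    while diff != 0:
        step = 1 if diff > 0 else -1
        n_changes = min(abs(diff), len(floats))
        residual = floats - ints          # > 0 means the entry was rounded down
        order = np.argsort(residual)      # ascending residuals
        if step > 0:
            order = order[::-1]           # bump the most under-rounded first
        ints[order[:n_changes]] += step
        diff = int(target - ints.sum())
    return ints.astype(int)

rounded = round_preserving_sum_sketch([1.3, 1.3, 1.4])
print(rounded)  # [1 1 2]
```

Plain rounding would give [1, 1, 1] with sum 3; the sketch bumps the entry with the largest residual (1.4) so the total stays at 4.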
- cleanlab.internal.util.round_preserving_row_totals(confident_joint)
Rounds confident_joint cj to type int while preserving the totals of each row. Assumes that cj is a 2D np.ndarray of type float.
- Parameters:
confident_joint (2D np.ndarray of shape (K, K)) – See compute_confident_joint docstring for details.
- Return type:
ndarray
- Returns:
confident_joint (2D np.ndarray of shape (K, K)) – Rounded to int while preserving row totals.
- cleanlab.internal.util.estimate_pu_f1(s, prob_s_eq_1)
Computes Claesen's estimate of f1 in the PU learning setting.
- Parameters:
s (iterable (list or np.ndarray)) – Binary label (whether each element is labeled or not) in PU learning.
prob_s_eq_1 (iterable (list or np.ndarray)) – The probability, for each example, that it has label=1, i.e. P(label=1|x).
- Return type:
float
- Returns:
float – Claesen's estimate for f1 in the PU learning setting.
- cleanlab.internal.util.confusion_matrix(true, pred)
Implements a confusion matrix for true labels and predicted labels. true and pred must be the same length and have the same distinct set of class labels represented.
Results are identical (with similar computation time) to sklearn.metrics.confusion_matrix. However, this function avoids the dependency on sklearn.
- Parameters:
true (np.ndarray 1d) – Contains labels. Assumes true and pred contain the same set of distinct labels.
pred (np.ndarray 1d) – A discrete vector of noisy labels, i.e. some labels may be erroneous. Format requirements: for dataset with K classes, labels must be in {0,1,…,K-1}.
- Return type:
ndarray
- Returns:
confusion_matrix (np.ndarray (2D)) – Matrix of confusion counts with true on rows and pred on columns.
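A dependency-free confusion matrix with true on rows and pred on columns can be sketched in a few lines of NumPy. The function below is a hypothetical illustration of that layout, not the library's code.

```python
import numpy as np

def confusion_matrix_sketch(true, pred):
    """Hypothetical illustration: counts[i, j] = #(true == i and pred == j)."""
    true = np.asarray(true)
    pred = np.asarray(pred)
    assert len(true) == len(pred), "true and pred must have the same length"
    classes = np.unique(np.concatenate([true, pred]))
    # Map each class label to its row/column position.
    true_idx = np.searchsorted(classes, true)
    pred_idx = np.searchsorted(classes, pred)
    counts = np.zeros((len(classes), len(classes)), dtype=int)
    np.add.at(counts, (true_idx, pred_idx), 1)  # rows: true, columns: pred
    return counts

cm = confusion_matrix_sketch([0, 0, 1, 1, 2], [0, 1, 1, 1, 2])
print(cm)
```

In the example, one class-0 example is predicted as class 1, so cm[0, 1] is 1 and every other off-diagonal entry is 0.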
- cleanlab.internal.util.print_square_matrix(matrix, left_name='s', top_name='y', title=' A square matrix', short_title='s,y', round_places=2)
Pretty prints a matrix.
- Parameters:
matrix (np.ndarray) – The matrix to be printed.
left_name (str) – The name of the variable on the left of the matrix.
top_name (str) – The name of the variable on the top of the matrix.
title (str) – Prints this string above the printed square matrix.
short_title (str) – A short title (6 characters or fewer) like P(labels|y) or P(labels,y).
round_places (int) – Number of decimals to show for each matrix value.
- cleanlab.internal.util.print_noise_matrix(noise_matrix, round_places=2)
Pretty prints the noise matrix.
- cleanlab.internal.util.print_inverse_noise_matrix(inverse_noise_matrix, round_places=2)
Pretty prints the inverse noise matrix.
- cleanlab.internal.util.print_joint_matrix(joint_matrix, round_places=2)
Pretty prints the joint label noise matrix.
- cleanlab.internal.util.compress_int_array(int_array, num_possible_values)
Compresses dtype of np.ndarray<int> if num_possible_values is small enough.
- Return type:
ndarray
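The page does not say what counts as "small enough". The sketch below assumes the thresholds are the maximum values of int16 and int32; both the thresholds and the function name are assumptions.

```python
import numpy as np

def compress_int_array_sketch(int_array, num_possible_values):
    """Hypothetical illustration: pick the smallest signed dtype that fits."""
    for dtype in (np.int16, np.int32):
        if num_possible_values < np.iinfo(dtype).max:
            return int_array.astype(dtype)
    return int_array  # too many possible values: leave the dtype unchanged

labels = np.array([0, 3, 1, 2], dtype=np.int64)
small = compress_int_array_sketch(labels, num_possible_values=4)
large = compress_int_array_sketch(labels, num_possible_values=10**12)
print(small.dtype, large.dtype)
```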
- cleanlab.internal.util.train_val_split(X, labels, train_idx, holdout_idx)
Splits data into training/validation sets based on given indices.
- Return type:
Tuple[Any, Any, Union[list, ndarray, Series, DataFrame], Union[list, ndarray, Series, DataFrame]]
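For plain NumPy inputs, an index-based split reduces to fancy indexing. The sketch below handles only that case. The return order (two feature subsets, then two label subsets) is inferred from the return type above and is otherwise an assumption, as is the function name.

```python
import numpy as np

def train_val_split_sketch(X, labels, train_idx, holdout_idx):
    """Hypothetical NumPy-only illustration of an index-based split."""
    X, labels = np.asarray(X), np.asarray(labels)
    return X[train_idx], X[holdout_idx], labels[train_idx], labels[holdout_idx]

X = np.arange(10).reshape(5, 2)
labels = np.array([0, 1, 0, 1, 1])
X_train, X_val, y_train, y_val = train_val_split_sketch(
    X, labels, train_idx=[0, 1, 2], holdout_idx=[3, 4]
)
print(X_train.shape, X_val.shape)
```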
- cleanlab.internal.util.subset_X_y(X, labels, mask)
Extracts subset of features/labels where mask is True.
- Return type:
Tuple[Any, Union[list, ndarray, Series, DataFrame]]
- cleanlab.internal.util.subset_labels(labels, mask)
Extracts subset of labels where mask is True.
- Return type:
Union[list, ndarray, Series]
- cleanlab.internal.util.subset_data(X, mask)
Extracts subset of data examples where mask (np.ndarray) is True.
- Return type:
Any
- cleanlab.internal.util.extract_indices_tf(X, idx, allow_shuffle)
Extracts subset of tensorflow dataset corresponding to examples at particular indices.
- Return type:
Any
- Parameters:
X (tensorflow.data.Dataset)
idx (array_like) – Integer indices corresponding to examples to keep in the dataset. Returns subset of examples in the dataset X that correspond to these indices.
allow_shuffle (bool) – Whether or not shuffling of this data is allowed (e.g. must turn off shuffling for validation data).
Note: this code only works on Datasets in which:
- shuffle() has been called before batch(),
- no other order-destroying operation (e.g. repeat()) has been applied.
Indices are extracted from the original version of Dataset (before shuffle was called rather than in shuffled order).
- cleanlab.internal.util.unshuffle_tensorflow_dataset(X)
Applies iterative inverse transformations to dataset to get version before ShuffleDataset was created. If no ShuffleDataset is in the transformation-history of this dataset, returns None.
- Parameters:
X – A tensorflow Dataset that may have been created via a series of transformations, one being shuffle.
- Return type:
tuple
- Returns:
Tuple (pre_X, buffer_size) where:
pre_X – Dataset that was previously transformed to get ShuffleDataset (or None).
buffer_size (int) – buffer_size previously used in ShuffleDataset, or len(pre_X) if buffer_size cannot be determined, or None if no ShuffleDataset found.
- cleanlab.internal.util.csr_vstack(a, b)
Takes in 2 csr_matrices and appends the second one to the bottom of the first one. Alternative to scipy.sparse.vstack. Returns a sparse matrix.
- Return type:
Any
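Stacking two CSR matrices vertically amounts to concatenating their three underlying arrays and shifting the second matrix's row pointers. The sketch below shows that on raw (data, indices, indptr) triples so it runs with NumPy alone. The triple-based interface and the function name are assumptions made for the illustration.

```python
import numpy as np

def csr_vstack_arrays(a, b):
    """Hypothetical illustration on (data, indices, indptr) triples.

    Both inputs are assumed to have the same number of columns.
    """
    a_data, a_indices, a_indptr = a
    b_data, b_indices, b_indptr = b
    data = np.concatenate([a_data, b_data])
    indices = np.concatenate([a_indices, b_indices])
    # Rows of b start where a's stored values end, so shift b's row pointers.
    indptr = np.concatenate([a_indptr, b_indptr[1:] + a_indptr[-1]])
    return data, indices, indptr

# a = [[1, 0, 2]]            b = [[0, 3, 0],
#                                 [4, 0, 0]]
a = (np.array([1, 2]), np.array([0, 2]), np.array([0, 2]))
b = (np.array([3, 4]), np.array([1, 0]), np.array([0, 1, 2]))
data, indices, indptr = csr_vstack_arrays(a, b)
print(data, indices, indptr)
```

If SciPy is available, the resulting arrays can be passed to scipy.sparse.csr_matrix((data, indices, indptr), shape=(3, 3)) to obtain the stacked matrix.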
- cleanlab.internal.util.append_extra_datapoint(to_data, from_data, index)
Appends an extra datapoint to the data object to_data. This datapoint is taken from the data object from_data at the corresponding index. One place this could be useful is ensuring no missing classes after train/validation split.
- Return type:
Any
- cleanlab.internal.util.get_num_classes(labels=None, pred_probs=None, label_matrix=None, multi_label=None)
Determines the number of classes based on information considered in a canonical ordering. label_matrix can be: noise_matrix, inverse_noise_matrix, confident_joint, or any other K x K matrix where K = number of classes.
- Return type:
int
- cleanlab.internal.util.num_unique_classes(labels, multi_label=None)
Finds the number of unique classes for both single-labeled and multi-labeled labels. If multi_label is set to None (default) this method will infer if multi_label is True or False based on the format of labels. This allows for a more general form of multiclass labels that looks like this: [1, [1,2], [0], [0, 1], 2, 1]
- Return type:
int
- cleanlab.internal.util.get_unique_classes(labels, multi_label=None)
Returns the set of unique classes for both single-labeled and multi-labeled labels. If multi_label is set to None (default) this method will infer if multi_label is True or False based on the format of labels. This allows for a more general form of multiclass labels that looks like this: [1, [1,2], [0], [0, 1], 2, 1]
- Return type:
set
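The mixed label format shown above can be handled by treating list-like entries as multi-label and everything else as a single label. The sketch below illustrates that inference. It is a hypothetical stand-in and does not reproduce how the library interprets the multi_label argument.

```python
def unique_classes_sketch(labels):
    """Hypothetical illustration: collect classes from mixed-format labels."""
    classes = set()
    for entry in labels:
        if isinstance(entry, (list, tuple, set)):
            classes.update(entry)   # multi-label entry
        else:
            classes.add(entry)      # single-label entry
    return classes

mixed = [1, [1, 2], [0], [0, 1], 2, 1]
classes = unique_classes_sketch(mixed)
print(classes, len(classes))
```

The number of unique classes is then simply the size of the returned set.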
- cleanlab.internal.util.format_labels(labels)
Takes an array of labels and formats it such that labels are in the set 0, 1, ..., K-1, where K is the number of classes. The labels are assigned based on lexicographic order. This is useful for mapping string class labels to the integer format required by many cleanlab (and sklearn) functions.
- Return type:
Tuple[ndarray, dict]
- Returns:
formatted_labels – Returns np.ndarray of shape (N,). The return labels will be properly formatted and can be passed to other cleanlab functions.
mapping – A dictionary showing the mapping of new to old labels, such that mapping[k] returns the name of the k-th class.
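The lexicographic mapping described above can be sketched with np.unique, which returns the sorted unique values together with each element's position among them. The function name below is hypothetical and the code is an illustration, not the library's implementation.

```python
import numpy as np

def format_labels_sketch(labels):
    """Hypothetical illustration: map labels to 0..K-1 in sorted order."""
    classes, formatted = np.unique(np.asarray(labels), return_inverse=True)
    mapping = {k: classes[k] for k in range(len(classes))}
    return formatted.reshape(-1), mapping

formatted, mapping = format_labels_sketch(["dog", "cat", "dog", "bird"])
print(formatted, mapping)
```

Here "bird", "cat" and "dog" become 0, 1 and 2, so the formatted labels are [2, 1, 2, 0] and mapping[2] recovers "dog".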
- cleanlab.internal.util.smart_display_dataframe(df)
Display a pandas dataframe if in a jupyter notebook, otherwise print it to console.
- cleanlab.internal.util.force_two_dimensions(X)
Enforce the dimensionality of a dataset to two dimensions for the use of CleanLearning default classifier, which is sklearn.linear_model.LogisticRegression.
- Parameters:
X (np.ndarray or DatasetLike)
- Return type:
Any
- Returns:
X (np.ndarray or DatasetLike) – The original dataset reduced to two dimensions, so that the dataset will have the shape (N, sum(...)), where N is still the number of examples.
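For a NumPy array, reducing to two dimensions while keeping N examples can be sketched as a reshape that flattens every trailing axis into one. The sketch assumes a plain np.ndarray input and uses a hypothetical function name; other DatasetLike inputs are not covered.

```python
import numpy as np

def force_two_dimensions_sketch(X):
    """Hypothetical illustration: flatten all axes after the first."""
    X = np.asarray(X)
    return X.reshape(len(X), -1) if X.ndim > 2 else X

images = np.zeros((4, 2, 3))   # e.g. 4 examples, each a 2x3 array
flat = force_two_dimensions_sketch(images)
print(flat.shape)  # (4, 6)
```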