label_issues_batched

A memory-efficient implementation of filter.find_label_issues that operates in mini-batches. You can also use this approach to estimate label quality scores or the number of label issues for big datasets when memory is limited.

With default settings, the results returned from this approach closely approximate those returned from: cleanlab.filter.find_label_issues(..., filter_by="low_self_confidence", return_indices_ranked_by="self_confidence")

To run this approach, either use the find_label_issues_batched() convenience function defined in this module, or follow the examples script for the LabelInspector class if you require greater customization.
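The default criterion mentioned above, self-confidence, is the probability the model assigns to each example's given label. The following is a minimal numpy-only sketch on made-up toy arrays; it illustrates the scoring idea and is not cleanlab's implementation.

```python
import numpy as np

# Toy inputs (made up for illustration): 4 examples, 3 classes.
labels = np.array([0, 1, 2, 1])
pred_probs = np.array([
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],   # the given label (2) receives only 0.1 probability
    [0.2, 0.5, 0.3],
])

# Self-confidence: the probability assigned to each example's given label.
self_confidence = pred_probs[np.arange(len(labels)), labels]

# Rank examples from least to most confident in their given label.
ranked_indices = np.argsort(self_confidence)
```

Examples at the front of ranked_indices are the most likely to be mislabeled under this criterion.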

Data:

Functions:

 find_label_issues_batched([labels, ...]) Variant of filter.find_label_issues that requires less memory by reading from pred_probs, labels in mini-batches.
 split_arr(arr, chunksize) Helper function to split array into chunks for multiprocessing.

Classes:

 LabelInspector(*, num_class[, ...]) Class for finding label issues in big datasets where memory becomes a problem for other cleanlab methods.
cleanlab.experimental.label_issues_batched.labels_shared: Union[list, ndarray, Series, DataFrame]
cleanlab.experimental.label_issues_batched.pred_probs_shared: ndarray
cleanlab.experimental.label_issues_batched.find_label_issues_batched(labels=None, pred_probs=None, *, labels_file=None, pred_probs_file=None, batch_size=10000, n_jobs=1, verbose=True, quality_score_kwargs=None, num_issue_kwargs=None)

Variant of filter.find_label_issues that requires less memory by reading from pred_probs, labels in mini-batches. To avoid loading big pred_probs, labels arrays into memory, provide these as memory-mapped objects like Zarr arrays or memmap arrays instead of regular numpy arrays. See: https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/

With default settings, the results returned from this method closely approximate those returned from: cleanlab.filter.find_label_issues(..., filter_by="low_self_confidence", return_indices_ranked_by="self_confidence")

This function internally implements the example usage script of the LabelInspector class, but you can further customize that script by running it yourself instead of this function. See the documentation of LabelInspector to learn more about how this method works internally.

Parameters:
• labels (np.ndarray-like object, optional) –

1D array of given class labels for each example in the dataset, (int) values in 0,1,2,...,K-1. To avoid loading big objects into memory, you should pass this as a memory-mapped object like: Zarr array loaded with zarr.convenience.open(YOURFILE.zarr, mode="r"), or memmap array loaded with np.load(YOURFILE.npy, mmap_mode="r").

Tip: You can save an existing numpy array to Zarr via: zarr.convenience.save_array(YOURFILE.zarr, your_array), or to .npy file that can be loaded with mmap via: np.save(YOURFILE.npy, your_array).

• pred_probs (np.ndarray-like object, optional) – 2D array of model-predicted class probabilities (floats) for each example in the dataset. To avoid loading big objects into memory, you should pass this as a memory-mapped object like: Zarr array loaded with zarr.convenience.open(YOURFILE.zarr, mode="r") or memmap array loaded with np.load(YOURFILE.npy, mmap_mode="r").

• labels_file (str, optional) – Specify this instead of labels if you want this method to load from file for you into a memmap array. Path to .npy file where the entire 1D labels numpy array is stored on disk (list format is not supported). This is loaded using: np.load(labels_file, mmap_mode="r") so make sure this file was created via: np.save() or other compatible methods (.npz not supported).

• pred_probs_file (str, optional) – Specify this instead of pred_probs if you want this method to load from file for you into a memmap array. Path to .npy file where the entire pred_probs numpy array is stored on disk. This is loaded using: np.load(pred_probs_file, mmap_mode="r") so make sure this file was created via: np.save() or other compatible methods (.npz not supported).

• batch_size (int, optional) – Size of mini-batches to use for estimating the label issues. To maximize efficiency, try to use the largest batch_size your memory allows.

• n_jobs (int, optional) – Number of processes for multiprocessing (default: 1). Only used on Linux. If n_jobs=None, uses the number of physical cores if psutil is installed, and the number of logical cores otherwise.

• verbose (bool, optional) – Whether to print progress statements. Set to False to suppress them.

• quality_score_kwargs (dict, optional) – Keyword arguments to pass into rank.get_label_quality_scores.

• num_issue_kwargs (dict, optional) – Keyword arguments to count.num_label_issues to control estimation of the number of label issues. The only supported kwarg here for now is: estimation_method.
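The n_jobs=None fallback described in the parameter list can be pictured with the sketch below. The helper name resolve_n_jobs is hypothetical and this is only an assumption about how such a fallback might be written, not the library's code.

```python
import os


def resolve_n_jobs(n_jobs=None):
    """Hypothetical helper: choose a process count when n_jobs is None."""
    if n_jobs is not None:
        return n_jobs
    try:
        import psutil  # optional dependency

        # Physical cores; psutil may return None on some platforms.
        physical = psutil.cpu_count(logical=False)
        if physical:
            return physical
    except ImportError:
        pass
    # Fall back to logical cores (or 1 if the count cannot be determined).
    return os.cpu_count() or 1
```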

Return type:

ndarray

Returns:

issue_indices (np.ndarray) – Indices of examples with label issues, sorted by label quality score (most severe issues first).

Examples

>>> batch_size = 10000  # for efficiency, set this to as large of a value as your memory can handle
>>> # Just demonstrating how to save your existing numpy labels, pred_probs arrays to compatible .npy files:
>>> np.save("LABELS.npy", labels_array)
>>> np.save("PREDPROBS.npy", pred_probs_array)
>>> # You can load these back into memmap arrays via: labels = np.load("LABELS.npy", mmap_mode="r")
>>> # and then run this method on the memmap arrays, or just run it directly on the .npy files like this:
>>> issues = find_label_issues_batched(labels_file="LABELS.npy", pred_probs_file="PREDPROBS.npy", batch_size=batch_size)
>>> # This method also works with Zarr arrays:
>>> import zarr
>>> # Just demonstrating how to save your existing numpy labels, pred_probs arrays to compatible .zarr files:
>>> zarr.convenience.save_array("LABELS.zarr", labels_array)
>>> zarr.convenience.save_array("PREDPROBS.zarr", pred_probs_array)
>>> # You can load from such files into Zarr arrays:
>>> labels = zarr.convenience.open("LABELS.zarr", mode="r")
>>> pred_probs = zarr.convenience.open("PREDPROBS.zarr", mode="r")
>>> # This method can be directly run on Zarr arrays, memmap arrays, or regular numpy arrays:
>>> issues = find_label_issues_batched(labels=labels, pred_probs=pred_probs, batch_size=batch_size)
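If pred_probs_array itself is too large to build in memory before calling np.save, one option is to write the .npy file batch by batch. The sketch below uses np.lib.format.open_memmap for this. The toy sizes, the temporary path, and the random rows standing in for model outputs are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

num_examples, num_classes, batch_size = 25, 3, 10  # toy sizes for illustration
path = os.path.join(tempfile.mkdtemp(), "PREDPROBS.npy")

# Create an .npy file with its final shape without allocating the array in RAM.
out = np.lib.format.open_memmap(
    path, mode="w+", dtype=np.float32, shape=(num_examples, num_classes)
)

rng = np.random.default_rng(0)
for start in range(0, num_examples, batch_size):
    end = min(start + batch_size, num_examples)
    # Stand-in for a model's predicted probabilities on this batch:
    # random rows normalized so that each row sums to 1.
    batch = rng.random((end - start, num_classes), dtype=np.float32)
    out[start:end] = batch / batch.sum(axis=1, keepdims=True)

out.flush()  # make sure all batches are written to disk
del out

# The file can now be opened lazily as a memmap array.
pred_probs = np.load(path, mmap_mode="r")
```

The resulting file can then be passed via pred_probs_file, or the memmap array via pred_probs.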

class cleanlab.experimental.label_issues_batched.LabelInspector(*, num_class, store_results=True, verbose=True, quality_score_kwargs=None, num_issue_kwargs=None, n_jobs=1)

Bases: object

Class for finding label issues in big datasets where memory becomes a problem for other cleanlab methods. Create only one such object per dataset; do not reuse the same LabelInspector across two datasets. For efficiency, this class does little input checking. You can first run filter.find_label_issues on a small subset of your data to verify your inputs are properly formatted. Do NOT modify any of the attributes of this class yourself! This class supports multi-class classification only; multi-label classification is not supported.

The recommended usage demonstrated in the examples script below involves two passes over your data: one pass to compute confident_thresholds, another to evaluate each label. To maximize efficiency, try to use the largest batch_size your memory allows. To reduce runtime further, you can run the first pass on a subset of your dataset as long as it contains enough data from each class to estimate confident_thresholds accurately.
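To see why the first pass can be done in mini-batches, the sketch below accumulates per-class running sums and counts. It assumes each confident threshold is the average predicted probability of class k over examples labeled k. The function name streaming_class_means and the toy data are hypothetical, and this is not a description of LabelInspector's internals.

```python
import numpy as np


def streaming_class_means(batches, num_class):
    """Per-class mean of pred_probs[:, k] over examples labeled k, batch by batch."""
    sums = np.zeros(num_class)
    counts = np.zeros(num_class)
    for labels_batch, pred_probs_batch in batches:
        labels_batch = np.asarray(labels_batch)
        pred_probs_batch = np.asarray(pred_probs_batch)
        # Probability assigned to each example's given label.
        self_conf = pred_probs_batch[np.arange(len(labels_batch)), labels_batch]
        sums += np.bincount(labels_batch, weights=self_conf, minlength=num_class)
        counts += np.bincount(labels_batch, minlength=num_class)
    # Guard against division by zero for classes never seen.
    return sums / np.maximum(counts, 1)


# Toy data (made up): 50 examples, 3 classes, processed in batches of 20.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=50)
pred_probs = rng.dirichlet(np.ones(3), size=50)
batches = ((labels[i:i + 20], pred_probs[i:i + 20]) for i in range(0, 50, 20))
thresholds = streaming_class_means(batches, num_class=3)
```

Because only sums and counts are kept between batches, memory use does not grow with the dataset size.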

In the examples script below:
- labels is a (big) 1D np.ndarray of class labels represented as integers in 0,1,...,K-1.
- pred_probs is a (big) 2D np.ndarray of predicted class probabilities, where each row is an example and each column represents a class.

labels and pred_probs can be stored in a file instead where you load chunks of them at a time. Methods to load arrays in chunks include: np.load(...,mmap_mode='r'), numpy.memmap(), HDF5 or Zarr files, see: https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/
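As one illustration of chunked loading, the sketch below uses numpy.memmap() on a raw binary file and copies one batch at a time into memory. The file name, toy sizes, and uniform fill values are assumptions for illustration only.

```python
import os
import tempfile

import numpy as np

n, num_class, batch_size = 23, 4, 10  # toy sizes for illustration
path = os.path.join(tempfile.mkdtemp(), "pred_probs.dat")

# Write a raw binary file; unlike .npy, it stores no dtype/shape header.
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(n, num_class))
writer[:] = 1.0 / num_class
writer.flush()
del writer

# Reopen read-only; the caller must supply dtype and shape again.
pred_probs = np.memmap(path, dtype=np.float32, mode="r", shape=(n, num_class))

batch_shapes = []
for start in range(0, n, batch_size):
    # np.array(...) copies just this slice into an in-memory array.
    batch = np.array(pred_probs[start:start + batch_size])
    batch_shapes.append(batch.shape)
```

The final batch is smaller than batch_size whenever n is not a multiple of it, which slicing handles without extra bookkeeping.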

Examples

>>> n = len(labels)
>>> batch_size = 10000  # you can change this in between batches, set as big as your RAM allows
>>> lab = LabelInspector(num_class = pred_probs.shape[1])
>>> # First compute confident thresholds (for faster results, can also do this on a random subset of your data):
>>> i = 0
>>> while i < n:
...     end_index = i + batch_size
...     labels_batch = labels[i:end_index]
...     pred_probs_batch = pred_probs[i:end_index,:]
...     i = end_index
...     lab.update_confident_thresholds(labels_batch, pred_probs_batch)
>>> # See what we calculated:
>>> confident_thresholds = lab.get_confident_thresholds()
>>> # Evaluate the quality of the labels (run this on full dataset you want to evaluate):
>>> i = 0
>>> while i < n:
...     end_index = i + batch_size
...     labels_batch = labels[i:end_index]
...     pred_probs_batch = pred_probs[i:end_index,:]
...     i = end_index
...     batch_results = lab.score_label_quality(labels_batch, pred_probs_batch)
>>> # Indices of examples with label issues, sorted by label quality score (most severe to least severe):
>>> indices_of_examples_with_issues = lab.get_label_issues()
>>> # If your pred_probs and labels are arrays already in memory,
>>> # then you can use this shortcut for all of the above:
>>> indices_of_examples_with_issues = find_label_issues_batched(labels, pred_probs, batch_size=10000)

Parameters:
• num_class (int) – The number of classes in your multi-class classification task.

• store_results (bool, optional) – Whether this object will store all label quality scores, a 1D array of shape (N,) where N is the total number of examples in your dataset. Set this to False if you encounter memory problems even for small batch sizes (~1000). If False, you can still identify the label issues yourself by aggregating the label quality scores for each batch, sorting them across all batches, and returning the top T indices with T = self.get_num_issues().

• verbose (bool, optional) – Whether to print progress statements. Set to False to suppress them.

• n_jobs (int, optional) – Number of processes for multiprocessing (default: 1). Only used on Linux. If n_jobs=None, uses the number of physical cores if psutil is installed, and the number of logical cores otherwise.

• quality_score_kwargs (dict, optional) – Keyword arguments to pass into rank.get_label_quality_scores.

• num_issue_kwargs (dict, optional) – Keyword arguments to count.num_label_issues to control estimation of the number of label issues. The only supported kwarg here for now is: estimation_method.
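When store_results=False, the aggregation described for that parameter could look like the sketch below: collect the per-batch scores in dataset order, sort them, and keep the T lowest-scored indices. The function name top_issue_indices and the toy scores are hypothetical. In practice the per-batch scores would come from score_label_quality() and T from get_num_issues().

```python
import numpy as np


def top_issue_indices(score_batches, num_issues):
    """Combine per-batch scores (in dataset order); return the lowest-scored indices."""
    all_scores = np.concatenate([np.asarray(s) for s in score_batches])
    ranked = np.argsort(all_scores)  # lowest label quality score first
    return ranked[:num_issues]


# Toy per-batch scores standing in for the outputs of score_label_quality().
score_batches = [np.array([0.9, 0.2, 0.7]), np.array([0.1, 0.8]), np.array([0.4])]
# num_issues would come from get_num_issues() in practice; 2 is a made-up value.
issue_indices = top_issue_indices(score_batches, num_issues=2)
```

The returned indices refer to positions in the full dataset only if the batches are supplied in their original order.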

Methods:

 get_confident_thresholds([silent]) Fetches already-computed confident thresholds from the data seen so far in same format as: count.get_confident_thresholds.
 get_num_issues([silent]) Fetches already-computed estimate of the number of label issues in the data seen so far in the same format as: count.num_label_issues.
 get_quality_scores() Fetches already-computed estimate of the label quality of each example seen so far in the same format as: rank.get_label_quality_scores.
 get_label_issues() Fetches already-computed estimate of indices of examples with label issues in the data seen so far, in the same format as: filter.find_label_issues with its return_indices_ranked_by argument specified.
 update_confident_thresholds(labels, pred_probs) Updates the estimate of confident_thresholds stored in this class using a new batch of data.
 score_label_quality(labels, pred_probs, *[, ...]) Scores the label quality of each example in the provided batch of data, and also updates the number of label issues stored in this class.
get_confident_thresholds(silent=False)

Fetches already-computed confident thresholds from the data seen so far in same format as: count.get_confident_thresholds.

Return type:

ndarray

Returns:

confident_thresholds (np.ndarray) – An array of shape (K, ) where K is the number of classes.

get_num_issues(silent=False)

Fetches already-computed estimate of the number of label issues in the data seen so far in the same format as: count.num_label_issues.

Note: The estimated number of issues may differ from count.num_label_issues by 1 due to rounding differences.

Return type:

int

Returns:

num_issues (int) – The estimated number of examples with label issues in the data seen so far.

get_quality_scores()

Fetches already-computed estimate of the label quality of each example seen so far in the same format as: rank.get_label_quality_scores.

Return type:

ndarray

Returns:

label_quality_scores (np.ndarray) – Contains one score (between 0 and 1) per example seen so far. Lower scores indicate more likely mislabeled examples.

get_label_issues()

Fetches already-computed estimate of indices of examples with label issues in the data seen so far, in the same format as: filter.find_label_issues with its return_indices_ranked_by argument specified.

Note: this method corresponds to filter.find_label_issues(..., filter_by=METHOD1, return_indices_ranked_by=METHOD2) where by default: METHOD1="low_self_confidence", METHOD2="self_confidence" or if this object was instantiated with quality_score_kwargs = {"method": "normalized_margin"} then we instead have: METHOD1="low_normalized_margin", METHOD2="normalized_margin".

Note: The estimated number of issues may differ from filter.find_label_issues by 1 due to rounding differences.

Return type:

ndarray

Returns:

issue_indices (np.ndarray) – Indices of examples with label issues, sorted by label quality score (most severe issues first).
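For readers comparing the two scoring options named in the note above, the sketch below computes a normalized-margin score. It assumes the common definition: probability of the given label minus the largest probability among the other classes, rescaled to lie between 0 and 1. The function is illustrative only; consult rank.get_label_quality_scores for the exact formula cleanlab uses.

```python
import numpy as np


def normalized_margin(labels, pred_probs):
    """Illustrative normalized-margin score under the assumed definition."""
    labels = np.asarray(labels)
    rows = np.arange(len(labels))
    self_conf = pred_probs[rows, labels]
    others = np.array(pred_probs, dtype=float)  # copy so the input is untouched
    others[rows, labels] = -np.inf  # mask out the given label
    margin = self_conf - others.max(axis=1)
    return (margin + 1) / 2  # rescale from [-1, 1] to [0, 1]


# Toy inputs (made up): 2 examples, 3 classes.
labels = np.array([0, 2])
pred_probs = np.array([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]])
scores = normalized_margin(labels, pred_probs)
```

As with self-confidence, lower scores indicate examples more likely to be mislabeled.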

update_confident_thresholds(labels, pred_probs)

Updates the estimate of confident_thresholds stored in this class using a new batch of data. Inputs should be in same format as for: count.get_confident_thresholds.

Parameters:
• labels (np.ndarray or list) – Given class labels for each example in the batch, values in 0,1,2,...,K-1.

• pred_probs (np.ndarray) – 2D array of model-predicted class probabilities for each example in the batch.

score_label_quality(labels, pred_probs, *, update_num_issues=True)

Scores the label quality of each example in the provided batch of data, and also updates the number of label issues stored in this class. Inputs should be in same format as for: rank.get_label_quality_scores.

Parameters:
• labels (np.ndarray) – Given class labels for each example in the batch, values in 0,1,2,...,K-1.

• pred_probs (np.ndarray) – 2D array of model-predicted class probabilities for each example in the batch of data.

• update_num_issues (bool, optional) – Whether to update the estimated number of label issues (True) or only compute label quality scores (False). For lower runtimes, set this to False if you only want to score label quality and not find label issues.

Return type:

ndarray

Returns:

label_quality_scores (np.ndarray) – Contains one score (between 0 and 1) for each example in the batch of data.

cleanlab.experimental.label_issues_batched.split_arr(arr, chunksize)

Helper function to split array into chunks for multiprocessing.

Return type:

List[ndarray]
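One way such a chunking helper could be written is sketched below with np.split. The name split_into_chunks is a hypothetical stand-in, and the sketch makes no claim about how split_arr itself is implemented.

```python
import numpy as np


def split_into_chunks(arr, chunksize):
    """Hypothetical stand-in: split along axis 0 into pieces of at most chunksize rows."""
    split_points = np.arange(chunksize, arr.shape[0], chunksize)
    return np.split(arr, split_points, axis=0)


chunks = split_into_chunks(np.arange(10), chunksize=4)
chunk_lengths = [len(c) for c in chunks]
```

Each chunk can then be handed to a separate worker process, with the last chunk holding whatever remains.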