label_issues_batched#
Implementation of filter.find_label_issues that requires little memory by operating in mini-batches.
You can also use this approach to estimate label quality scores or the number of label issues
for big datasets with limited memory.
With default settings, the results returned from this approach closely approximate those returned from:
cleanlab.filter.find_label_issues(..., filter_by="low_self_confidence", return_indices_ranked_by="self_confidence")
To run this approach, either use the find_label_issues_batched() convenience function defined in this module,
or follow the example script for the LabelInspector class if you require greater customization.
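For illustration, here is a minimal sketch comparing the two approaches on a dataset small enough to also run the non-batched method (labels_array and pred_probs_array are placeholder names, not part of this module):
>>> from cleanlab.experimental.label_issues_batched import find_label_issues_batched
>>> from cleanlab.filter import find_label_issues
>>> issues_batched = find_label_issues_batched(labels=labels_array, pred_probs=pred_probs_array)
>>> issues_regular = find_label_issues(labels_array, pred_probs_array,
>>>     filter_by="low_self_confidence", return_indices_ranked_by="self_confidence")
>>> # The two index arrays should closely (though not necessarily exactly) agree.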
Functions:

- find_label_issues_batched – Variant of filter.find_label_issues that requires less memory by operating in mini-batches.
- A helper function (used internally) to split an array into chunks for multiprocessing.
Classes:

- LabelInspector – Class for finding label issues in big datasets where memory becomes a problem for other cleanlab methods.
- cleanlab.experimental.label_issues_batched.find_label_issues_batched(labels=None, pred_probs=None, *, labels_file=None, pred_probs_file=None, batch_size=10000, n_jobs=1, verbose=True, quality_score_kwargs=None, num_issue_kwargs=None)[source]#
Variant of filter.find_label_issues that requires less memory by reading pred_probs and labels in mini-batches.
To avoid loading big pred_probs and labels arrays into memory, provide these as memory-mapped objects like Zarr arrays or memmap arrays instead of regular numpy arrays. See: https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/
With default settings, the results returned from this method closely approximate those returned from:
cleanlab.filter.find_label_issues(..., filter_by="low_self_confidence", return_indices_ranked_by="self_confidence")
This function internally implements the example usage script of the LabelInspector class, but you can further customize that script by running it yourself instead of this function. See the documentation of LabelInspector to learn more about how this method works internally.
- Parameters:
labels (np.ndarray-like object, optional) – 1D array of given class labels for each example in the dataset, (int) values in 0,1,2,...,K-1. To avoid loading big objects into memory, you should pass this as a memory-mapped object like: a Zarr array loaded with zarr.convenience.open(YOURFILE.zarr, mode="r"), or a memmap array loaded with np.load(YOURFILE.npy, mmap_mode="r").
Tip: You can save an existing numpy array to Zarr via: zarr.convenience.save_array(YOURFILE.zarr, your_array), or to a .npy file that can be loaded with mmap via: np.save(YOURFILE.npy, your_array).

pred_probs (np.ndarray-like object, optional) – 2D array of model-predicted class probabilities (floats) for each example in the dataset. To avoid loading big objects into memory, you should pass this as a memory-mapped object like: a Zarr array loaded with zarr.convenience.open(YOURFILE.zarr, mode="r") or a memmap array loaded with np.load(YOURFILE.npy, mmap_mode="r").

labels_file (str, optional) – Specify this instead of labels if you want this method to load from file into a memmap array for you. Path to a .npy file where the entire 1D labels numpy array is stored on disk (list format is not supported). This is loaded using: np.load(labels_file, mmap_mode="r"), so make sure this file was created via np.save() or other compatible methods (.npz not supported).

pred_probs_file (str, optional) – Specify this instead of pred_probs if you want this method to load from file into a memmap array for you. Path to a .npy file where the entire pred_probs numpy array is stored on disk. This is loaded using: np.load(pred_probs_file, mmap_mode="r"), so make sure this file was created via np.save() or other compatible methods (.npz not supported).

batch_size (int, optional) – Size of the mini-batches used for estimating the label issues. To maximize efficiency, try to use the largest batch_size your memory allows.

n_jobs (int, optional) – Number of processes for multiprocessing (default value = 1). Only used on Linux. If n_jobs=None, will use either the number of physical cores (if psutil is installed) or the number of logical cores (otherwise).

verbose (bool, optional) – Whether to print status updates while this method runs (set False to suppress them).

quality_score_kwargs (dict, optional) – Keyword arguments to pass into rank.get_label_quality_scores.

num_issue_kwargs (dict, optional) – Keyword arguments to pass into count.num_label_issues to control estimation of the number of label issues. The only supported kwarg here for now is: estimation_method.
- Return type: ndarray
- Returns: issue_indices (np.ndarray) – Indices of examples with label issues, sorted by label quality score.
Examples
>>> batch_size = 10000  # for efficiency, set this to as large of a value as your memory can handle
>>> # Just demonstrating how to save your existing numpy labels, pred_probs arrays to compatible .npy files:
>>> np.save("LABELS.npy", labels_array)
>>> np.save("PREDPROBS.npy", pred_probs_array)
>>> # You can load these back into memmap arrays via: labels = np.load("LABELS.npy", mmap_mode="r")
>>> # and then run this method on the memmap arrays, or just run it directly on the .npy files like this:
>>> issues = find_label_issues_batched(labels_file="LABELS.npy", pred_probs_file="PREDPROBS.npy", batch_size=batch_size)
>>> # This method also works with Zarr arrays:
>>> import zarr
>>> # Just demonstrating how to save your existing numpy labels, pred_probs arrays to compatible .zarr files:
>>> zarr.convenience.save_array("LABELS.zarr", labels_array)
>>> zarr.convenience.save_array("PREDPROBS.zarr", pred_probs_array)
>>> # You can load from such files into Zarr arrays:
>>> labels = zarr.convenience.open("LABELS.zarr", mode="r")
>>> pred_probs = zarr.convenience.open("PREDPROBS.zarr", mode="r")
>>> # This method can be directly run on Zarr arrays, memmap arrays, or regular numpy arrays:
>>> issues = find_label_issues_batched(labels=labels, pred_probs=pred_probs, batch_size=batch_size)
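Because the returned indices are sorted by label quality score, the most severe issues come first. A short follow-up sketch (assuming your labels fit in memory for inspection):
>>> worst_issues = issues[:10]  # indices of the 10 most severe label issues
>>> for idx in worst_issues:
>>>     print(idx, labels[idx])  # review the given label for each flagged example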
- class cleanlab.experimental.label_issues_batched.LabelInspector(*, num_class, store_results=True, verbose=True, quality_score_kwargs=None, num_issue_kwargs=None, n_jobs=1)[source]#
Bases: object
Class for finding label issues in big datasets where memory becomes a problem for other cleanlab methods. Only create one such object per dataset, and do not reuse the same LabelInspector across 2 datasets. For efficiency, this class does little input checking. You can first run filter.find_label_issues on a small subset of your data to verify your inputs are properly formatted. Do NOT modify any of the attributes of this class yourself! Multi-label classification is not supported by this class; it is only for multi-class classification.
The recommended usage, demonstrated in the example script below, involves two passes over your data: one pass to compute confident_thresholds, another to evaluate each label. To maximize efficiency, try to use the largest batch_size your memory allows. To reduce runtime further, you can run the first pass on a subset of your dataset, as long as it contains enough data from each class to estimate confident_thresholds accurately.
In the example script below:
- labels is a (big) 1D np.ndarray of class labels represented as integers in 0,1,...,K-1.
- pred_probs is a (big) 2D np.ndarray of predicted class probabilities, where each row is an example and each column represents a class.
labels and pred_probs can instead be stored in a file, where you load chunks of them at a time. Methods to load arrays in chunks include: np.load(..., mmap_mode="r"), numpy.memmap(), HDF5 or Zarr files; see: https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/
Examples
>>> n = len(labels)
>>> batch_size = 10000  # you can change this in between batches, set as big as your RAM allows
>>> lab = LabelInspector(num_class = pred_probs.shape[1])
>>> # First compute confident thresholds (for faster results, can also do this on a random subset of your data):
>>> i = 0
>>> while i < n:
>>>     end_index = i + batch_size
>>>     labels_batch = labels[i:end_index]
>>>     pred_probs_batch = pred_probs[i:end_index,:]
>>>     i = end_index
>>>     lab.update_confident_thresholds(labels_batch, pred_probs_batch)
>>> # See what we calculated:
>>> confident_thresholds = lab.get_confident_thresholds()
>>> # Evaluate the quality of the labels (run this on the full dataset you want to evaluate):
>>> i = 0
>>> while i < n:
>>>     end_index = i + batch_size
>>>     labels_batch = labels[i:end_index]
>>>     pred_probs_batch = pred_probs[i:end_index,:]
>>>     i = end_index
>>>     batch_results = lab.score_label_quality(labels_batch, pred_probs_batch)
>>> # Indices of examples with label issues, sorted by label quality score (most severe to least severe):
>>> indices_of_examples_with_issues = lab.get_label_issues()
>>> # If your `pred_probs` and `labels` are arrays already in memory,
>>> # then you can use this shortcut for all of the above:
>>> indices_of_examples_with_issues = find_label_issues_batched(labels, pred_probs, batch_size=10000)
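As mentioned above, the first pass can be run on a subset of the dataset to reduce runtime. A minimal sketch of that variant, assuming the rows of your dataset are already in random order and that subset_size (a value you choose) still retains enough examples from every class:
>>> subset_size = 100000  # hypothetical value; must keep enough data per class
>>> i = 0
>>> while i < subset_size:
>>>     end_index = min(i + batch_size, subset_size)
>>>     lab.update_confident_thresholds(labels[i:end_index], pred_probs[i:end_index,:])
>>>     i = end_index
>>> # Then run the label-quality scoring pass over the full dataset as shown above.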
- Parameters:
num_class (int) – The number of classes in your multi-class classification task.

store_results (bool, optional) – Whether this object will store all label quality scores, a 1D array of shape (N,) where N is the total number of examples in your dataset. Set this to False if you encounter memory problems even for small batch sizes (~1000). If False, you can still identify the label issues yourself by aggregating the label quality scores for each batch, sorting them across all batches, and returning the top T indices with T = self.get_num_issues() (see the sketch after this list).

verbose (bool, optional) – Whether to print status updates while this object runs (set False to suppress them).

n_jobs (int, optional) – Number of processes for multiprocessing (default value = 1). Only used on Linux. If n_jobs=None, will use either the number of physical cores (if psutil is installed) or the number of logical cores (otherwise).

quality_score_kwargs (dict, optional) – Keyword arguments to pass into rank.get_label_quality_scores.

num_issue_kwargs (dict, optional) – Keyword arguments to pass into count.num_label_issues to control estimation of the number of label issues. The only supported kwarg here for now is: estimation_method.
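Here is a minimal sketch of that store_results=False workflow, assuming the confident thresholds are first computed via a pass with lab.update_confident_thresholds() as in the class-level example (all variable names are illustrative):
>>> lab = LabelInspector(num_class=pred_probs.shape[1], store_results=False)
>>> # ... first pass: lab.update_confident_thresholds() over all batches, as shown earlier ...
>>> per_batch_scores = []
>>> i = 0
>>> while i < n:
>>>     end_index = i + batch_size
>>>     per_batch_scores.append(lab.score_label_quality(labels[i:end_index], pred_probs[i:end_index,:]))
>>>     i = end_index
>>> all_scores = np.concatenate(per_batch_scores)  # aggregate scores across batches
>>> T = lab.get_num_issues()
>>> issue_indices = np.argsort(all_scores)[:T]  # top T lowest-quality (most severe) examples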
Methods:

- get_confident_thresholds([silent]) – Fetches already-computed confident thresholds from the data seen so far, in the same format as: count.get_confident_thresholds.
- get_num_issues([silent]) – Fetches already-computed estimate of the number of label issues in the data seen so far, in the same format as: count.num_label_issues.
- get_quality_scores() – Fetches already-computed estimate of the label quality of each example seen so far, in the same format as: rank.get_label_quality_scores.
- get_label_issues() – Fetches already-computed estimate of indices of examples with label issues in the data seen so far, in the same format as: filter.find_label_issues with its return_indices_ranked_by argument specified.
- update_confident_thresholds(labels, pred_probs) – Updates the estimate of confident_thresholds stored in this class using a new batch of data.
- score_label_quality(labels, pred_probs, *[, ...]) – Scores the label quality of each example in the provided batch of data, and also updates the number of label issues stored in this class.
- get_confident_thresholds(silent=False)[source]#
Fetches already-computed confident thresholds from the data seen so far, in the same format as: count.get_confident_thresholds.
- Return type: ndarray
- Returns: confident_thresholds (np.ndarray) – An array of shape (K,) where K is the number of classes.
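For example, after the first pass you might sanity-check the thresholds; on data that also fits in memory, they should closely match the non-batched computation (a sketch):
>>> thresholds = lab.get_confident_thresholds()
>>> assert thresholds.shape == (pred_probs.shape[1],)  # one threshold per class
>>> from cleanlab.count import get_confident_thresholds
>>> reference = get_confident_thresholds(labels, pred_probs)  # only feasible for in-memory arrays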
- get_num_issues(silent=False)[source]#
Fetches already-computed estimate of the number of label issues in the data seen so far, in the same format as: count.num_label_issues.
Note: The estimated number of issues may differ from count.num_label_issues by 1 due to rounding differences.
- Return type: int
- Returns: num_issues (int) – The estimated number of examples with label issues in the data seen so far.
- get_quality_scores()[source]#
Fetches already-computed estimate of the label quality of each example seen so far, in the same format as: rank.get_label_quality_scores.
- Return type: ndarray
- Returns: label_quality_scores (np.ndarray) – Contains one score (between 0 and 1) per example seen so far. Lower scores indicate more likely mislabeled examples.
- get_label_issues()[source]#
Fetches already-computed estimate of indices of examples with label issues in the data seen so far, in the same format as: filter.find_label_issues with its return_indices_ranked_by argument specified.
Note: this method corresponds to filter.find_label_issues(..., filter_by=METHOD1, return_indices_ranked_by=METHOD2) where by default: METHOD1="low_self_confidence", METHOD2="self_confidence", or, if this object was instantiated with quality_score_kwargs = {"method": "normalized_margin"}, then we instead have: METHOD1="low_normalized_margin", METHOD2="normalized_margin".
Note: The estimated number of issues may differ from filter.find_label_issues by 1 due to rounding differences.
- Return type: ndarray
- Returns: issue_indices (np.ndarray) – Indices of examples with label issues, sorted by label quality score.
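For instance, to obtain issue indices ranked by the normalized-margin score instead (as the note above describes), instantiate the inspector accordingly; a sketch:
>>> lab = LabelInspector(num_class=pred_probs.shape[1],
>>>     quality_score_kwargs={"method": "normalized_margin"})
>>> # ... run both passes over the data as in the class-level example ...
>>> issues = lab.get_label_issues()
>>> # now corresponds to METHOD1="low_normalized_margin", METHOD2="normalized_margin"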
- update_confident_thresholds(labels, pred_probs)[source]#
Updates the estimate of confident_thresholds stored in this class using a new batch of data. Inputs should be in the same format as for: count.get_confident_thresholds.
- Parameters:
labels (np.ndarray or list) – Given class labels for each example in the batch, values in 0,1,2,...,K-1.
pred_probs (np.ndarray) – 2D array of model-predicted class probabilities for each example in the batch.
- score_label_quality(labels, pred_probs, *, update_num_issues=True)[source]#
Scores the label quality of each example in the provided batch of data, and also updates the number of label issues stored in this class. Inputs should be in the same format as for: rank.get_label_quality_scores.
- Parameters:
labels (np.ndarray) – Given class labels for each example in the batch, values in 0,1,2,...,K-1.
pred_probs (np.ndarray) – 2D array of model-predicted class probabilities for each example in the batch of data.
update_num_issues (bool, optional) – Whether to update the number of label issues or only compute the label quality scores. For lower runtimes, set this to False if you only want to score label quality and not find label issues.
- Return type: ndarray
- Returns: label_quality_scores (np.ndarray) – Contains one score (between 0 and 1) for each example in the batch of data.
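For example, if you only need per-example quality scores (not the number or indices of label issues), skipping the issue-count update reduces runtime; a sketch:
>>> scores_batch = lab.score_label_quality(labels_batch, pred_probs_batch, update_num_issues=False)
>>> # one score in [0, 1] per example in the batch; lower scores indicate likelier mislabeling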