FAQ

Answers to frequently asked questions about the cleanlab open-source package.

What data can cleanlab detect issues in?

Currently, cleanlab can be used to detect label issues in any classification dataset, including those involving multiple annotators per example (multi-annotator) or multiple labels per example (multi-label). This includes data from any modality, such as image, text, tabular, audio, etc. For text data, cleanlab also supports NLP tasks like entity recognition in which each word is individually labeled (token classification). We’re working to add support for all other common supervised learning tasks. If you have a particular task in mind, let us know!

How do I format classification labels for cleanlab?

cleanlab only works with integer-encoded labels in the range {0, 1, ..., K-1}, where K = number_of_classes. The labels array should contain only integer values in that range and be of shape (N,), where N = total_number_of_data_points. Do not pass in labels where some classes are entirely missing or extremely rare, as cleanlab may not perform as expected. It is better to remove such classes from the dataset entirely first (also dropping the corresponding columns from pred_probs and then renormalizing its rows).
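
For instance, here is a minimal sketch of dropping a rare class before running cleanlab, using tiny toy labels and pred_probs invented for illustration:

import numpy as np

# Toy example: 5 data points, 3 classes, where class 2 is extremely rare
labels = np.array([0, 1, 0, 2, 1])
pred_probs = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.6, 0.3, 0.1],
                       [0.2, 0.2, 0.6],
                       [0.3, 0.6, 0.1]])

rare_class = 2
keep = labels != rare_class
labels = labels[keep]
pred_probs = np.delete(pred_probs[keep], rare_class, axis=1)
pred_probs = pred_probs / pred_probs.sum(axis=1, keepdims=True)  # renormalize rows to sum to 1
# If the dropped class were not the last one, you would also shift the higher
# labels down by 1 so the remaining labels stay in {0, 1, ..., K-2}.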

Text or string labels should be mapped to integers, with one integer per possible value. For example, if your original data labels look like this: ["dog", "dog", "cat", "mouse", "cat"], you should feed them to cleanlab like this: labels = [1, 1, 0, 2, 0], keeping track of which integer uniquely represents which class (classes were ordered alphabetically in this example).
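
As a minimal sketch, np.unique can produce this mapping, since it sorts the classes alphabetically:

import numpy as np

str_labels = ["dog", "dog", "cat", "mouse", "cat"]
# return_inverse gives each label's integer code under the sorted class order
classes, labels = np.unique(str_labels, return_inverse=True)
print(classes)  # ['cat' 'dog' 'mouse']
print(labels)   # [1 1 0 2 0]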

One-hot encoded labels should be integer-encoded by finding the argmax along the one-hot encoded axis. An example of what this might look like is shown below.

[2]:
import numpy as np

# This example arr has 4 labels (one per data point) where
# each label can be one of 3 possible classes

arr = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]])
labels_proper_format = np.argmax(arr, axis=1)  # how labels should be formatted when passed into cleanlab

How do I infer the correct labels for examples cleanlab has flagged?

If you have a classifier that is compatible with CleanLearning (i.e. it follows the sklearn API), here’s an easy way to see predicted labels alongside the label issues:

[3]:
cl = cleanlab.classification.CleanLearning(your_classifier)
issues_dataframe = cl.find_label_issues(data, labels)

Alternatively, if you have already computed out-of-sample predicted probabilities (pred_probs) from a classifier:

[4]:
cl = cleanlab.classification.CleanLearning()
issues_dataframe = cl.find_label_issues(X=None, labels=labels, pred_probs=pred_probs)

If you have already found issues via:

[5]:
issues = cleanlab.filter.find_label_issues(labels, pred_probs)

then you can see your trained classifier’s class prediction for each flagged example via:

[6]:
class_predicted_for_flagged_examples = pred_probs[issues].argmax(axis=1)

Similarly, you can see the classifier’s class prediction for every example via:

[7]:
class_predicted_for_all_examples = pred_probs.argmax(axis=1)

We caution against blindly taking these predicted labels for granted; many of these suggestions may be wrong! You will be able to produce a much better version of your dataset interactively using Cleanlab Studio, which helps you efficiently fix issues like this in large datasets.

Why isn’t CleanLearning working for me?

At this time, CleanLearning only works with data formatted as numpy arrays or pd.DataFrames, and with models that are compatible with the sklearn API (check out skorch for PyTorch compatibility and scikeras for TensorFlow/Keras compatibility). You can still use cleanlab with other data formats though! Just separately obtain predicted probabilities (pred_probs) from your model via cross-validation and pass them as inputs.
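
For example, here is a minimal sketch of obtaining out-of-sample pred_probs with sklearn’s cross_val_predict, using toy data invented for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# Toy data standing in for your own dataset
data = np.vstack([np.random.random((100, 2)), np.random.random((100, 2)) + 10])
labels = np.array([0] * 100 + [1] * 100)

# Out-of-sample predicted probabilities via 5-fold cross-validation
pred_probs = cross_val_predict(LogisticRegression(), data, labels,
                               cv=5, method="predict_proba")
issues = find_label_issues(labels, pred_probs)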

If CleanLearning is running successfully but not improving predictive accuracy of your model, here are some tips:

  1. Use cleanlab to find label issues in your test data as well (we recommend pooling labels across both training and test data into one input for find_label_issues()). Then manually review and fix label issues identified in the test data to verify accuracy measurements are actually meaningful.

  2. Try different values for filter_by, frac_noise, and min_examples_per_class, which can be set via the find_label_issues_kwargs argument in the initialization of CleanLearning() (see the sketch after this list).

  3. Try to find a better model (e.g. via hyperparameter tuning or changing to another classifier). CleanLearning can detect label issues more accurately when it leverages a better model, which in turn yields higher-quality training data. This can form a virtuous cycle in which better models -> better issue detection -> better data -> even better models!

  4. Try jointly tuning both model hyperparameters and find_label_issues_kwargs values.

  5. Does your dataset have a junk (or clutter, unknown, other) class? If you have bad data, consider creating one (cf. Caltech-256).

  6. Consider merging similar/overlapping classes found via cleanlab.dataset.find_overlapping_classes (also sketched after this list).
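
To make tips 2 and 6 concrete, here is a minimal sketch that reuses the toy-data pattern from the examples further below; the filter values shown are arbitrary placeholders, not recommendations:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab.classification import CleanLearning
from cleanlab.dataset import find_overlapping_classes

# Toy data standing in for your own dataset
data = np.vstack([np.random.random((100, 2)), np.random.random((100, 2)) + 10])
labels = np.array([0] * 100 + [1] * 100)

# Tip 2: pass filter settings through find_label_issues_kwargs
cl = CleanLearning(
    LogisticRegression(),
    find_label_issues_kwargs={
        "filter_by": "both",
        "frac_noise": 0.9,
        "min_examples_per_class": 10,
    },
)
issues = cl.find_label_issues(data, labels)

# Tip 6: inspect which classes overlap and consider merging them
pred_probs = cross_val_predict(LogisticRegression(), data, labels, cv=5, method="predict_proba")
overlaps = find_overlapping_classes(labels=labels, pred_probs=pred_probs)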

Other general tips to improve label error detection performance:

  1. Try creating new, more restrictive filters by intersecting existing ones (e.g. combined_boolean_mask = mask1 & mask2, where mask1 and mask2 are the boolean masks created by running find_label_issues with different values of the filter_by argument), as sketched after this list.

  2. If your pred_probs are obtained via a neural network, try averaging the pred_probs over the last K epochs of training instead of just using the final pred_probs. Similarly, you can try averaging pred_probs from several models (remember to re-normalize) or using cleanlab.rank.get_label_quality_ensemble_scores.
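
Here is a minimal sketch of both tips, again with toy data; out-of-sample pred_probs from two different models stand in for your own ensemble:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues
from cleanlab.rank import get_label_quality_ensemble_scores

# Toy data standing in for your own dataset
data = np.vstack([np.random.random((100, 2)), np.random.random((100, 2)) + 10])
labels = np.array([0] * 100 + [1] * 100)

# Out-of-sample pred_probs from two different models
pred_probs_a = cross_val_predict(LogisticRegression(), data, labels, cv=5, method="predict_proba")
pred_probs_b = cross_val_predict(GradientBoostingClassifier(), data, labels, cv=5, method="predict_proba")

# Tip 1: intersect the masks produced by different filter_by settings
mask1 = find_label_issues(labels, pred_probs_a, filter_by="prune_by_noise_rate")
mask2 = find_label_issues(labels, pred_probs_a, filter_by="prune_by_class")
combined_boolean_mask = mask1 & mask2

# Tip 2: average pred_probs across models (renormalizing rows), or ensemble the quality scores
avg_pred_probs = (pred_probs_a + pred_probs_b) / 2
avg_pred_probs = avg_pred_probs / avg_pred_probs.sum(axis=1, keepdims=True)
scores = get_label_quality_ensemble_scores(labels, [pred_probs_a, pred_probs_b])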

How can I use different models for data cleaning vs. final training in CleanLearning?

The code below demonstrates CleanLearning with two different classifiers: LogisticRegression() and GradientBoostingClassifier(). A LogisticRegression model is used to detect label issues (via cross-validation run inside CleanLearning), and a GradientBoostingClassifier model is then trained on a clean subset of the data with issues removed. This can be done with any two classifiers.

[8]:
from cleanlab.classification import CleanLearning
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# Make example data
data = np.vstack([np.random.random((100, 2)), np.random.random((100, 2)) + 10])
labels = np.array([0] * 100 + [1] * 100)

# Introduce label errors
true_errors = [97, 98, 100, 101, 102, 104]
for idx in true_errors:
    labels[idx] = 1 - labels[idx]

# CleanLearning with 2 different classifiers: one classifier is used to detect label issues
# and a different classifier is subsequently trained on the clean subset of the data.

model_to_find_errors = LogisticRegression()  # this model will be trained many times via cross-validation
model_to_return = GradientBoostingClassifier()  # this model will be trained once on clean subset of data

cl0 = CleanLearning(model_to_find_errors)
issues = cl0.find_label_issues(data, labels)

cl = CleanLearning(model_to_return).fit(data, labels, label_issues=issues)
pred_probs = cl.predict_proba(data)  # predictions from GradientBoostingClassifier

print(cl0.clf)  # will be LogisticRegression()
print(cl.clf)  # will be GradientBoostingClassifier()
LogisticRegression()
GradientBoostingClassifier()

How do I hyperparameter tune only the final model trained (and not the one finding label issues) in CleanLearning?

The code below demonstrates CleanLearning using a GradientBoostingClassifier() with no hyperparameter-tuning to find label issues, but with hyperparameter-tuning via RandomizedSearchCV(...) for the final training of this model on the clean subset of the data. This is a useful trick to avoid expensive hyperparameter-tuning in every fold of the cross-validation used to find label issues.

[9]:
import numpy as np
from cleanlab.classification import CleanLearning
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Make example data
data = np.vstack([np.random.random((100, 2)), np.random.random((100, 2)) + 10])
labels = np.array([0] * 100 + [1] * 100)

# Introduce label errors
true_errors = [97, 98, 100, 101, 102, 104]
for idx in true_errors:
    labels[idx] = 1 - labels[idx]

# CleanLearning with no hyperparameter-tuning during expensive cross-validation to find label issues
# but hyperparameter-tuning for the final training of model on clean subset of the data:

model_to_find_errors = GradientBoostingClassifier()  # this model will be trained many times via cross-validation
model_to_return = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions={
        "learning_rate": [0.001, 0.05, 0.1, 0.2, 0.5],
        "max_depth": [3, 5, 10],
    },
)  # this model will be trained once on clean subset of data

cl0 = CleanLearning(model_to_find_errors)
issues = cl0.find_label_issues(data, labels)

cl = CleanLearning(model_to_return).fit(data, labels, label_issues=issues)  # final training of the hyperparameter-tuned model on the clean subset of the data
pred_probs = cl.predict_proba(data)  # predictions from hyperparameter-tuned GradientBoostingClassifier

print(cl0.clf)  # will be GradientBoostingClassifier()
print(cl.clf)  # will be RandomizedSearchCV(estimator=GradientBoostingClassifier(),...)
GradientBoostingClassifier()
RandomizedSearchCV(estimator=GradientBoostingClassifier(),
                   param_distributions={'learning_rate': [0.001, 0.05, 0.1, 0.2,
                                                          0.5],
                                        'max_depth': [3, 5, 10]})

What license is cleanlab open-sourced under?

AGPL-3.0 license

What does this mean? If you’re working at a company, you can use this open-source library to clean up your internal datasets. You can also use this open-source library to clean up a dataset used to train a model that is deployed in a commercial product. For non-commercial purposes, feel free to release altered versions of the source code as long as you include the same license.

Please email info@cleanlab.ai to discuss licensing needs if you would like to offer a commercial product that utilizes any cleanlab source code.

Can’t find an answer to your question?

If your question is not addressed in these tutorials, please refer to the Cleanlab GitHub issues, the Cleanlab Code Examples, or our Slack Community.

If your question is not addressed anywhere, please open a new GitHub issue. Our developers may also provide personalized assistance in our Slack Community.