cleanlab documentation

cleanlab automatically finds and fixes label issues in your ML datasets.

This reduces manual work needed to fix data errors and helps train reliable ML models on noisy real-world datasets. cleanlab has already found thousands of label errors in ImageNet, MNIST, and other popular ML benchmarking datasets, so let’s get started with yours!

Quickstart

1. Install cleanlab

pip install cleanlab

2. Find label errors in your data

cleanlab’s find_label_issues function tells you which examples in your dataset are likely mislabeled. At a minimum, it expects two inputs: your data’s given labels, labels, and predicted probabilities, pred_probs, from some trained classification model. These predictions must be out-of-sample, i.e. made on data points that were held out from the model during training, which you can achieve via cross-validation.

Setting return_indices_ranked_by in this function instructs cleanlab to return the indices of potentially mislabeled examples, ordered by how likely each given label is to be incorrect. This likelihood is estimated via a label quality score, which can for example be specified as 'self_confidence' (the model’s predicted probability of the given label).

from cleanlab.filter import find_label_issues

ordered_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by='self_confidence',
)
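
The returned indices can be used directly to inspect the most suspect examples. A minimal sketch (assuming labels is array-like; any DataFrame or list holding your raw examples can be indexed the same way):

import numpy as np

# Look at the 5 examples most likely to be mislabeled,
# along with the labels they were given.
top_issues = ordered_label_issues[:5]  # indices ranked worst-first
print("Given labels of top suspects:", np.asarray(labels)[top_issues])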

Important

The predicted probabilities, pred_probs, from your model must be out-of-sample! You should never provide predictions on the same data points that were used to train the model, as these predictions are overfit and unsuitable for finding label errors. To compute out-of-sample predicted probabilities for your entire dataset, you can use cross-validation.
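
For example, scikit-learn’s cross_val_predict can produce out-of-sample predicted probabilities for every example in one call. A minimal sketch, assuming your features X and labels labels are already in memory and using a logistic regression model purely for illustration:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Each row of pred_probs comes from a model that never saw that example
# during training (5-fold cross-validation).
model = LogisticRegression(max_iter=1000)
pred_probs = cross_val_predict(model, X, labels, cv=5, method="predict_proba")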

3. Train robust models with noisy labels

cleanlab’s CleanLearning class adapts any existing (scikit-learn compatible) classification model, clf, to a more reliable one by allowing it to train directly on partially mislabeled datasets.

When the .fit() method is called, it automatically removes any examples identified as “noisy” in the provided dataset and returns a model trained only on the clean data.

from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning

clf = LogisticRegression() # any classifier implementing the sklearn API
cl = CleanLearning(clf=clf)
cl.fit(X=X, labels=labels)
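
Because CleanLearning follows the scikit-learn API, the fitted cl can be used like any other classifier. A short usage sketch, where X_test is a hypothetical held-out feature array:

# Predict with the model that was trained on the auto-cleaned data.
predictions = cl.predict(X_test)

# The label issues detected during fitting are stored for later inspection.
issues = cl.get_label_issues()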

4. Dataset curation: fix dataset-level issues

cleanlab’s dataset module helps you deal with dataset-level issues by finding overlapping classes (classes to merge), ranking class-level label quality (classes to keep/delete), and measuring overall dataset health (to track dataset quality as you make adjustments).

The example below shows how to view all dataset-level issues in one line of code with dataset.health_summary(). Check out the dataset tutorial for more examples.

from cleanlab.dataset import health_summary

health_summary(labels, pred_probs, class_names=class_names)
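
Beyond the printed report, the results can be kept for programmatic use. A minimal sketch, assuming here that health_summary returns its summaries as a dict (check the return value in your installed version):

# Store the returned summaries (e.g. per-class quality and class-overlap tables)
# rather than only printing the report.
summary = health_summary(labels, pred_probs, class_names=class_names)
print(summary.keys())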