cleanlab documentation#
cleanlab automatically detects data and label issues in your ML datasets.
Quickstart#
1. Install cleanlab#
Using pip:
pip install cleanlab
To install the package with all optional dependencies:
pip install "cleanlab[all]"
Using conda:
conda install -c cleanlab cleanlab
To install the latest development version from source:
pip install git+https://github.com/cleanlab/cleanlab.git
To install the package with all optional dependencies:
pip install "git+https://github.com/cleanlab/cleanlab.git#egg=cleanlab[all]"
2. Find common issues in your data#
cleanlab automatically detects various issues in any dataset that a classifier can be trained on. The cleanlab package works with any ML model by operating on model outputs (predicted class probabilities or feature embeddings); it doesn’t require that a particular model created those outputs. For any classification dataset, use your trained model to produce pred_probs (predicted class probabilities) and/or feature_embeddings (numeric vector representations of each datapoint). Then, these few lines of code can detect common real-world issues in your dataset like label errors, outliers, near duplicates, etc.:
from cleanlab import Datalab
lab = Datalab(data=your_dataset, label_name="column_name_of_labels")
lab.find_issues(features=feature_embeddings, pred_probs=pred_probs)
lab.report() # summarize issues in dataset, how severe they are, ...
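After the report, you can inspect the detailed results for any issue type programmatically. A minimal sketch, assuming the lab object from above (the exact column names, such as is_label_issue and label_score, may vary across cleanlab versions, so check the Datalab documentation):
# Per-example results for one issue type, returned as a pandas DataFrame
label_issues = lab.get_issues("label")
# Sort so the examples most likely to be mislabeled come first
label_issues.sort_values("label_score").head()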
3. Handle label errors and train robust models with noisy labels#
Mislabeled data is a particularly concerning issue plaguing real-world datasets. With a scikit-learn-compatible classification model, you don’t need to train the model yourself to find label issues: just pass the untrained model object, data, and labels into CleanLearning.find_label_issues and cleanlab will handle model training for you.
from cleanlab.classification import CleanLearning
# This works with any sklearn-compatible model - just input data + labels and cleanlab will detect label issues ツ
label_issues_info = CleanLearning(clf=sklearn_compatible_model).find_label_issues(data, labels)
CleanLearning also works with models from most standard ML frameworks by wrapping the model for scikit-learn compliance, e.g. tensorflow/keras (using our KerasWrapperModel), pytorch (using the skorch package), etc.
find_label_issues returns a boolean mask flagging which examples have label issues, along with a numeric label quality score for each example quantifying our confidence that its label is correct.
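A typical next step is to review the most suspect examples first. A rough sketch, assuming recent cleanlab versions where the returned DataFrame includes is_label_issue and label_quality columns (verify against your installed version):
# Keep only flagged examples and rank them by label quality (lowest = most likely mislabeled)
suspect = label_issues_info.query("is_label_issue").sort_values("label_quality")
print(suspect.head(10))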
Beyond standard classification tasks, cleanlab can also detect mislabeled examples in: multi-label data (e.g. image/document tagging), sequence prediction (e.g. entity recognition), and data labeled by multiple annotators (e.g. crowdsourcing).
Important
Cleanlab performs better if the pred_probs from your model are out-of-sample. Details on how to compute out-of-sample predicted probabilities for your entire dataset are here.
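For scikit-learn compatible models, one standard way to obtain out-of-sample pred_probs is K-fold cross-validation. A minimal sketch (the model, data, and labels names are placeholders for your own objects):
from sklearn.model_selection import cross_val_predict

# Each example's probabilities come from a model that never saw that example during training
pred_probs = cross_val_predict(sklearn_compatible_model, data, labels, cv=5, method="predict_proba")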
cleanlab’s CleanLearning class trains a more robust version of any existing (scikit-learn compatible) classification model, clf, by fitting it to an automatically filtered version of your dataset with low-quality data removed. It returns a model trained only on the clean data, from which you can get predictions in the same way as from your existing classifier.
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning
cl = CleanLearning(clf=LogisticRegression()) # any sklearn-compatible classifier
cl.fit(train_data, labels)
# Estimate the predictions you would have gotten if you trained without mislabeled data
predictions = cl.predict(test_data)
4. Dataset curation: fix dataset-level issues#
cleanlab’s dataset module helps you deal with dataset-level issues: find overlapping classes (classes to merge), rank class-level label quality (classes to keep/delete), and measure overall dataset health (to track dataset quality as you make adjustments).
View all dataset-level issues in one line of code with dataset.health_summary().
from cleanlab.dataset import health_summary
health_summary(labels, pred_probs, class_names=class_names)
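You can also call the individual functions behind this summary. A sketch assuming the same labels and pred_probs arrays (see the cleanlab.dataset documentation for the exact return formats):
from cleanlab.dataset import find_overlapping_classes, rank_classes_by_label_quality

# Pairs of classes the model frequently confuses (candidates to merge)
overlapping_classes = find_overlapping_classes(labels=labels, pred_probs=pred_probs)
# Classes ranked by estimated label quality (candidates to keep or delete)
class_quality = rank_classes_by_label_quality(labels=labels, pred_probs=pred_probs)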
5. Improve your data via many other techniques#
Beyond handling label errors, cleanlab supports other data-centric AI capabilities including:
Detecting outliers and out-of-distribution examples in both training and future test data (tutorial); a minimal sketch appears after this list
Analyzing data labeled by multiple annotators to estimate consensus labels and their quality (tutorial)
Active learning with multiple annotators to identify which data is most informative to label or re-label next (tutorial)
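As a taste of the outlier detection mentioned in the first bullet, here is a minimal sketch using cleanlab’s OutOfDistribution class (assuming you already have feature_embeddings for your data; it can alternatively score pred_probs):
from cleanlab.outlier import OutOfDistribution

ood = OutOfDistribution()
# Lower scores indicate examples that look less like the rest of the dataset (likely outliers)
outlier_scores = ood.fit_score(features=feature_embeddings)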
If you have questions, check out our FAQ and feel free to ask in Slack!
Contributing#
As cleanlab is an open-source project, we welcome contributions from the community.
Please see our contributing guidelines for more information.