cleanlab documentation#
cleanlab automatically detects data and label issues in your ML datasets.
Quickstart#
1. Install cleanlab#
Using pip:
pip install cleanlab
To install the package with all optional dependencies:
pip install "cleanlab[all]"
Using conda:
conda install -c cleanlab cleanlab
To install the latest development version from source:
pip install git+https://github.com/cleanlab/cleanlab.git
To install the package with all optional dependencies:
pip install "git+https://github.com/cleanlab/cleanlab.git#egg=cleanlab[all]"
2. Find common issues in your data#
cleanlab automatically detects various issues in any dataset that a classifier can be trained on. The cleanlab package works with any ML model by operating on model outputs (predicted class probabilities or feature embeddings); it doesn’t require that a particular model created those outputs. For any classification dataset, use your trained model to produce pred_probs (predicted class probabilities) and/or feature_embeddings (numeric vector representations of each datapoint). Then, these few lines of code can detect common real-world issues in your dataset like label errors, outliers, near duplicates, etc.:
from cleanlab import Datalab
lab = Datalab(data=your_dataset, label_name="column_name_of_labels")
lab.find_issues(features=feature_embeddings, pred_probs=pred_probs)
lab.report() # summarize issues in dataset, how severe they are, ...
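After the report, you can inspect the detailed results for any issue type programmatically. A minimal sketch, assuming the lab object from above (the exact column names, such as is_label_issue and label_score, may vary across cleanlab versions, so check the Datalab documentation):
# Per-example results for one issue type, returned as a pandas DataFrame
label_issues = lab.get_issues("label")
# Sort so the examples most likely to be mislabeled come first
label_issues.sort_values("label_score").head()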
3. Handle label errors and train robust models with noisy labels#
Mislabeled data is a particularly concerning issue plaguing real-world datasets. With a scikit-learn-compatible classification model, you don’t need to train the model yourself to find label issues: just pass the untrained model object, data, and labels into CleanLearning.find_label_issues and cleanlab will handle model training for you.
from cleanlab.classification import CleanLearning
# This works with any sklearn-compatible model - just input data + labels and cleanlab will detect label issues ツ
label_issues_info = CleanLearning(clf=sklearn_compatible_model).find_label_issues(data, labels)
CleanLearning also works with models from most standard ML frameworks by wrapping the model for scikit-learn compliance, e.g. tensorflow/keras (using our KerasWrapperModel), pytorch (using the skorch package), etc.
find_label_issues returns a boolean mask flagging which examples have label issues, along with a numeric label quality score for each example quantifying our confidence that its label is correct.
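A typical next step is to review the most suspect examples first. A rough sketch, assuming recent cleanlab versions where the returned DataFrame includes is_label_issue and label_quality columns (verify against your installed version):
# Keep only flagged examples and rank them by label quality (lowest = most likely mislabeled)
suspect = label_issues_info.query("is_label_issue").sort_values("label_quality")
print(suspect.head(10))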
Beyond standard classification tasks, cleanlab can also detect mislabeled examples in: multi-label data (e.g. image/document tagging), sequence prediction (e.g. entity recognition), and data labeled by multiple annotators (e.g. crowdsourcing).
Important
Cleanlab performs better if the pred_probs from your model are out-of-sample. Details on how to compute out-of-sample predicted probabilities for your entire dataset are here.
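For scikit-learn compatible models, one standard way to obtain out-of-sample pred_probs is K-fold cross-validation. A minimal sketch (the model, data, and labels names are placeholders for your own objects):
from sklearn.model_selection import cross_val_predict

# Each example's probabilities come from a model that never saw that example during training
pred_probs = cross_val_predict(sklearn_compatible_model, data, labels, cv=5, method="predict_proba")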
cleanlab’s CleanLearning class trains a more robust version of any existing (scikit-learn compatible) classification model, clf, by fitting it to an automatically filtered version of your dataset with low-quality data removed. It returns a model trained only on the clean data, from which you can get predictions in the same way as from your existing classifier.
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning
cl = CleanLearning(clf=LogisticRegression()) # any sklearn-compatible classifier
cl.fit(train_data, labels)
# Estimate the predictions you would have gotten if you trained without mislabeled data
predictions = cl.predict(test_data)
4. Dataset curation: fix dataset-level issues#
cleanlab’s dataset module helps you deal with dataset-level issues: find overlapping classes (classes to merge), rank class-level label quality (classes to keep/delete), and measure overall dataset health (to track dataset quality as you make adjustments).
View all dataset-level issues in one line of code with dataset.health_summary().
from cleanlab.dataset import health_summary
health_summary(labels, pred_probs, class_names=class_names)
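You can also call the individual functions behind this summary. A sketch assuming the same labels and pred_probs arrays (see the cleanlab.dataset documentation for the exact return formats):
from cleanlab.dataset import find_overlapping_classes, rank_classes_by_label_quality

# Pairs of classes the model frequently confuses (candidates to merge)
overlapping_classes = find_overlapping_classes(labels=labels, pred_probs=pred_probs)
# Classes ranked by estimated label quality (candidates to keep or delete)
class_quality = rank_classes_by_label_quality(labels=labels, pred_probs=pred_probs)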
5. Improve your data via many other techniques#
Beyond handling label errors, cleanlab supports other data-centric AI capabilities including:
Detecting outliers and out-of-distribution examples in both training and future test data (tutorial); a minimal sketch appears after this list
Analyzing data labeled by multiple annotators to estimate consensus labels and their quality (tutorial)
Active learning with multiple annotators to identify which data is most informative to label or re-label next (tutorial)
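As a taste of the outlier detection mentioned in the first bullet, here is a minimal sketch using cleanlab’s OutOfDistribution class (assuming you already have feature_embeddings for your data; it can alternatively score pred_probs):
from cleanlab.outlier import OutOfDistribution

ood = OutOfDistribution()
# Lower scores indicate examples that look less like the rest of the dataset (likely outliers)
outlier_scores = ood.fit_score(features=feature_embeddings)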
If you have questions, check out our FAQ and feel free to ask in Slack!
Contributing#
As cleanlab is an open-source project, we welcome contributions from the community.
Please see our contributing guidelines for more information.