cleanlab automatically detects data and label issues in your ML datasets.
1. Install cleanlab
pip install cleanlab
conda install -c cleanlab cleanlab
pip install git+https://github.com/cleanlab/cleanlab.git
2. Find label errors in your data
cleanlab finds issues in any dataset that a classifier can be trained on. The cleanlab package works with any ML model by using model outputs (predicted probabilities) as input – it doesn’t depend on which model created those outputs.
If you’re using a scikit-learn-compatible model (option 1), you don’t need to train a model yourself: pass the model, data, and labels into CleanLearning.find_label_issues and cleanlab will handle model training for you. If you want to use a model that is not sklearn-compatible (option 2), pass the trained model’s out-of-sample predicted probabilities into find_label_issues. Examples of both options are below.
from cleanlab.classification import CleanLearning
from cleanlab.filter import find_label_issues

# Option 1 - works with sklearn-compatible models - just input the data and labels ツ
label_issues_info = CleanLearning(clf=sklearn_compatible_model).find_label_issues(data, labels)

# Option 2 - works with ANY ML model - just input the model's predicted probabilities
ordered_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,  # predicted probabilities from any model (ideally out-of-sample predictions)
    return_indices_ranked_by='self_confidence',
)
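To illustrate what ranking by 'self_confidence' means in option 2: the self-confidence of an example is the probability the model assigns to its given label, and lower values suggest a more likely label error. The sketch below is a simplified illustration in NumPy on made-up values. It shows only the ordering step, not the filtering cleanlab performs to decide which examples count as issues, and the helper name rank_by_self_confidence is hypothetical.

```python
import numpy as np


def rank_by_self_confidence(labels, pred_probs):
    """Return example indices sorted from least to most confident in the given label.

    Hypothetical helper for illustration only; not part of cleanlab's API.
    Assumes labels are integers 0..K-1 matching the columns of pred_probs.
    """
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs)
    # Self-confidence: the predicted probability of each example's given label.
    self_confidence = pred_probs[np.arange(len(labels)), labels]
    # Lowest self-confidence first, so the most suspicious labels come first.
    order = np.argsort(self_confidence, kind="stable")
    return order, self_confidence


# Made-up data: 4 examples, 3 classes.
labels = [0, 1, 2, 1]
pred_probs = [
    [0.9, 0.05, 0.05],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.6, 0.1],
]

order, scores = rank_by_self_confidence(labels, pred_probs)
print(order)   # example indices, most suspicious first
print(scores)  # self-confidence of each example
```

Here the second example (index 1) is ranked first because the model gives its given label a probability of only 0.2.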
CleanLearning (option 1) also works with models from most standard ML frameworks by wrapping the model for scikit-learn compliance, e.g. TensorFlow/Keras (using our KerasWrapperModel), PyTorch (using the skorch package), etc.
By default, find_label_issues returns a boolean mask of label issues. You can instead return the indices of potentially mislabeled examples by setting return_indices_ranked_by in find_label_issues. The indices are ordered by likelihood of a label error (estimated via
Beyond standard classification tasks, cleanlab can also detect mislabeled examples in: multi-label data (e.g. image/document tagging), sequence prediction (e.g. entity recognition), and data labeled by multiple annotators (e.g. crowdsourcing).
cleanlab performs better if the pred_probs from your model are out-of-sample. Details on how to compute out-of-sample predicted probabilities for your entire dataset are available in the cleanlab documentation.
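One common way to obtain out-of-sample predicted probabilities for every example is K-fold cross-validation, where each example's probabilities come from a model that did not see it during training. The following is a minimal sketch using scikit-learn's cross_val_predict on synthetic data. The dataset, classifier, and fold count are illustrative assumptions, not recommendations from cleanlab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic 3-class dataset standing in for your own data and labels.
data, labels = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Each row of pred_probs is predicted by a model trained on the other folds,
# so no example is scored by a model that saw it during training.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000),
    data,
    labels,
    cv=5,
    method="predict_proba",
)

print(pred_probs.shape)  # one row per example, one column per class
```

The resulting array can then be passed as the pred_probs argument of find_label_issues.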
3. Train robust models with noisy labels
When the .fit() method of CleanLearning is called, it automatically removes any examples identified as “noisy” in the provided dataset and returns a model trained only on the clean data.
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning

cl = CleanLearning(clf=LogisticRegression())  # any sklearn-compatible classifier
cl.fit(train_data, labels)

# Estimate the predictions you would have gotten if you trained without mislabeled data.
predictions = cl.predict(test_data)
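The "remove noisy examples, then train on the rest" idea can be sketched with scikit-learn alone. This is a rough stand-in under stated assumptions, not cleanlab's algorithm. The flagging rule (model disagrees with the given label and gives it a probability below a threshold) is a simple heuristic chosen for illustration, and fit_without_suspected_noise is a hypothetical helper.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict


def fit_without_suspected_noise(clf, data, labels, threshold=0.5, cv=5):
    """Hypothetical helper: drop suspicious examples, then train on the rest.

    Assumes labels are integers 0..K-1 and every class appears in the data.
    """
    labels = np.asarray(labels)
    # Out-of-sample probabilities via cross-validation.
    pred_probs = cross_val_predict(clone(clf), data, labels, cv=cv, method="predict_proba")
    self_confidence = pred_probs[np.arange(len(labels)), labels]
    # Stand-in rule (an assumption, not cleanlab's criterion): the model
    # predicts a different class and is not confident in the given label.
    suspected = (pred_probs.argmax(axis=1) != labels) & (self_confidence < threshold)
    # Final model is trained only on the examples that were not flagged.
    model = clone(clf).fit(data[~suspected], labels[~suspected])
    return model, suspected


# Synthetic binary data with 10% of the labels flipped on purpose.
train_data, labels = make_classification(n_samples=400, n_features=8, random_state=0)
rng = np.random.default_rng(0)
flipped = rng.choice(len(labels), size=40, replace=False)
labels[flipped] = 1 - labels[flipped]

model, suspected = fit_without_suspected_noise(
    LogisticRegression(max_iter=1000), train_data, labels
)
print(f"Removed {suspected.sum()} of {len(labels)} examples before the final fit")
predictions = model.predict(train_data)
```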
4. Dataset curation: fix dataset-level issues
cleanlab’s dataset module helps you deal with dataset-level issues: find overlapping classes (classes to merge), rank class-level label quality (classes to keep/delete), and measure overall dataset health (to track dataset quality as you make adjustments).
View all dataset-level issues in one line of code with health_summary:
from cleanlab.dataset import health_summary

health_summary(labels, pred_probs, class_names=class_names)
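The intuition behind finding overlapping classes can be sketched by counting how often examples with one given label are predicted as another class. The snippet below is a simplified stand-in that uses argmax predictions on made-up data. cleanlab's dataset module has its own estimation procedure, which this does not reproduce, and count_class_confusions is a hypothetical name.

```python
import numpy as np


def count_class_confusions(labels, pred_probs):
    """Hypothetical helper: count (given label, predicted class) pairs.

    Returns the raw count matrix and a symmetric off-diagonal matrix whose
    large entries point at pairs of classes that are often confused.
    """
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs)
    num_classes = pred_probs.shape[1]
    predicted = pred_probs.argmax(axis=1)
    counts = np.zeros((num_classes, num_classes), dtype=int)
    # counts[i, j] = number of examples labeled i that the model predicts as j.
    np.add.at(counts, (labels, predicted), 1)
    # Combine both directions of confusion and ignore agreement on the diagonal.
    overlap = counts + counts.T
    np.fill_diagonal(overlap, 0)
    return counts, overlap


# Made-up data: 8 examples, 3 classes; classes 0 and 1 are confused with each other.
labels = [0, 0, 0, 1, 1, 1, 2, 2]
pred_probs = [
    [0.8, 0.1, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.6, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
]

counts, overlap = count_class_confusions(labels, pred_probs)
i, j = np.unravel_index(overlap.argmax(), overlap.shape)
print(counts)
print(f"Most confused pair of classes: {i} and {j}")
```

A pair with a large combined count is a candidate for merging or for a closer look at its labeling guidelines.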
5. Improve your data via many other techniques
Beyond handling label errors, cleanlab supports other data-centric AI capabilities including:
Detecting outliers and out-of-distribution examples in both training and future test data (tutorial)
Analyzing data labeled by multiple annotators to estimate consensus labels and their quality (tutorial)
Active learning with multiple annotators to identify which data is most informative to label or re-label next (tutorial)
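As a point of reference for the multi-annotator item above, the simplest baseline for a consensus label is a majority vote, with the fraction of agreeing annotators as a crude quality signal. The sketch below shows that baseline only. It is not cleanlab's method, and the majority_vote helper and the use of -1 to mark a missing annotation are assumptions made for illustration.

```python
import numpy as np


def majority_vote(annotations, num_classes, missing=-1):
    """Hypothetical baseline: majority-vote consensus label per example.

    annotations: rows are examples, columns are annotators; `missing` marks
    an annotator who did not label that example. Assumes every example has
    at least one annotation. Ties go to the lowest class index.
    """
    annotations = np.asarray(annotations)
    consensus = np.empty(len(annotations), dtype=int)
    agreement = np.empty(len(annotations), dtype=float)
    for idx, row in enumerate(annotations):
        given = row[row != missing]
        votes = np.bincount(given, minlength=num_classes)
        consensus[idx] = votes.argmax()
        # Fraction of this example's annotators who chose the consensus label.
        agreement[idx] = votes.max() / len(given)
    return consensus, agreement


# Made-up data: 4 examples, 3 annotators, 3 classes.
annotations = [
    [0, 0, 1],
    [1, -1, 1],
    [2, 0, -1],
    [1, 1, 1],
]

consensus, agreement = majority_vote(annotations, num_classes=3)
print(consensus)
print(agreement)
```

Examples with low agreement, such as the third one here, are natural candidates for review or re-labeling.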
As cleanlab is an open-source project, we welcome contributions from the community.
Please see our contributing guidelines for more information.