cleanlab automatically finds and fixes label issues in your ML datasets.
pip install cleanlab
conda install -c cleanlab cleanlab
pip install git+https://github.com/cleanlab/cleanlab.git
2. Find label errors in your data
cleanlab finds issues in any dataset that a classifier can be trained on. The cleanlab package works with any model by using model outputs (predicted probabilities) as input – it doesn’t depend on which model created those outputs.
If you’re using a scikit-learn-compatible model (option 1), you don’t need to train it yourself: pass the model, data, and labels into CleanLearning.find_label_issues and cleanlab will handle model training for you. For any model that is not sklearn-compatible (option 2), pass the trained model’s out-of-sample predicted probabilities into find_label_issues. Examples of both options are below.
from cleanlab.classification import CleanLearning
from cleanlab.filter import find_label_issues

# Option 1 - works with sklearn-compatible models - just input the data and labels ツ
label_issues_info = CleanLearning(clf=sklearn_compatible_model).find_label_issues(data, labels)

# Option 2 - works with ANY ML model - just input the model's predicted probabilities
ordered_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,  # out-of-sample predicted probabilities from any model
    return_indices_ranked_by='self_confidence',
)
CleanLearning (option 1) also works with models from most standard ML frameworks by wrapping the model for scikit-learn compliance, e.g. huggingface/tensorflow/keras (using our KerasWrapperModel), pytorch (using skorch package), etc.
By default, find_label_issues returns a boolean mask of label issues. You can instead return the indices of potentially mislabeled examples by setting return_indices_ranked_by in
find_label_issues. The indices are ordered by likelihood of a label error (estimated via
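To make the ranking concrete, here is a minimal pure-Python sketch. It assumes that the 'self_confidence' score is the predicted probability the model assigns to each example's given label, so a low score suggests a possible label error. The helper name rank_by_self_confidence and the toy numbers are made up for illustration, and unlike find_label_issues this sketch ranks every example rather than only the flagged ones.

```python
def rank_by_self_confidence(labels, pred_probs):
    """Return example indices sorted from least to most self-confident.

    Self-confidence is taken here to be the predicted probability of each
    example's *given* label (an assumption of this sketch).
    """
    self_confidence = [probs[label] for label, probs in zip(labels, pred_probs)]
    return sorted(range(len(labels)), key=lambda i: self_confidence[i])


# Toy data (invented for illustration): 4 examples, 2 classes.
labels = [0, 1, 0, 1]
pred_probs = [
    [0.9, 0.1],  # given label 0, model agrees
    [0.8, 0.2],  # given label 1, model strongly disagrees
    [0.6, 0.4],  # given label 0, model mildly agrees
    [0.3, 0.7],  # given label 1, model agrees
]

ranked = rank_by_self_confidence(labels, pred_probs)
print(ranked)  # → [1, 2, 3, 0]
```

Example 1 comes first because the model gives its given label only 0.2 probability, which makes it the most suspicious of the four.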
The predicted probabilities, pred_probs, from your model must be out-of-sample. Never provide predictions on the same data points used to train the model: these predictions are overfit and unsuitable for finding label errors. Details on how to compute out-of-sample predicted probabilities for your entire dataset are here.
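One common way to get out-of-sample predicted probabilities for every example is K-fold cross-validation. The sketch below uses scikit-learn's cross_val_predict with method="predict_proba"; the synthetic dataset, the choice of LogisticRegression, and cv=5 are assumptions for illustration, not requirements of cleanlab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for your own features and (possibly noisy) labels.
data, labels = make_classification(
    n_samples=200,
    n_features=5,
    n_informative=3,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# Each row of pred_probs is predicted by a model copy that was trained
# on the other folds, so it never saw that row during training.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000),
    data,
    labels,
    cv=5,
    method="predict_proba",
)

print(pred_probs.shape)  # one row per example, one column per class
```

The resulting array can then be passed as pred_probs alongside labels.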
3. Train robust models with noisy labels
When CleanLearning’s .fit() method is called, it automatically removes any examples identified as “noisy” in the provided dataset and returns a model trained only on the clean data.
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning

cl = CleanLearning(clf=LogisticRegression())  # any sklearn-compatible classifier
cl.fit(train_data, labels)

# Estimate the predictions you would have gotten if you trained without mislabeled data.
predictions = cl.predict(test_data)
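The prune-then-retrain idea can be illustrated by hand with scikit-learn alone. This is a deliberately simplified sketch and not cleanlab's actual filtering logic: the flagging rule (the out-of-sample prediction disagrees with the given label) is a stand-in assumption, and the synthetic data and variable names are invented for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic, well-separated two-class data with 10 labels flipped on purpose.
X, true_labels = make_blobs(
    n_samples=200, centers=[[-5, 0], [5, 0]], cluster_std=1.0, random_state=0
)
noisy_labels = true_labels.copy()
noisy_labels[:10] = 1 - noisy_labels[:10]

clf = LogisticRegression(max_iter=1000)

# Step 1: out-of-sample predicted probabilities via cross-validation.
pred_probs = cross_val_predict(clf, X, noisy_labels, cv=5, method="predict_proba")

# Step 2 (simplified stand-in rule): flag examples whose given label
# differs from the model's out-of-sample prediction.
issue_mask = pred_probs.argmax(axis=1) != noisy_labels

# Step 3: retrain on the examples that were not flagged.
clf.fit(X[~issue_mask], noisy_labels[~issue_mask])

print(f"flagged {issue_mask.sum()} of {len(X)} examples")
```

CleanLearning wraps this kind of workflow behind a single .fit() call, so you do not have to manage the mask yourself.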
4. Dataset curation: fix dataset-level issues
cleanlab’s dataset module helps you deal with dataset-level issues by finding overlapping classes (classes to merge), ranking class-level label quality (classes to keep/delete), and measuring overall dataset health (to track dataset quality as you make adjustments).
The example below shows how to view all dataset-level issues in one line of code with dataset.health_summary(). Check out the dataset tutorial for more examples.
from cleanlab.dataset import health_summary

health_summary(labels, pred_probs, class_names=class_names)