cleanlab automatically finds and fixes label issues in your ML datasets.
1. Install cleanlab
With pip:
pip install cleanlab
With conda:
conda install -c cleanlab cleanlab
From source (the GitHub repository):
pip install git+https://github.com/cleanlab/cleanlab.git
2. Find label errors in your data
The find_label_issues function tells you which examples in your dataset are likely mislabeled. At a minimum, it expects two inputs: your data's given labels, labels, and predicted probabilities, pred_probs, from a trained classification model. These must be out-of-sample predictions, meaning each data point was held out from the model during training; they can be obtained via cross-validation.
Setting return_indices_ranked_by in this function instructs cleanlab to return the indices of potentially mislabeled examples, ordered by how likely their given label is to be incorrect. This likelihood is estimated via a label quality score, which can be specified as, for example, 'self_confidence' (the predicted probability of the given label).
from cleanlab.filter import find_label_issues

ordered_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by='self_confidence',
)
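To make the 'self_confidence' score concrete, here is a minimal NumPy sketch of how such a score could be computed and used to rank examples. The toy labels and pred_probs values are made up for illustration, and the sketch ranks every example rather than reproducing how cleanlab decides which examples to flag.

```python
import numpy as np

# Toy inputs (illustrative values only): 4 examples, 3 classes.
labels = np.array([0, 1, 2, 1])
pred_probs = np.array([
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],   # given label is 2, but the model prefers class 0
    [0.2, 0.5, 0.3],
])

# Self-confidence: the predicted probability assigned to each example's given label.
self_confidence = pred_probs[np.arange(len(labels)), labels]

# Lower scores mean the given label is more suspect, so sort ascending.
ranked_indices = np.argsort(self_confidence)
```

Here the third example (index 2) comes first in ranked_indices because the model assigns only 0.1 probability to its given label.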
The predicted probabilities, pred_probs, from your model must be out-of-sample! Never provide predictions on the same data points used to train the model: such predictions are overfit and unsuitable for finding label errors. To compute out-of-sample predicted probabilities for your entire dataset, you can use cross-validation.
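One common way to get out-of-sample predicted probabilities is scikit-learn's cross_val_predict with method="predict_proba". The sketch below assumes a scikit-learn compatible classifier and uses synthetic data from make_classification as a stand-in for your own X and labels; the choice of LogisticRegression and 5 folds is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data; replace X and labels with your own.
X, labels = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

clf = LogisticRegression(max_iter=1000)

# Each row of pred_probs comes from a model fit on folds that exclude that row.
pred_probs = cross_val_predict(clf, X, labels, cv=5, method="predict_proba")
```

The resulting pred_probs array has one row per example and one column per class, which is the shape find_label_issues expects alongside labels.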
3. Train robust models with noisy labels
When the .fit() method of cleanlab's CleanLearning class is called, it automatically removes any examples identified as “noisy” in the provided dataset and returns a model trained only on the clean data.
from sklearn.linear_model import LogisticRegression
from cleanlab.classification import CleanLearning

clf = LogisticRegression()  # any classifier implementing the sklearn API
cl = CleanLearning(clf=clf)
cl.fit(X=X, labels=labels)
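The general filter-then-retrain idea can be sketched by hand with scikit-learn alone. This is a simplified illustration, not CleanLearning's actual procedure: it flags an example as suspect whenever the out-of-sample predicted class disagrees with the given label, which is a much cruder rule. The synthetic data and the deliberately flipped labels are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data with 30 labels flipped on purpose.
X, true_labels = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
rng = np.random.default_rng(0)
labels = true_labels.copy()
flipped = rng.choice(len(labels), size=30, replace=False)
labels[flipped] = (labels[flipped] + 1) % 3

# Step 1: out-of-sample predicted probabilities via cross-validation.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels, cv=5, method="predict_proba"
)

# Step 2: flag suspect examples (simplified rule: prediction disagrees with given label).
is_suspect = pred_probs.argmax(axis=1) != labels

# Step 3: retrain on the examples that were not flagged.
clean_model = LogisticRegression(max_iter=1000).fit(X[~is_suspect], labels[~is_suspect])
```

A rule this simple will also discard correctly labeled examples that the model merely finds hard, so treat it only as a picture of the workflow.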
4. Dataset curation: fix dataset-level issues
The cleanlab.dataset module helps you deal with dataset-level issues by finding overlapping classes (classes to merge), ranking class-level label quality (classes to keep/delete), and measuring overall dataset health (to track dataset quality as you make adjustments).
The example below shows how to view all dataset-level issues in one line of code with dataset.health_summary(). Check out the dataset tutorial for more examples.
from cleanlab.dataset import health_summary

health_summary(labels, pred_probs, class_names=class_names)
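To give a rough intuition for what "overlapping classes" means, the sketch below counts how often examples with one given label are predicted as another class and picks the most confused pair. This is a crude proxy built from a plain given-versus-predicted count matrix, not the method health_summary uses, and the toy values are made up for illustration.

```python
import numpy as np

# Toy inputs (illustrative values only): 6 examples, 3 classes.
labels = np.array([0, 0, 1, 1, 2, 2])
pred_probs = np.array([
    [0.8, 0.1, 0.1],
    [0.3, 0.6, 0.1],   # labeled 0, predicted 1
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],   # labeled 1, predicted 0
    [0.1, 0.1, 0.8],
    [0.0, 0.2, 0.8],
])
num_classes = pred_probs.shape[1]
predicted = pred_probs.argmax(axis=1)

# counts[i, j] = number of examples with given label i and predicted class j.
counts = np.zeros((num_classes, num_classes), dtype=int)
np.add.at(counts, (labels, predicted), 1)

# Off-diagonal mass shared by a pair of classes suggests the two overlap.
overlap = counts + counts.T
np.fill_diagonal(overlap, 0)
class_a, class_b = np.unravel_index(overlap.argmax(), overlap.shape)
```

In this toy data, classes 0 and 1 are confused in both directions, so they come out as the candidate pair to inspect or merge.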