Improving ML Performance via Data Curation with Train vs Test Splits#
In typical Machine Learning projects, we split our dataset into training data for fitting models and test data to evaluate model performance. For noisy real-world datasets, detecting/correcting errors in the training data is important to train robust models, but it’s less recognized that the test set can also be noisy. For accurate model evaluation, it is vital to find and fix issues in the test data as well. Some evaluation metrics are particularly sensitive to outliers and noisy labels. This tutorial demonstrates a way to use cleanlab (via Datalab) to curate both your training and test data, ensuring robust model training and reliable performance evaluation. We recommend first completing some Datalab tutorials before diving into this more complex subject.
Here’s how we recommend handling noisy training and test data (this tutorial walks through these steps):
Preprocess your training and test data to be suitable for ML. Use cleanlab to check for fundamental train/test setup problems in the merged dataset like train/test leakage or drift.
Fit your ML model to your noisy training data and get its predictions/embeddings for your test data. Use these model outputs with cleanlab to detect issues in your test data.
Manually review/correct cleanlab-detected issues in your test data. We caution against blindly automated correction of test data. Changes to your test set should be carefully verified to ensure they will lead to more accurate model evaluation. We also caution against comparing the performance of different ML models across different versions of your test data; performance comparisons between models should be based on the same test data.
Cross-validate a new copy of your ML model on your training data, and then use it with cleanlab to detect issues in the training dataset. Do not include test data in any part of this step to avoid leaking test set information into the training data curation.
You can try automated techniques to curate your training data based on cleanlab results, train models on the curated training data, and evaluate them on the cleaned test data.
Consider this tutorial as a blueprint for using cleanlab in diverse ML projects spanning various data modalities. The same ideas apply if you substitute test data with validation data above. In a final advanced section of this tutorial, we show how training data edits can be parameterized in terms of cleanlab’s detected issues, such that hyperparameter optimization can identify the optimal combination of data edits for training an effective ML model.
Note: This tutorial trains an XGBoost model on a tabular dataset, but the same approach applies to any ML model and data modality.
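For instance, any classifier following the scikit-learn fit/predict_proba interface could stand in for XGBoost below. Here is a minimal sketch (the specific model shown is just an illustration, not part of the original tutorial):
# Hypothetical substitution: any scikit-learn-compatible classifier with predict_proba
# can replace XGBClassifier in the steps that follow.
from sklearn.ensemble import HistGradientBoostingClassifier

clf = HistGradientBoostingClassifier(random_state=123456)
# clf.fit(train_features, train_labels) and clf.predict_proba(test_features) as in the XGBoost cells below.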
Why did you make this tutorial?#
TLDR: Reliable ML requires both reliable training and reliable evaluation. This tutorial shows you how to achieve both using cleanlab.
Longer answer: Many users wish to use cleanlab to improve their ML model by improving their data, but make subtle mistakes. This multi-step tutorial shows one way to do this properly. Some users curate (e.g. fix label issues in) their training data, train an ML model, and evaluate it on test data. But they see no improvement in test-set accuracy, because they have introduced distribution shift by altering their training data. If the test data also has issues, those must be fixed too for a faithful model evaluation. Other users therefore curate their test data as well, but some blindly auto-fix their test data, which is dangerous! The cleanlab package is based on ML and is thus inevitably imperfect. Issues that cleanlab detects in test data should not be blindly auto-fixed; this risks making model evaluation wrong. Instead we recommend the multi-step workflow above, where less algorithmic/automated correction is applied to the test data than to the training data (focus your manual efforts on curating test rather than training data).
1. Install dependencies#
Datalab has additional dependencies that are not included in the standard installation of cleanlab. You can use pip to install all packages required for this tutorial as follows:
!pip install xgboost
!pip install "cleanlab[datalab]"
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
# !pip install git+https://github.com/cleanlab/cleanlab.git
[2]:
import random
import os
import math
import numpy as np
from xgboost import XGBClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import pandas as pd
import cleanlab
from cleanlab import Datalab
SEED = 123456 # for reproducibility
np.random.seed(SEED)
random.seed(SEED)
2. Preprocess the data#
This tutorial considers a classification task with structured/tabular data. The ML task is to predict each student’s final grade in a course (class label) based on various numeric/categorical features about them (exam scores and notes).
[3]:
df_train = pd.read_csv(
"https://cleanlab-public.s3.amazonaws.com/Datasets/student-grades/clos_train_data.csv"
)
df_test = pd.read_csv(
"https://cleanlab-public.s3.amazonaws.com/Datasets/student-grades/clos_test_data.csv"
)
df_train.head()
[3]:
 | stud_ID | exam_1 | exam_2 | exam_3 | notes | noisy_letter_grade |
---|---|---|---|---|---|---|
0 | 018bff | 94.0 | 41.0 | 91.0 | great participation +10 | B |
1 | 076d92 | 0.0 | 79.0 | 65.0 | cheated on exam, gets 0pts | F |
2 | c80059 | 86.0 | 89.0 | 85.0 | great final presentation +10 | F |
3 | e38f8a | 50.0 | 67.0 | 94.0 | great final presentation +10 | B |
4 | d57e1a | 92.0 | 79.0 | 98.0 | great final presentation +10 | A |
Before training an ML model, we preprocess our dataset. The best type of preprocessing depends on which ML model you use. This tutorial demonstrates an XGBoost model, so we’ll encode the notes and noisy_letter_grade columns as categorical columns for this model (each category encoded as an integer). You can alternatively use Cleanlab Studio, which will automatically produce a high-accuracy ML model for your raw data, without you having to worry about any ML modeling or data preprocessing work.
[4]:
# Create label encoders for the categorical columns
grade_le = preprocessing.LabelEncoder()
notes_le = preprocessing.LabelEncoder()
# Process the feature columns
train_features = df_train.drop(["stud_ID", "noisy_letter_grade"], axis=1).copy()
train_features["notes"] = notes_le.fit_transform(train_features["notes"])
train_features["notes"] = train_features["notes"].astype("category")
# Process the label column
train_labels = pd.DataFrame(grade_le.fit_transform(df_train["noisy_letter_grade"].copy()), columns=["noisy_letter_grade"])
# Keep separate copies of these training features and labels for later use
train_features_v2 = train_features.copy()
train_labels_v2 = train_labels.copy()
We fit the preprocessing on the training data alone to avoid information leakage (i.e. using test-set information that would not be available at prediction time). Here’s how the preprocessed training features look:
[5]:
train_features.head()
[5]:
 | exam_1 | exam_2 | exam_3 | notes |
---|---|---|---|---|
0 | 94.0 | 41.0 | 91.0 | 2 |
1 | 0.0 | 79.0 | 65.0 | 0 |
2 | 86.0 | 89.0 | 85.0 | 1 |
3 | 50.0 | 67.0 | 94.0 | 1 |
4 | 92.0 | 79.0 | 98.0 | 1 |
We apply the same preprocessing to the test data.
[6]:
test_features = df_test.drop(
["stud_ID", "noisy_letter_grade"], axis=1
).copy()
test_features["notes"] = notes_le.transform(test_features["notes"])
test_features["notes"] = test_features["notes"].astype("category")
test_labels = pd.DataFrame(grade_le.transform(df_test["noisy_letter_grade"].copy()), columns=["noisy_letter_grade"])
We then appropriately format the datasets for the ML model used in this tutorial.
[7]:
train_labels = train_labels.astype('object')
test_labels = test_labels.astype('object')
train_features["notes"] = train_features["notes"].astype(int)
test_features["notes"] = test_features["notes"].astype(int)
preprocessed_train_data = pd.concat([train_features, train_labels], axis=1)
preprocessed_train_data["stud_ID"] = df_train["stud_ID"]
preprocessed_test_data = pd.concat([test_features, test_labels], axis=1)
preprocessed_test_data["stud_ID"] = df_test["stud_ID"]
3. Check for fundamental problems in the train/test setup#
Before training any ML model, we can quickly check for fundamental issues in our setup with cleanlab. To audit all of our data at once, we merge the training and test sets into one dataset, from which we construct a Datalab object. Datalab automatically detects many types of common issues in a dataset, but requires a trained ML model for a comprehensive audit. We haven’t trained any model yet, so here we instruct Datalab to only check for specific data issues: near duplicates, and whether the data appears non-IID (violations of the IID assumption include data drift or a lack of statistical independence between data points).
Datalab can detect many additional types of data issues, depending on what inputs it is given. Below we provide the numeric features from features_df as the sole input to Datalab.find_issues(). If you have heterogeneous/complex data types (e.g. text or images), you could instead provide vector feature representations (e.g. pretrained model embeddings) of your data as the features.
[8]:
full_df = pd.concat([preprocessed_train_data, preprocessed_test_data], axis=0).reset_index(drop=True)
features_df = full_df.drop(["noisy_letter_grade", "stud_ID"], axis=1) # can instead use model embeddings
[9]:
lab = Datalab(data=full_df, label_name="noisy_letter_grade", task="classification")
lab.find_issues(features=features_df.to_numpy(), issue_types={"near_duplicate": {}, "non_iid": {}})
lab.report(show_summary_score=True, show_all_issues=True)
Finding near_duplicate issues ...
Finding non_iid issues ...
Audit complete. 100 issues found in the dataset.
Dataset Information: num_examples: 749, num_classes: 5
Here is a summary of various issues found in your data:
issue_type score num_issues
near_duplicate 0.583745 100
non_iid 0.291382 0
(Note: A lower score indicates a more severe issue across all examples in the dataset.)
Learn about each issue: https://docs.cleanlab.ai/stable/cleanlab/datalab/guide/issue_type_description.html
See which examples in your dataset exhibit each issue via: `datalab.get_issues(<ISSUE_NAME>)`
Data indices corresponding to top examples of each issue are shown below.
------------------ near_duplicate issues -------------------
About this issue:
A (near) duplicate issue refers to two or more examples in
a dataset that are extremely similar to each other, relative
to the rest of the dataset. The examples flagged with this issue
may be exactly duplicated, or lie atypically close together when
represented as vectors (i.e. feature embeddings).
Number of examples with this issue: 100
Overall dataset quality in terms of this issue: 0.5837
Examples representing most severe instances of this issue:
is_near_duplicate_issue near_duplicate_score near_duplicate_sets distance_to_nearest_neighbor
748 True 0.0 [604] 0.0
510 True 0.0 [227] 0.0
71 True 0.0 [719] 0.0
65 True 0.0 [690, 444] 0.0
547 True 0.0 [647] 0.0
---------------------- non_iid issues ----------------------
About this issue:
Whether the dataset exhibits statistically significant
violations of the IID assumption like:
changepoints or shift, drift, autocorrelation, etc.
The specific violation considered is whether the
examples are ordered such that almost adjacent examples
tend to have more similar feature values.
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 0.2914
Examples representing most severe instances of this issue:
is_non_iid_issue non_iid_score
611 False 0.687869
610 False 0.687883
612 False 0.688146
609 False 0.688189
613 False 0.688713
Additional Information:
p-value: 0.2913818469137725
cleanlab does not find significant evidence that our data is non-IID, which is good. Otherwise, we’d need to further consider where our data came from and whether conclusions/predictions from this dataset can really generalize to our population of interest.
But cleanlab did detect many near duplicates in the dataset. We see some exact duplicates between our training and test data, which may indicate data leakage! Since we didn’t expect these duplicates in our dataset, let’s drop the training-set copies of data points that also appear in the test set. This helps ensure that our model evaluations reflect generalization capabilities. Here’s how we can review the near duplicates detected via Datalab.
[10]:
full_duplicate_results = lab.get_issues("near_duplicate")
full_duplicate_results.sort_values("near_duplicate_score").head()
[10]:
 | is_near_duplicate_issue | near_duplicate_score | near_duplicate_sets | distance_to_nearest_neighbor |
---|---|---|---|---|
748 | True | 0.0 | [604] | 0.0 |
510 | True | 0.0 | [227] | 0.0 |
71 | True | 0.0 | [719] | 0.0 |
65 | True | 0.0 | [690, 444] | 0.0 |
547 | True | 0.0 | [647] | 0.0 |
To distinguish between near vs. exact duplicates, we can filter where the distance_to_nearest_neighbor column has value = 0. We specifically filter for exact duplicates between our training and test set in order to drop the extra copies of such data points from our training set.
[11]:
train_idx_cutoff = len(preprocessed_train_data) - 1 # last index of training data in the merged dataset
# Create column to list which duplicate sets include some test data:
full_duplicate_results['nd_set_has_index_over_training_cutoff'] = full_duplicate_results['near_duplicate_sets'].apply(lambda x: any(i > train_idx_cutoff for i in x))
exact_duplicates = full_duplicate_results.query('is_near_duplicate_issue == True and near_duplicate_score == 0.0 and nd_set_has_index_over_training_cutoff == True').sort_values("near_duplicate_score")
exact_duplicates
[11]:
 | is_near_duplicate_issue | near_duplicate_score | near_duplicate_sets | distance_to_nearest_neighbor | nd_set_has_index_over_training_cutoff |
---|---|---|---|---|---|
33 | True | 0.0 | [627] | 0.0 | True |
53 | True | 0.0 | [678] | 0.0 | True |
65 | True | 0.0 | [690, 444] | 0.0 | True |
71 | True | 0.0 | [719] | 0.0 | True |
82 | True | 0.0 | [709] | 0.0 | True |
100 | True | 0.0 | [615] | 0.0 | True |
292 | True | 0.0 | [620] | 0.0 | True |
420 | True | 0.0 | [704] | 0.0 | True |
431 | True | 0.0 | [688] | 0.0 | True |
459 | True | 0.0 | [672] | 0.0 | True |
547 | True | 0.0 | [647] | 0.0 | True |
564 | True | 0.0 | [696] | 0.0 | True |
604 | True | 0.0 | [748] | 0.0 | True |
605 | True | 0.0 | [723] | 0.0 | True |
[12]:
exact_duplicates_indices = exact_duplicates.index
exact_duplicates_indices
[12]:
Index([33, 53, 65, 71, 82, 100, 292, 420, 431, 459, 547, 564, 604, 605], dtype='int64')
Below we remove the exact duplicates that occur between our training and test sets from the training data.
[13]:
indices_of_duplicates_to_drop = [idx for idx in exact_duplicates_indices if idx <= train_idx_cutoff]
indices_of_duplicates_to_drop
[13]:
[33, 53, 65, 71, 82, 100, 292, 420, 431, 459, 547, 564, 604, 605]
Here are the examples we’ll drop from our training data, since they are exact duplicates of test examples.
[14]:
full_df.iloc[indices_of_duplicates_to_drop]
[14]:
 | exam_1 | exam_2 | exam_3 | notes | noisy_letter_grade | stud_ID |
---|---|---|---|---|---|---|
33 | 83.0 | 92.0 | 80.0 | 3 | 2 | 4a3f75 |
53 | 91.0 | 0.0 | 94.0 | 0 | 3 | d030b5 |
65 | 93.0 | 73.0 | 82.0 | 5 | 1 | ddd0ba |
71 | 90.0 | 95.0 | 75.0 | 1 | 0 | 8e6d24 |
82 | 78.0 | 81.0 | 74.0 | 4 | 3 | 464aab |
100 | 80.0 | 96.0 | 83.0 | 4 | 2 | ee3387 |
292 | 79.0 | 62.0 | 82.0 | 5 | 2 | 61e807 |
420 | 99.0 | 53.0 | 76.0 | 5 | 2 | 71d7b9 |
431 | 90.0 | 92.0 | 88.0 | 2 | 0 | 83e31f |
459 | 70.0 | 63.0 | 95.0 | 2 | 1 | edeb53 |
547 | 68.0 | 93.0 | 73.0 | 5 | 2 | cd52b5 |
564 | 84.0 | 92.0 | 86.0 | 5 | 1 | 454e51 |
604 | 87.0 | 74.0 | 95.0 | 3 | 2 | 042686 |
605 | 96.0 | 83.0 | 73.0 | 1 | 0 | 12a73f |
[15]:
df_train = df_train.drop(indices_of_duplicates_to_drop, axis=0).reset_index(drop=True)
train_features = train_features.drop(indices_of_duplicates_to_drop, axis=0).reset_index(drop=True)
train_labels = train_labels.drop(indices_of_duplicates_to_drop, axis=0).reset_index(drop=True).astype(int)
4. Train model with original (noisy) training data#
After handling fundamental issues in our training/test setup, let’s fit our ML model to the training data. Here we use XGBoost as an example, but the same ideas of this tutorial apply to any other ML model.
[16]:
train_labels = train_labels["noisy_letter_grade"]
clf = XGBClassifier(tree_method="hist", enable_categorical=True, random_state=SEED)
clf.fit(train_features, train_labels)
[16]:
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, device=None, early_stopping_rounds=None, enable_categorical=True, eval_metric=None, feature_types=None, gamma=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=None, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=None, n_jobs=None, num_parallel_tree=None, objective='multi:softprob', ...)
Compute out-of-sample predicted probabilities for the test data from this baseline model#
Make sure that the columns of your predicted class probabilities are properly ordered with respect to the ordering of classes, which for Datalab is: lexicographically sorted by class name.
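If you are unsure whether this holds for your model, here is a quick sanity check (a minimal sketch, not part of the original notebook; it assumes the classifier exposes a classes_ attribute, as scikit-learn-compatible models like XGBClassifier do):
# Sanity-check sketch: Datalab expects pred_probs columns ordered by lexicographically sorted
# class names. Our labels are already integer-encoded (0, 1, 2, ...), so the fitted model's
# class order should already match; verify before passing pred_probs to Datalab.
import numpy as np
assert np.array_equal(clf.classes_, np.sort(clf.classes_)), "Reorder pred_probs columns to match sorted class order"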
[17]:
test_pred_probs = clf.predict_proba(test_features)
5. Check for issues in test data and manually address them#
While we could evaluate our model’s accuracy using the predictions above, this will be unreliable if the test data have issues. Based on the given labels, model predictions, and feature representations, Datalab can automatically detect issues lurking in our test data.
[18]:
test_lab = Datalab(data=df_test, label_name="noisy_letter_grade", task="classification")
test_features_array = test_features.to_numpy() # could alternatively be model embeddings
test_lab.find_issues(features=test_features_array, pred_probs=test_pred_probs)
test_lab.report(show_summary_score=True, show_all_issues=True)
Finding null issues ...
Finding label issues ...
Finding outlier issues ...
Finding near_duplicate issues ...
Finding non_iid issues ...
Finding class_imbalance issues ...
Finding underperforming_group issues ...
Audit complete. 30 issues found in the dataset.
Dataset Information: num_examples: 134, num_classes: 5
Here is a summary of various issues found in your data:
issue_type score num_issues
label 0.798507 25
outlier 0.370259 5
null 1.000000 0
near_duplicate 0.625352 0
non_iid 0.524042 0
class_imbalance 0.097015 0
underperforming_group 1.000000 0
(Note: A lower score indicates a more severe issue across all examples in the dataset.)
Learn about each issue: https://docs.cleanlab.ai/stable/cleanlab/datalab/guide/issue_type_description.html
See which examples in your dataset exhibit each issue via: `datalab.get_issues(<ISSUE_NAME>)`
Data indices corresponding to top examples of each issue are shown below.
----------------------- label issues -----------------------
About this issue:
Examples whose given label is estimated to be potentially incorrect
(e.g. due to annotation error) are flagged as having label issues.
Number of examples with this issue: 25
Overall dataset quality in terms of this issue: 0.7985
Examples representing most severe instances of this issue:
is_label_issue label_score given_label predicted_label
70 True 0.000537 F A
90 False 0.000903 F C
79 False 0.001743 F C
106 True 0.001853 F A
46 True 0.002121 F A
---------------------- outlier issues ----------------------
About this issue:
Examples that are very different from the rest of the dataset
(i.e. potentially out-of-distribution or rare/anomalous instances).
Number of examples with this issue: 5
Overall dataset quality in terms of this issue: 0.3703
Examples representing most severe instances of this issue:
is_outlier_issue outlier_score
63 True 4.752463e-99
89 True 3.784418e-09
40 True 5.477741e-06
57 True 1.134230e-05
32 True 7.153555e-03
----------------------- null issues ------------------------
About this issue:
Examples identified with the null issue correspond to rows that have null/missing values across all feature columns (i.e. the entire row is missing values).
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 1.0000
Examples representing most severe instances of this issue:
is_null_issue null_score
0 False 1.0
97 False 1.0
96 False 1.0
95 False 1.0
94 False 1.0
------------------ near_duplicate issues -------------------
About this issue:
A (near) duplicate issue refers to two or more examples in
a dataset that are extremely similar to each other, relative
to the rest of the dataset. The examples flagged with this issue
may be exactly duplicated, or lie atypically close together when
represented as vectors (i.e. feature embeddings).
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 0.6254
Examples representing most severe instances of this issue:
is_near_duplicate_issue near_duplicate_score near_duplicate_sets distance_to_nearest_neighbor
43 False 0.143272 [] 0.000016
93 False 0.143272 [] 0.000016
20 False 0.146501 [] 0.000016
83 False 0.146501 [] 0.000016
75 False 0.161431 [] 0.000018
---------------------- non_iid issues ----------------------
About this issue:
Whether the dataset exhibits statistically significant
violations of the IID assumption like:
changepoints or shift, drift, autocorrelation, etc.
The specific violation considered is whether the
examples are ordered such that almost adjacent examples
tend to have more similar feature values.
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 0.5240
Examples representing most severe instances of this issue:
is_non_iid_issue non_iid_score
12 False 0.765240
35 False 0.771221
28 False 0.801589
7 False 0.801652
112 False 0.810735
Additional Information:
p-value: 0.5240417899434826
------------------ class_imbalance issues ------------------
About this issue:
Examples belonging to the most under-represented class in the dataset.
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 0.0970
Examples representing most severe instances of this issue:
is_class_imbalance_issue class_imbalance_score given_label
88 False 0.097015 F
70 False 0.097015 F
2 False 0.097015 F
71 False 0.097015 F
46 False 0.097015 F
Additional Information:
Rarest Class: NA
--------------- underperforming_group issues ---------------
About this issue:
An underperforming group refers to a cluster of similar examples
(i.e. a slice) in the dataset for which the ML model predictions
are particularly poor (loss evaluation over this subpopulation is high).
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 1.0000
Examples representing most severe instances of this issue:
is_underperforming_group_issue underperforming_group_score
0 False 1.0
97 False 1.0
96 False 1.0
95 False 1.0
94 False 1.0
Datalab automatically audits our dataset for various common issues. The report above indicates many label issues in our data.
We can see which examples are estimated to be mislabeled (as well as a numeric quality score quantifying how likely their label is correct) via the get_issues() method. To review the most likely label errors, we sort our data by the label_score (a lower score represents that the label is less likely to be correct).
[19]:
test_label_issue_results = test_lab.get_issues("label")
test_label_issues_ordered = df_test.join(test_label_issue_results)
test_label_issues_ordered = test_label_issues_ordered[test_label_issue_results["is_label_issue"] == True].sort_values("label_score")
print(test_label_issues_ordered)
stud_ID exam_1 exam_2 exam_3 notes \
70 2bd759 93.0 79.0 97.0 great participation +10
106 34ccdd 90.0 100.0 89.0 great participation +10
46 bb3bab 97.0 88.0 74.0 great participation +10
103 bf1b14 66.0 83.0 96.0 NaN
97 4787de 73.0 84.0 68.0 great participation +10
92 865cbd 95.0 87.0 82.0 missed class frequently -10
72 32d53f 71.0 78.0 80.0 great final presentation +10
22 5b2f76 99.0 86.0 95.0 missed class frequently -10
3 28f8b4 67.0 82.0 98.0 NaN
69 df814d 78.0 85.0 84.0 NaN
45 f17261 95.0 88.0 69.0 NaN
98 1db3ff 95.0 81.0 76.0 NaN
109 ded944 86.0 85.0 89.0 NaN
124 343dd3 67.0 87.0 95.0 missed homework frequently -10
20 8d904d 73.0 73.0 76.0 missed class frequently -10
83 e4f0d5 86.0 85.0 89.0 missed homework frequently -10
120 d6d208 97.0 97.0 92.0 missed homework frequently -10
29 76c083 91.0 92.0 74.0 NaN
63 d030b5 91.0 0.0 94.0 cheated on exam, gets 0pts
23 695f96 96.0 69.0 92.0 NaN
84 745c23 89.0 95.0 72.0 NaN
10 13b36e 98.0 92.0 96.0 NaN
89 71d7b9 99.0 53.0 76.0 NaN
127 5ba892 98.0 97.0 93.0 NaN
43 9f0216 94.0 79.0 89.0 NaN
noisy_letter_grade is_label_issue label_score given_label \
70 F True 0.000537 F
106 F True 0.001853 F
46 F True 0.002121 F
103 D True 0.003628 D
97 D True 0.004006 D
92 A True 0.004031 A
72 D True 0.007930 D
22 B True 0.013226 B
3 D True 0.015255 D
69 B True 0.017692 B
45 D True 0.019767 D
98 B True 0.036197 B
109 D True 0.054746 D
124 C True 0.055110 C
20 D True 0.062675 D
83 C True 0.112695 C
120 B True 0.121059 B
29 B True 0.171280 B
63 D True 0.181689 D
23 B True 0.208001 B
84 B True 0.275028 B
10 A True 0.346032 A
89 C True 0.396350 C
127 A True 0.401493 A
43 B True 0.474349 B
predicted_label
70 A
106 A
46 A
103 F
97 B
92 C
72 A
22 A
3 B
69 D
45 B
98 D
109 B
124 B
20 B
83 A
120 A
29 D
63 B
23 D
84 D
10 F
89 D
127 F
43 D
The dataframe above shows the original label (given_label) for the examples that cleanlab finds most likely to be mislabeled, as well as an alternative predicted_label for each example. These examples have likely been labeled incorrectly and should be carefully re-examined. After manually inspecting the label issues above, we record the indices of the examples we want to remove from our test data.
Remember to inspect and manually handle the issues detected in your test data rather than correcting them automatically; otherwise you risk misleading model evaluations!
In this case, we manually found that the first 11 label issues with the lowest label_score correspond to real label errors. We’ll drop those data points from our test set, in order to curate a cleaner test set. Here we solely address mislabeled data for brevity, but you can similarly address other issues detected in your test data to ensure the most reliable model evaluation.
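For example, the outliers flagged in the test data could be reviewed the same way (a minimal sketch; whether to drop any of these rows is a judgment call you should make by inspecting them manually):
# Sketch: review other flagged issue types in the test data before deciding on manual corrections.
test_outlier_results = test_lab.get_issues("outlier")
df_test.join(test_outlier_results).query("is_outlier_issue").sort_values("outlier_score").head()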
[20]:
indices_to_drop_from_test_data = test_label_issues_ordered.index[:11] # found by manually inspecting test_label_issues_ordered
[21]:
df_test_cleaned = df_test.drop(indices_to_drop_from_test_data, axis=0).reset_index(drop=True)
test_features = test_features.drop(indices_to_drop_from_test_data, axis=0).reset_index(drop=True)
test_labels = test_labels.drop(indices_to_drop_from_test_data, axis=0).reset_index(drop=True)
Use clean test data to evaluate the performance of the model trained on noisy training data#
[22]:
preds = clf.predict(test_features)
acc_original = accuracy_score(test_labels.astype(int), preds.astype(int))
print(
f"Accuracy of model fit to noisy training data, measured on clean test data: {round(acc_original*100,1)}%"
)
Accuracy of model fit to noisy training data, measured on clean test data: 78.0%
Although curating clean test data does not directly help train a better ML model, more reliable model evaluation can improve your overall ML project. For instance, clean test data enables better-informed decisions about when to deploy a model, as well as better model/hyperparameter selection. While manually curating data can be tedious, Cleanlab Studio offers data correction interfaces to streamline this work.
6. Check for issues in training data and algorithmically correct them#
To run Datalab on our training set, we first compute out-of-sample predicted probabilities for our training data (via cross-validation).
[23]:
from sklearn.model_selection import cross_val_predict
num_crossval_folds = 5
pred_probs = cross_val_predict(
clf,
train_features,
train_labels,
cv=num_crossval_folds,
method="predict_proba",
)
Based on these ML model outputs, we similarly run Datalab to detect issues in our training data.
[24]:
train_features_array = train_features.to_numpy() # could alternatively be model embeddings
train_lab = Datalab(data=df_train, label_name="noisy_letter_grade", task="classification")
train_lab.find_issues(features=train_features_array, pred_probs=pred_probs)
train_lab.report(show_summary_score=True, show_all_issues=True)
Finding null issues ...
Finding label issues ...
Finding outlier issues ...
Finding near_duplicate issues ...
Finding non_iid issues ...
Finding class_imbalance issues ...
Finding underperforming_group issues ...
Audit complete. 318 issues found in the dataset.
Dataset Information: num_examples: 601, num_classes: 5
Here is a summary of various issues found in your data:
issue_type score num_issues
label 0.740433 175
outlier 0.344154 72
near_duplicate 0.588290 71
null 1.000000 0
non_iid 0.437267 0
class_imbalance 0.146423 0
underperforming_group 0.977223 0
(Note: A lower score indicates a more severe issue across all examples in the dataset.)
Learn about each issue: https://docs.cleanlab.ai/stable/cleanlab/datalab/guide/issue_type_description.html
See which examples in your dataset exhibit each issue via: `datalab.get_issues(<ISSUE_NAME>)`
Data indices corresponding to top examples of each issue are shown below.
----------------------- label issues -----------------------
About this issue:
Examples whose given label is estimated to be potentially incorrect
(e.g. due to annotation error) are flagged as having label issues.
Number of examples with this issue: 175
Overall dataset quality in terms of this issue: 0.7404
Examples representing most severe instances of this issue:
is_label_issue label_score given_label predicted_label
162 True 0.000072 F A
348 True 0.000161 B D
232 True 0.000256 F B
205 True 0.000458 F A
400 True 0.000738 C D
---------------------- outlier issues ----------------------
About this issue:
Examples that are very different from the rest of the dataset
(i.e. potentially out-of-distribution or rare/anomalous instances).
Number of examples with this issue: 72
Overall dataset quality in terms of this issue: 0.3442
Examples representing most severe instances of this issue:
is_outlier_issue outlier_score
588 True 2.358961e-46
336 True 2.490911e-36
269 True 3.122475e-28
321 True 5.374139e-22
311 True 1.358617e-17
------------------ near_duplicate issues -------------------
About this issue:
A (near) duplicate issue refers to two or more examples in
a dataset that are extremely similar to each other, relative
to the rest of the dataset. The examples flagged with this issue
may be exactly duplicated, or lie atypically close together when
represented as vectors (i.e. feature embeddings).
Number of examples with this issue: 71
Overall dataset quality in terms of this issue: 0.5883
Examples representing most severe instances of this issue:
is_near_duplicate_issue near_duplicate_score near_duplicate_sets distance_to_nearest_neighbor
600 True 0.0 [592, 593, 594, 595, 596, 597, 598, 599] 0.000000e+00
221 True 0.0 [500] 0.000000e+00
222 True 0.0 [315, 332] 7.791060e-09
243 True 0.0 [540] 2.379106e-09
599 True 0.0 [592, 593, 594, 595, 596, 597, 598, 600] 0.000000e+00
----------------------- null issues ------------------------
About this issue:
Examples identified with the null issue correspond to rows that have null/missing values across all feature columns (i.e. the entire row is missing values).
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 1.0000
Examples representing most severe instances of this issue:
is_null_issue null_score
0 False 1.0
396 False 1.0
397 False 1.0
398 False 1.0
399 False 1.0
---------------------- non_iid issues ----------------------
About this issue:
Whether the dataset exhibits statistically significant
violations of the IID assumption like:
changepoints or shift, drift, autocorrelation, etc.
The specific violation considered is whether the
examples are ordered such that almost adjacent examples
tend to have more similar feature values.
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 0.4373
Examples representing most severe instances of this issue:
is_non_iid_issue non_iid_score
165 False 0.550374
598 False 0.627357
599 False 0.627496
597 False 0.627502
600 False 0.627919
Additional Information:
p-value: 0.43726734378061227
------------------ class_imbalance issues ------------------
About this issue:
Examples belonging to the most under-represented class in the dataset.
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 0.1464
Examples representing most severe instances of this issue:
is_class_imbalance_issue class_imbalance_score given_label
321 False 0.146423 F
112 False 0.146423 F
506 False 0.146423 F
393 False 0.146423 F
508 False 0.146423 F
Additional Information:
Rarest Class: NA
--------------- underperforming_group issues ---------------
About this issue:
An underperforming group refers to a cluster of similar examples
(i.e. a slice) in the dataset for which the ML model predictions
are particularly poor (loss evaluation over this subpopulation is high).
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 0.9772
Examples representing most severe instances of this issue:
is_underperforming_group_issue underperforming_group_score
0 False 0.977223
402 False 0.977223
401 False 0.977223
400 False 0.977223
399 False 0.977223
Now, instead of manually inspecting the detected issues in our training data, we will automatically filter out of the training set all data points that cleanlab has flagged as likely mislabeled, outliers, or near duplicates. Unlike the test data, which cannot be blindly auto-curated because we must ensure reliable model evaluation, the training data can be modified more aggressively, as long as we can still faithfully evaluate the resulting fitted model.
[25]:
label_issue_results = train_lab.get_issues("label")
label_issues_idx = label_issue_results[label_issue_results["is_label_issue"] == True].index
label_issues_idx
[25]:
Index([ 2, 7, 12, 21, 23, 25, 26, 29, 32, 33,
...
566, 568, 571, 572, 574, 576, 578, 585, 587, 590],
dtype='int64', length=175)
[26]:
near_duplicates = train_lab.get_issues("near_duplicate")
near_duplicates_idx = near_duplicates[near_duplicates["is_near_duplicate_issue"] == True].index
near_duplicates_idx
[26]:
Index([ 19, 29, 41, 43, 71, 83, 85, 88, 101, 106, 117, 122, 146, 155,
156, 173, 187, 196, 221, 222, 224, 243, 252, 272, 277, 279, 288, 292,
300, 315, 329, 332, 342, 352, 363, 365, 366, 384, 388, 393, 394, 397,
404, 431, 436, 474, 480, 494, 500, 506, 508, 515, 516, 536, 537, 539,
540, 542, 559, 575, 576, 582, 592, 593, 594, 595, 596, 597, 598, 599,
600],
dtype='int64')
[27]:
outliers = train_lab.get_issues("outlier")
outliers_idx = outliers[outliers["is_outlier_issue"] == True].index
outliers_idx
[27]:
Index([ 0, 1, 3, 7, 26, 46, 52, 77, 89, 99, 101, 131, 132, 143,
153, 155, 159, 163, 193, 194, 195, 199, 208, 212, 240, 241, 242, 247,
256, 269, 287, 295, 299, 307, 311, 313, 321, 330, 336, 337, 340, 350,
361, 378, 379, 388, 392, 419, 432, 444, 476, 479, 484, 485, 489, 492,
504, 510, 511, 522, 523, 535, 543, 546, 547, 567, 571, 578, 579, 585,
588, 591],
dtype='int64')
[28]:
idx_to_drop = list(set(list(label_issues_idx) + list(near_duplicates_idx) + list(outliers_idx)))
len(idx_to_drop)
[28]:
276
[29]:
df_train_curated = df_train.drop(idx_to_drop, axis=0).reset_index(drop=True)
train_features = train_features.drop(idx_to_drop, axis=0).reset_index(drop=True)
train_labels = train_labels.drop(idx_to_drop, axis=0).reset_index(drop=True)
7. Train model on cleaned training data#
[30]:
clean_clf = XGBClassifier(tree_method="hist", enable_categorical=True, random_state=SEED)
clean_clf.fit(train_features, train_labels)
[30]:
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, device=None, early_stopping_rounds=None, enable_categorical=True, eval_metric=None, feature_types=None, gamma=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=None, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=None, n_jobs=None, num_parallel_tree=None, objective='multi:softprob', ...)
Use clean test data to evaluate the performance of the model trained on cleaned training data#
[31]:
clean_preds = clean_clf.predict(test_features)
acc_clean = accuracy_score(test_labels.astype(int), clean_preds.astype(int))
print(
f"Accuracy of model fit to clean training data, measured on clean test data: {round(acc_clean*100,1)}%"
)
Accuracy of model fit to clean training data, measured on clean test data: 78.9%
Although this simple data filtering may not be the most effective way to curate the training set (particularly if the initial ML model was poor-quality and hence the detected issues are inaccurate), we can at least faithfully assess its effect using our clean test data. In this case, we do see that the resulting ML model has improved, even with this simple training data filtering.
8. Identifying better training data curation strategies via hyperparameter optimization techniques#
Thus far, we’ve seen how to detect issues in the training and test data to improve model training and evaluation. While we should manually curate the test data to ensure faithful evaluation, we are free to algorithmically curate the training data. Since the simple filtering strategy above is not necessarily optimal, here we consider how to identify a better algorithmic curation strategy. Note, however, that the best strategy will be a hybrid of automated and manual data corrections, which you can carry out efficiently via the data correction interface in Cleanlab Studio.
Above we made basic training data edits to improve test performance; each of these data edits can be quantitatively parameterized (e.g. what fraction of each issue type to filter from the dataset). We can use (hyper)parameter-tuning techniques to automatically search for combinations of training data edits that result in particularly accurate models. For brevity, here we apply this hyperparameter optimization to maximize test-set performance, but in practice you should use a separate validation set (which you can curate similarly to the test data in this tutorial, in order to ensure reliable model evaluations).
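For instance, you could hold out part of the curated data for this search (a minimal sketch under the assumption that you are willing to split the cleaned test set; the variable names here are hypothetical):
# Hypothetical split: tune the data-edit parameters against a held-out validation set and
# reserve the remaining cleaned data for a single final evaluation.
from sklearn.model_selection import train_test_split

val_features, holdout_features, val_labels, holdout_labels = train_test_split(
    test_features, test_labels, test_size=0.5, random_state=SEED
)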
We define a dict to parameterize our dataset changes:
[32]:
default_edit_params = {
"drop_label_issue": 0.5,
"drop_outlier": 0.5,
"drop_near_duplicate": 0.2,
}
These example values translate into the following training data edits:
drop_label_issue: We filter out the top 50% of the data points flagged with label issues (those with the most severe label scores).
drop_outlier: We filter out the top 50% most severe outliers, based on outlier score (amongst the set of flagged outliers).
drop_near_duplicate: We drop extra copies of the top 20% of near duplicates (based on near duplicate score), always keeping at least one data point from each near-duplicate set.
We will search over various values for these parameters, fit a model to each corresponding training dataset edited based on the parameter values, and see which combination of values yields the best model.
Note: Datalab detects other issue types that could also be considered in this algorithmic data curation.
To more easily apply candidate training data edits, we first sort our data points flagged with each issue type based on the corresponding severity score:
[33]:
label_issues = train_lab.get_issues("label").query("is_label_issue").sort_values("label_score")
near_duplicates = train_lab.get_issues("near_duplicate").query("is_near_duplicate_issue").sort_values("near_duplicate_score")
outliers = train_lab.get_issues("outlier").query("is_outlier_issue").sort_values("outlier_score")
We introduce an edit_data function that implements candidate training data edits; we then fit a model to each edited training set and evaluate it on our cleaned test data (you can skip these implementation details).
See the implementation of edit_data (click to expand)
# Note: This pulldown content is for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.
def edit_data(train_features, train_labels, label_issues, near_duplicates, outliers,
              drop_label_issue, drop_near_duplicate, drop_outlier):
    """
    Edits the training data by dropping a specified percentage of data points identified as label issues,
    near duplicates, and outliers based on the full datasets provided for each issue type.

    Args:
        train_features (pd.DataFrame): DataFrame containing the training features.
        train_labels (pd.Series): Series containing the training labels.
        label_issues (pd.DataFrame): DataFrame containing data points with label issues.
        near_duplicates (pd.DataFrame): DataFrame containing data points identified as near duplicates.
        outliers (pd.DataFrame): DataFrame containing data points identified as outliers.
        drop_label_issue (float): Percentage of label issue data points to drop.
        drop_near_duplicate (float): Percentage of near duplicate data points to drop.
        drop_outlier (float): Percentage of outlier data points to drop.

    Returns:
        pd.DataFrame: The cleaned training features.
        pd.Series: The cleaned training labels.
    """
    # Extract indices for each type of issue
    label_issues_idx = label_issues.index.tolist()
    near_duplicates_idx = near_duplicates.index.tolist()
    outliers_idx = outliers.index.tolist()

    # Calculate the number of each type of data point to drop except near duplicates, which requires separate logic
    num_label_issues_to_drop = int(len(label_issues_idx) * drop_label_issue)
    num_outliers_to_drop = int(len(outliers_idx) * drop_outlier)

    # Calculate number of near duplicates to drop
    # Assuming the 'near_duplicate_sets' are lists of indices (integers) of near duplicates
    clusters = []
    for i in near_duplicates_idx:
        # Create a set for each cluster, add the current index to its near duplicate set
        cluster = set(near_duplicates.at[i, 'near_duplicate_sets'])
        cluster.add(i)
        clusters.append(cluster)
    # Deduplicate clusters by converting the list of sets to a set of frozensets
    unique_clusters = set(frozenset(cluster) for cluster in clusters)
    # If you need the unique clusters back in list of lists format:
    unique_clusters_list = [list(cluster) for cluster in unique_clusters]

    near_duplicates_idx_to_drop = []
    for cluster in unique_clusters_list:
        # Calculate the number of rows to drop, ensuring at least one datapoint remains
        n_drop = max(math.ceil(len(cluster) * drop_near_duplicate), 1)  # Drop at least k% or 1 row
        if len(cluster) > n_drop:  # Ensure we keep at least one datapoint
            # Randomly select datapoints to drop
            drops = random.sample(cluster, n_drop)
        else:
            # If the cluster is too small, adjust the number to keep at least one datapoint
            drops = random.sample(cluster, len(cluster) - 1)  # Keep at least one
        near_duplicates_idx_to_drop.extend(drops)

    # Determine the specific indices to drop
    label_issues_idx_to_drop = label_issues_idx[:num_label_issues_to_drop]
    outliers_idx_to_drop = outliers_idx[:num_outliers_to_drop]

    # Combine the indices to drop
    idx_to_drop = list(set(label_issues_idx_to_drop + near_duplicates_idx_to_drop + outliers_idx_to_drop))

    # Drop the rows from the training data
    train_features_cleaned = train_features.drop(idx_to_drop).reset_index(drop=True)
    train_labels_cleaned = train_labels.drop(idx_to_drop).reset_index(drop=True)

    return train_features_cleaned, train_labels_cleaned
[35]:
from itertools import product
# List of possible values for each data edit parameter to search over (finer grid will yield better results but longer runtimes)
param_grid = {
'drop_label_issue': [0.2, 0.5, 0.7, 1.0],
'drop_near_duplicate': [0.0, 0.2, 0.5],
'drop_outlier': [0.2, 0.5, 0.7],
}
# Generate all combinations of parameters
param_combinations = list(product(param_grid['drop_label_issue'], param_grid['drop_near_duplicate'], param_grid['drop_outlier']))
[36]:
best_score = 0
best_params = None
for drop_label_issue, drop_near_duplicate, drop_outlier in param_combinations:
    # Preprocess the data for the current combination of parameters
    train_features_preprocessed, train_labels_preprocessed = edit_data(
        train_features_v2, train_labels_v2, label_issues, near_duplicates, outliers,
        drop_label_issue, drop_near_duplicate, drop_outlier)

    # Train and evaluate the model
    model = XGBClassifier(tree_method="hist", enable_categorical=True, random_state=SEED)
    model.fit(train_features_preprocessed, train_labels_preprocessed)
    predictions = model.predict(test_features)
    accuracy = accuracy_score(test_labels.astype(int), predictions.astype(int))

    # Update the best score and parameters if the current model is better
    if accuracy > best_score:
        best_score = accuracy
        best_params = {'drop_label_issue': drop_label_issue, 'drop_near_duplicate': drop_near_duplicate, 'drop_outlier': drop_outlier}

# Print the best parameters and score
print(f"Best parameters found in search: {best_params}")
Best parameters found in search: {'drop_label_issue': 0.5, 'drop_near_duplicate': 0.0, 'drop_outlier': 0.7}
[37]:
print(
f"Accuracy of model fit to optimally cleaned training data, measured on clean test data: {round(best_score*100,1)}%"
)
Accuracy of model fit to optimally cleaned training data, measured on clean test data: 82.1%
9. Conclusion#
This tutorial demonstrated how you can properly use cleanlab to improve your own ML model. When dealing with noisy data, you should first manually curate your test data to ensure reliable model evaluation. After that, you can algorithmically curate your training data. We demonstrated a simple hyperparameter tuning technique to identify effective training data edits that produce an accurate model, as well as how cleanlab can help catch fundamental problems in the overall train/test setup, like duplicates/leakage and data drift.
Note that we never evaluated different models with different test set versions (which does not yield meaningful comparisons). We curated the test data to be as high-quality as possible and then based all model evaluations on this fixed version of the test data.
For brevity, this tutorial focused mostly on label issues and data pruning strategies. For classification tasks where you already have high-quality test data and solely want to handle label errors in your training data, cleanlab’s CleanLearning class offers an alternative convenience method to train a robust ML model. You can achieve better results by considering additional data issues beyond label errors, and curation strategies like fixing incorrect values; this is all streamlined via the intelligent data correction interface of Cleanlab Studio.
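As a rough reference, here is a minimal sketch of that CleanLearning workflow (it assumes the preprocessed, integer-encoded features and labels from this tutorial; consult the CleanLearning documentation for the exact options available):
# Minimal sketch: wrap a scikit-learn-compatible classifier with CleanLearning to train
# a model that detects and handles label errors in its training data.
from cleanlab.classification import CleanLearning
from xgboost import XGBClassifier

cl = CleanLearning(XGBClassifier(tree_method="hist", random_state=SEED))
cl.fit(train_features.to_numpy(), train_labels.to_numpy())  # labels must be integer-encoded
robust_preds = cl.predict(test_features.to_numpy())  # evaluate these on your clean test data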