Improving ML Performance via Data Curation with Train vs Test Splits#
In typical Machine Learning projects, we split our dataset into training data for fitting models and test data to evaluate model performance. For noisy real-world datasets, detecting/correcting errors in the training data is important to train robust models, but it’s less recognized that the test set can also be noisy. For accurate model evaluation, it is vital to find and fix issues in the test data as well. Some evaluation metrics are particularly sensitive to outliers and noisy labels. This tutorial demonstrates a way to use cleanlab (via Datalab) to curate both your training and test data, ensuring robust model training and reliable performance evaluation. We recommend first completing some Datalab tutorials before diving into this more complex subject.
Here’s how we recommend handling noisy training and test data (this tutorial walks through these steps):
Preprocess your training and test data to be suitable for ML. Use cleanlab to check for fundamental train/test setup problems in the merged dataset like train/test leakage or drift.
Fit your ML model to your noisy training data and get its predictions/embeddings for your test data. Use these model outputs with cleanlab to detect issues in your test data.
Manually review/correct cleanlab-detected issues in your test data. We caution against blindly automated correction of test data. Changes to your test set should be carefully verified to ensure they will lead to more accurate model evaluation. We also caution against comparing the performance of different ML models across different versions of your test data; performance comparisons between models should be based on the same test data.
Cross-validate a new copy of your ML model on your training data, and then use it with cleanlab to detect issues in the training dataset. Do not include test data in any part of this step to avoid leaking test set information into the training data curation.
You can try automated techniques to curate your training data based on cleanlab results, train models on the curated training data, and evaluate them on the cleaned test data.
Consider this tutorial as a blueprint for using cleanlab in diverse ML projects spanning various data modalities. The same ideas apply if you substitute test data with validation data above. In a final advanced section of this tutorial, we show how training data edits can be parameterized in terms of cleanlab’s detected issues, such that hyperparameter optimization can identify the optimal combination of data edits for training an effective ML model.
Note: This tutorial trains an XGBoost model on a tabular dataset, but the same approach applies to any ML model and data modality.
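For instance, any classifier following the scikit-learn fit/predict_proba interface could stand in for XGBoost below. Here is a minimal sketch (the specific model shown is just an illustration, not part of the original tutorial):
# Hypothetical substitution: any scikit-learn-compatible classifier with predict_proba
# can replace XGBClassifier in the steps that follow.
from sklearn.ensemble import HistGradientBoostingClassifier

clf = HistGradientBoostingClassifier(random_state=123456)
# clf.fit(train_features, train_labels) and clf.predict_proba(test_features) as in the XGBoost cells below.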
Why did you make this tutorial?#
TLDR: Reliable ML requires both reliable training and reliable evaluation. This tutorial shows you how to achieve both using cleanlab.
Longer answer: Many users wish to use cleanlab to improve their ML model by improving their data, but make subtle mistakes. This multi-step tutorial shows one way to do this properly. Some users curate (e.g. fix label issues in) their training data, train an ML model, and evaluate it on test data. But they see no improvement in test-set accuracy, because they have introduced distribution shift by altering their training data. If the test data also has issues, those must be fixed too for a faithful model evaluation. Other users therefore curate their test data as well, but some blindly auto-fix their test data, which is dangerous! The cleanlab package is based on ML and is thus inevitably imperfect. Issues that cleanlab detects in test data should not be blindly auto-fixed; this risks making model evaluation wrong. Instead we recommend the multi-step workflow above, where less algorithmic/automated correction is applied to the test data than to the training data (focus your manual efforts on curating test rather than training data).
1. Install dependencies#
Datalab has additional dependencies that are not included in the standard installation of cleanlab. You can use pip to install all packages required for this tutorial as follows:
!pip install xgboost
!pip install "cleanlab[datalab]"
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
# !pip install git+https://github.com/cleanlab/cleanlab.git
[2]:
import random
import os
import math
import numpy as np
from xgboost import XGBClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import pandas as pd
import cleanlab
from cleanlab import Datalab
SEED = 123456 # for reproducibility
np.random.seed(SEED)
random.seed(SEED)
2. Preprocess the data#
This tutorial considers a classification task with structured/tabular data. The ML task is to predict each student’s final grade in a course (class label) based on various numeric/categorical features about them (exam scores and notes).
[3]:
df_train = pd.read_csv(
"https://cleanlab-public.s3.amazonaws.com/Datasets/student-grades/clos_train_data.csv"
)
df_test = pd.read_csv(
"https://cleanlab-public.s3.amazonaws.com/Datasets/student-grades/clos_test_data.csv"
)
df_train.head()
[3]:
 | stud_ID | exam_1 | exam_2 | exam_3 | notes | noisy_letter_grade |
---|---|---|---|---|---|---|
0 | 018bff | 94.0 | 41.0 | 91.0 | great participation +10 | B |
1 | 076d92 | 0.0 | 79.0 | 65.0 | cheated on exam, gets 0pts | F |
2 | c80059 | 86.0 | 89.0 | 85.0 | great final presentation +10 | F |
3 | e38f8a | 50.0 | 67.0 | 94.0 | great final presentation +10 | B |
4 | d57e1a | 92.0 | 79.0 | 98.0 | great final presentation +10 | A |
Before training an ML model, we preprocess our dataset. The best type of preprocessing depends on which ML model you use. This tutorial demonstrates an XGBoost model, so we’ll encode the notes and noisy_letter_grade columns as categorical columns for this model (each category encoded as an integer). You can alternatively use Cleanlab Studio, which will automatically produce a high-accuracy ML model for your raw data, without you having to worry about any ML modeling or data preprocessing work.
[4]:
# Create label encoders for the categorical columns
grade_le = preprocessing.LabelEncoder()
notes_le = preprocessing.LabelEncoder()
# Process the feature columns
train_features = df_train.drop(["stud_ID", "noisy_letter_grade"], axis=1).copy()
train_features["notes"] = notes_le.fit_transform(train_features["notes"])
train_features["notes"] = train_features["notes"].astype("category")
# Process the label column
train_labels = pd.DataFrame(grade_le.fit_transform(df_train["noisy_letter_grade"].copy()), columns=["noisy_letter_grade"])
# Keep separate copies of these training features and labels for later use
train_features_v2 = train_features.copy()
train_labels_v2 = train_labels.copy()
We fit the preprocessing on the training data alone to avoid information leakage (i.e. using test-set information that would not be available at prediction time). Here’s how the preprocessed training features look:
[5]:
train_features.head()
[5]:
 | exam_1 | exam_2 | exam_3 | notes |
---|---|---|---|---|
0 | 94.0 | 41.0 | 91.0 | 2 |
1 | 0.0 | 79.0 | 65.0 | 0 |
2 | 86.0 | 89.0 | 85.0 | 1 |
3 | 50.0 | 67.0 | 94.0 | 1 |
4 | 92.0 | 79.0 | 98.0 | 1 |
We apply the same preprocessing to the test data.
[6]:
test_features = df_test.drop(
["stud_ID", "noisy_letter_grade"], axis=1
).copy()
test_features["notes"] = notes_le.transform(test_features["notes"])
test_features["notes"] = test_features["notes"].astype("category")
test_labels = pd.DataFrame(grade_le.transform(df_test["noisy_letter_grade"].copy()), columns=["noisy_letter_grade"])
We then appropriately format the datasets for the ML model used in this tutorial.
[7]:
train_labels = train_labels.astype('object')
test_labels = test_labels.astype('object')
train_features["notes"] = train_features["notes"].astype(int)
test_features["notes"] = test_features["notes"].astype(int)
preprocessed_train_data = pd.concat([train_features, train_labels], axis=1)
preprocessed_train_data["stud_ID"] = df_train["stud_ID"]
preprocessed_test_data = pd.concat([test_features, test_labels], axis=1)
preprocessed_test_data["stud_ID"] = df_test["stud_ID"]
3. Check for fundamental problems in the train/test setup#
Before training any ML model, we can quickly check for fundamental issues in our setup with cleanlab. To audit all of our data at once, we merge the training and test sets into one dataset, from which we construct a Datalab object. Datalab automatically detects many types of common issues in a dataset, but requires a trained ML model for a comprehensive audit. We haven’t trained any model yet, so here we instruct Datalab to only check for specific data issues: near duplicates, and whether the data appears non-IID (violations of the IID assumption include data drift or a lack of statistical independence between data points).
Datalab can detect many additional types of data issues, depending on what inputs it is given. Below we provide the numeric features from features_df as the sole input to Datalab.find_issues(). If you have heterogeneous/complex data types (e.g. text or images), you could instead provide vector feature representations (e.g. pretrained model embeddings) of your data as the features.
[8]:
full_df = pd.concat([preprocessed_train_data, preprocessed_test_data], axis=0).reset_index(drop=True)
features_df = full_df.drop(["noisy_letter_grade", "stud_ID"], axis=1) # can instead use model embeddings
[9]:
lab = Datalab(data=full_df, label_name="noisy_letter_grade", task="classification")
lab.find_issues(features=features_df.to_numpy(), issue_types={"near_duplicate": {}, "non_iid": {}})
lab.report(show_summary_score=True, show_all_issues=True)
Finding near_duplicate issues ...
Finding non_iid issues ...
Audit complete. 100 issues found in the dataset.
Dataset Information: num_examples: 749, num_classes: 5
Here is a summary of various issues found in your data:
issue_type score num_issues
near_duplicate 0.583745 100
non_iid 0.291382 0
(Note: A lower score indicates a more severe issue across all examples in the dataset.)
Learn about each issue: https://docs.cleanlab.ai/stable/cleanlab/datalab/guide/issue_type_description.html
See which examples in your dataset exhibit each issue via: `datalab.get_issues(<ISSUE_NAME>)`
Data indices corresponding to top examples of each issue are shown below.
------------------ near_duplicate issues -------------------
About this issue:
A (near) duplicate issue refers to two or more examples in
a dataset that are extremely similar to each other, relative
to the rest of the dataset. The examples flagged with this issue
may be exactly duplicated, or lie atypically close together when
represented as vectors (i.e. feature embeddings).
Number of examples with this issue: 100
Overall dataset quality in terms of this issue: 0.5837
Examples representing most severe instances of this issue:
is_near_duplicate_issue near_duplicate_score near_duplicate_sets distance_to_nearest_neighbor
748 True 0.0 [604] 0.0
510 True 0.0 [227] 0.0
71 True 0.0 [719] 0.0
65 True 0.0 [690, 444] 0.0
547 True 0.0 [647] 0.0
---------------------- non_iid issues ----------------------
About this issue:
Whether the dataset exhibits statistically significant
violations of the IID assumption like:
changepoints or shift, drift, autocorrelation, etc.
The specific violation considered is whether the
examples are ordered such that almost adjacent examples
tend to have more similar feature values.
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 0.2914
Examples representing most severe instances of this issue:
is_non_iid_issue non_iid_score
611 False 0.687869
610 False 0.687883
612 False 0.688146
609 False 0.688189
613 False 0.688713
Additional Information:
p-value: 0.2913818469137725
cleanlab does not find significant evidence that our data is non-IID, which is good. Otherwise, we’d need to further consider where our data came from and whether conclusions/predictions from this dataset can really generalize to our population of interest.
But cleanlab did detect many near duplicates in the dataset. We see some exact duplicates between our training and test data, which may indicate data leakage! Since we didn’t expect these duplicates in our dataset, let’s drop the training-set copies of data points that also appear in the test set. This helps ensure that our model evaluations reflect generalization capabilities. Here’s how we can review the near duplicates detected via Datalab.
[10]:
full_duplicate_results = lab.get_issues("near_duplicate")
full_duplicate_results.sort_values("near_duplicate_score").head()
[10]:
 | is_near_duplicate_issue | near_duplicate_score | near_duplicate_sets | distance_to_nearest_neighbor |
---|---|---|---|---|
748 | True | 0.0 | [604] | 0.0 |
510 | True | 0.0 | [227] | 0.0 |
71 | True | 0.0 | [719] | 0.0 |
65 | True | 0.0 | [690, 444] | 0.0 |
547 | True | 0.0 | [647] | 0.0 |
To distinguish between near vs. exact duplicates, we can filter where the distance_to_nearest_neighbor column has value = 0. We specifically filter for exact duplicates between our training and test set in order to drop the extra copies of such data points from our training set.
[11]:
train_idx_cutoff = len(preprocessed_train_data) - 1 # last index of training data in the merged dataset
# Create column to list which duplicate sets include some test data:
full_duplicate_results['nd_set_has_index_over_training_cutoff'] = full_duplicate_results['near_duplicate_sets'].apply(lambda x: any(i > train_idx_cutoff for i in x))
exact_duplicates = full_duplicate_results.query('is_near_duplicate_issue == True and near_duplicate_score == 0.0 and nd_set_has_index_over_training_cutoff == True').sort_values("near_duplicate_score")
exact_duplicates
[11]:
 | is_near_duplicate_issue | near_duplicate_score | near_duplicate_sets | distance_to_nearest_neighbor | nd_set_has_index_over_training_cutoff |
---|---|---|---|---|---|
33 | True | 0.0 | [627] | 0.0 | True |
53 | True | 0.0 | [678] | 0.0 | True |
65 | True | 0.0 | [690, 444] | 0.0 | True |
71 | True | 0.0 | [719] | 0.0 | True |
82 | True | 0.0 | [709] | 0.0 | True |
100 | True | 0.0 | [615] | 0.0 | True |
292 | True | 0.0 | [620] | 0.0 | True |
420 | True | 0.0 | [704] | 0.0 | True |
431 | True | 0.0 | [688] | 0.0 | True |
459 | True | 0.0 | [672] | 0.0 | True |
547 | True | 0.0 | [647] | 0.0 | True |
564 | True | 0.0 | [696] | 0.0 | True |
604 | True | 0.0 | [748] | 0.0 | True |
605 | True | 0.0 | [723] | 0.0 | True |
[12]:
exact_duplicates_indices = exact_duplicates.index
exact_duplicates_indices
[12]:
Index([33, 53, 65, 71, 82, 100, 292, 420, 431, 459, 547, 564, 604, 605], dtype='int64')
Below we remove the exact duplicates that occur between our training and test sets from the training data.
[13]:
indices_of_duplicates_to_drop = [idx for idx in exact_duplicates_indices if idx <= train_idx_cutoff]
indices_of_duplicates_to_drop
[13]:
[33, 53, 65, 71, 82, 100, 292, 420, 431, 459, 547, 564, 604, 605]
Here are the examples we’ll drop from our training data, since they are exact duplicates of test examples.
[14]:
full_df.iloc[indices_of_duplicates_to_drop]
[14]:
 | exam_1 | exam_2 | exam_3 | notes | noisy_letter_grade | stud_ID |
---|---|---|---|---|---|---|
33 | 83.0 | 92.0 | 80.0 | 3 | 2 | 4a3f75 |
53 | 91.0 | 0.0 | 94.0 | 0 | 3 | d030b5 |
65 | 93.0 | 73.0 | 82.0 | 5 | 1 | ddd0ba |
71 | 90.0 | 95.0 | 75.0 | 1 | 0 | 8e6d24 |
82 | 78.0 | 81.0 | 74.0 | 4 | 3 | 464aab |
100 | 80.0 | 96.0 | 83.0 | 4 | 2 | ee3387 |
292 | 79.0 | 62.0 | 82.0 | 5 | 2 | 61e807 |
420 | 99.0 | 53.0 | 76.0 | 5 | 2 | 71d7b9 |
431 | 90.0 | 92.0 | 88.0 | 2 | 0 | 83e31f |
459 | 70.0 | 63.0 | 95.0 | 2 | 1 | edeb53 |
547 | 68.0 | 93.0 | 73.0 | 5 | 2 | cd52b5 |
564 | 84.0 | 92.0 | 86.0 | 5 | 1 | 454e51 |
604 | 87.0 | 74.0 | 95.0 | 3 | 2 | 042686 |
605 | 96.0 | 83.0 | 73.0 | 1 | 0 | 12a73f |
[15]:
df_train = df_train.drop(indices_of_duplicates_to_drop, axis=0).reset_index(drop=True)
train_features = train_features.drop(indices_of_duplicates_to_drop, axis=0).reset_index(drop=True)
train_labels = train_labels.drop(indices_of_duplicates_to_drop, axis=0).reset_index(drop=True).astype(int)
4. Train model with original (noisy) training data#
After handling fundamental issues in our training/test setup, let’s fit our ML model to the training data. Here we use XGBoost as an example, but the same ideas of this tutorial apply to any other ML model.
[16]:
train_labels = train_labels["noisy_letter_grade"]
clf = XGBClassifier(tree_method="hist", enable_categorical=True, random_state=SEED)
clf.fit(train_features, train_labels)
[16]:
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, device=None, early_stopping_rounds=None, enable_categorical=True, eval_metric=None, feature_types=None, gamma=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=None, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=None, n_jobs=None, num_parallel_tree=None, objective='multi:softprob', ...)
Compute out-of-sample predicted probabilities for the test data from this baseline model#
Make sure that the columns of your predicted class probabilities are properly ordered with respect to the ordering of classes, which for Datalab is: lexicographically sorted by class name.
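If you are unsure whether this holds for your model, here is a quick sanity check (a minimal sketch, not part of the original notebook; it assumes the classifier exposes a classes_ attribute, as scikit-learn-compatible models like XGBClassifier do):
# Sanity-check sketch: Datalab expects pred_probs columns ordered by lexicographically sorted
# class names. Our labels are already integer-encoded (0, 1, 2, ...), so the fitted model's
# class order should already match; verify before passing pred_probs to Datalab.
import numpy as np
assert np.array_equal(clf.classes_, np.sort(clf.classes_)), "Reorder pred_probs columns to match sorted class order"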
[17]:
test_pred_probs = clf.predict_proba(test_features)
5. Check for issues in test data and manually address them#
While we could evaluate our model’s accuracy using the predictions above, this will be unreliable if the test data have issues. Based on the given labels, model predictions, and feature representations, Datalab can automatically detect issues lurking in our test data.
[18]:
test_lab = Datalab(data=df_test, label_name="noisy_letter_grade", task="classification")
test_features_array = test_features.to_numpy() # could alternatively be model embeddings
test_lab.find_issues(features=test_features_array, pred_probs=test_pred_probs)
test_lab.report(show_summary_score=True, show_all_issues=True)
Finding null issues ...
Finding label issues ...
Finding outlier issues ...
Finding near_duplicate issues ...
Finding non_iid issues ...
Finding class_imbalance issues ...
Finding underperforming_group issues ...
Audit complete. 30 issues found in the dataset.
Dataset Information: num_examples: 134, num_classes: 5
Here is a summary of various issues found in your data:
issue_type score num_issues
label 0.798507 25
outlier 0.370259 5
null 1.000000 0
near_duplicate 0.625352 0
non_iid 0.524042 0
class_imbalance 0.097015 0
underperforming_group 1.000000 0
(Note: A lower score indicates a more severe issue across all examples in the dataset.)
Learn about each issue: https://docs.cleanlab.ai/stable/cleanlab/datalab/guide/issue_type_description.html
See which examples in your dataset exhibit each issue via: `datalab.get_issues(<ISSUE_NAME>)`
Data indices corresponding to top examples of each issue are shown below.
----------------------- label issues -----------------------
About this issue:
Examples whose given label is estimated to be potentially incorrect
(e.g. due to annotation error) are flagged as having label issues.
Number of examples with this issue: 25
Overall dataset quality in terms of this issue: 0.7985
Examples representing most severe instances of this issue:
is_label_issue label_score given_label predicted_label
70 True 0.000537 F A
90 False 0.000903 F C
79 False 0.001743 F C
106 True 0.001853 F A
46 True 0.002121 F A
---------------------- outlier issues ----------------------
About this issue:
Examples that are very different from the rest of the dataset
(i.e. potentially out-of-distribution or rare/anomalous instances).
Number of examples with this issue: 5
Overall dataset quality in terms of this issue: 0.3703
Examples representing most severe instances of this issue:
is_outlier_issue outlier_score
63 True 4.752463e-99
89 True 3.784418e-09
40 True 5.477741e-06
57 True 1.134230e-05
32 True 7.153555e-03
----------------------- null issues ------------------------
About this issue:
Examples identified with the null issue correspond to rows that have null/missing values across all feature columns (i.e. the entire row is missing values).
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 1.0000
Examples representing most severe instances of this issue:
is_null_issue null_score
0 False 1.0
97 False 1.0
96 False 1.0
95 False 1.0
94 False 1.0
------------------ near_duplicate issues -------------------
About this issue:
A (near) duplicate issue refers to two or more examples in
a dataset that are extremely similar to each other, relative
to the rest of the dataset. The examples flagged with this issue
may be exactly duplicated, or lie atypically close together when
represented as vectors (i.e. feature embeddings).
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 0.6254
Examples representing most severe instances of this issue:
is_near_duplicate_issue near_duplicate_score near_duplicate_sets distance_to_nearest_neighbor
43 False 0.143272 [] 0.000016
93 False 0.143272 [] 0.000016
20 False 0.146501 [] 0.000016
83 False 0.146501 [] 0.000016
75 False 0.161431 [] 0.000018
---------------------- non_iid issues ----------------------
About this issue:
Whether the dataset exhibits statistically significant
violations of the IID assumption like:
changepoints or shift, drift, autocorrelation, etc.
The specific violation considered is whether the
examples are ordered such that almost adjacent examples
tend to have more similar feature values.
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 0.5240
Examples representing most severe instances of this issue:
is_non_iid_issue non_iid_score
12 False 0.765240
35 False 0.771221
28 False 0.801589
7 False 0.801652
112 False 0.810735
Additional Information:
p-value: 0.5240417899434826
------------------ class_imbalance issues ------------------
About this issue:
Examples belonging to the most under-represented class in the dataset.
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 0.0970
Examples representing most severe instances of this issue:
is_class_imbalance_issue class_imbalance_score given_label
88 False 0.097015 F
70 False 0.097015 F
2 False 0.097015 F
71 False 0.097015 F
46 False 0.097015 F
Additional Information:
Rarest Class: NA
--------------- underperforming_group issues ---------------
About this issue:
An underperforming group refers to a cluster of similar examples
(i.e. a slice) in the dataset for which the ML model predictions
are particularly poor (loss evaluation over this subpopulation is high).
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 1.0000
Examples representing most severe instances of this issue:
is_underperforming_group_issue underperforming_group_score
0 False 1.0
97 False 1.0
96 False 1.0
95 False 1.0
94 False 1.0
Datalab automatically audits our dataset for various common issues. The report above indicates many label issues in our data.
We can see which examples are estimated to be mislabeled (as well as a numeric quality score quantifying how likely their label is correct) via the get_issues() method. To review the most likely label errors, we sort our data by the label_score (a lower score represents that the label is less likely to be correct).
[19]:
test_label_issue_results = test_lab.get_issues("label")
test_label_issues_ordered = df_test.join(test_label_issue_results)
test_label_issues_ordered = test_label_issues_ordered[test_label_issue_results["is_label_issue"] == True].sort_values("label_score")
print(test_label_issues_ordered)
stud_ID exam_1 exam_2 exam_3 notes \
70 2bd759 93.0 79.0 97.0 great participation +10
106 34ccdd 90.0 100.0 89.0 great participation +10
46 bb3bab 97.0 88.0 74.0 great participation +10
103 bf1b14 66.0 83.0 96.0 NaN
97 4787de 73.0 84.0 68.0 great participation +10
92 865cbd 95.0 87.0 82.0 missed class frequently -10
72 32d53f 71.0 78.0 80.0 great final presentation +10
22 5b2f76 99.0 86.0 95.0 missed class frequently -10
3 28f8b4 67.0 82.0 98.0 NaN
69 df814d 78.0 85.0 84.0 NaN
45 f17261 95.0 88.0 69.0 NaN
98 1db3ff 95.0 81.0 76.0 NaN
109 ded944 86.0 85.0 89.0 NaN
124 343dd3 67.0 87.0 95.0 missed homework frequently -10
20 8d904d 73.0 73.0 76.0 missed class frequently -10
83 e4f0d5 86.0 85.0 89.0 missed homework frequently -10
120 d6d208 97.0 97.0 92.0 missed homework frequently -10
29 76c083 91.0 92.0 74.0 NaN
63 d030b5 91.0 0.0 94.0 cheated on exam, gets 0pts
23 695f96 96.0 69.0 92.0 NaN
84 745c23 89.0 95.0 72.0 NaN
10 13b36e 98.0 92.0 96.0 NaN
89 71d7b9 99.0 53.0 76.0 NaN
127 5ba892 98.0 97.0 93.0 NaN
43 9f0216 94.0 79.0 89.0 NaN
noisy_letter_grade is_label_issue label_score given_label \
70 F True 0.000537 F
106 F True 0.001853 F
46 F True 0.002121 F
103 D True 0.003628 D
97 D True 0.004006 D
92 A True 0.004031 A
72 D True 0.007930 D
22 B True 0.013226 B
3 D True 0.015255 D
69 B True 0.017692 B
45 D True 0.019767 D
98 B True 0.036197 B
109 D True 0.054746 D
124 C True 0.055110 C
20 D True 0.062675 D
83 C True 0.112695 C
120 B True 0.121059 B
29 B True 0.171280 B
63 D True 0.181689 D
23 B True 0.208001 B
84 B True 0.275028 B
10 A True 0.346032 A
89 C True 0.396350 C
127 A True 0.401493 A
43 B True 0.474349 B
predicted_label
70 A
106 A
46 A
103 F
97 B
92 C
72 A
22 A
3 B
69 D
45 B
98 D
109 B
124 B
20 B
83 A
120 A
29 D
63 B
23 D
84 D
10 F
89 D
127 F
43 D
The dataframe above shows the original label (given_label) for the examples that cleanlab finds most likely to be mislabeled, as well as an alternative predicted_label for each example. These examples have likely been labeled incorrectly and should be carefully re-examined. After manually inspecting the label issues above, we record the indices of the examples we want to remove from our test data.
Remember to inspect and manually handle the issues detected in your test data rather than correcting them automatically; otherwise you risk misleading model evaluations!
In this case, we manually found that the first 11 label issues with the lowest label_score correspond to real label errors. We’ll drop those data points from our test set, in order to curate a cleaner test set. Here we solely address mislabeled data for brevity, but you can similarly address other issues detected in your test data to ensure the most reliable model evaluation.
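For example, the outliers flagged in the test data could be reviewed the same way (a minimal sketch; whether to drop any of these rows is a judgment call you should make by inspecting them manually):
# Sketch: review other flagged issue types in the test data before deciding on manual corrections.
test_outlier_results = test_lab.get_issues("outlier")
df_test.join(test_outlier_results).query("is_outlier_issue").sort_values("outlier_score").head()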
[20]:
indices_to_drop_from_test_data = test_label_issues_ordered.index[:11] # found by manually inspecting test_label_issues_ordered
[21]:
df_test_cleaned = df_test.drop(indices_to_drop_from_test_data, axis=0).reset_index(drop=True)
test_features = test_features.drop(indices_to_drop_from_test_data, axis=0).reset_index(drop=True)
test_labels = test_labels.drop(indices_to_drop_from_test_data, axis=0).reset_index(drop=True)
Use clean test data to evaluate the performance of the model trained on noisy training data#
[22]:
preds = clf.predict(test_features)
acc_original = accuracy_score(test_labels.astype(int), preds.astype(int))
print(
f"Accuracy of model fit to noisy training data, measured on clean test data: {round(acc_original*100,1)}%"
)
Accuracy of model fit to noisy training data, measured on clean test data: 78.0%
Although curating clean test data does not directly help train a better ML model, more reliable model evaluation can improve your overall ML project. For instance, clean test data enables better-informed decisions about when to deploy a model, as well as better model/hyperparameter selection. While manually curating data can be tedious, Cleanlab Studio offers data correction interfaces to streamline this work.
6. Check for issues in training data and algorithmically correct them#
To run Datalab on our training set, we first compute out-of-sample predicted probabilities for our training data (via cross-validation).
[23]:
from sklearn.model_selection import cross_val_predict
num_crossval_folds = 5
pred_probs = cross_val_predict(
clf,
train_features,
train_labels,
cv=num_crossval_folds,
method="predict_proba",
)
Based on these ML model outputs, we similarly run Datalab to detect issues in our training data.
[24]:
train_features_array = train_features.to_numpy() # could alternatively be model embeddings
train_lab = Datalab(data=df_train, label_name="noisy_letter_grade", task="classification")
train_lab.find_issues(features=train_features_array, pred_probs=pred_probs)
train_lab.report(show_summary_score=True, show_all_issues=True)
Finding null issues ...
Finding label issues ...
Finding outlier issues ...
Finding near_duplicate issues ...
Finding non_iid issues ...
Finding class_imbalance issues ...
Finding underperforming_group issues ...
Audit complete. 318 issues found in the dataset.
Dataset Information: num_examples: 601, num_classes: 5
Here is a summary of various issues found in your data:
issue_type score num_issues
label 0.740433 175
outlier 0.344154 72
near_duplicate 0.588290 71
null 1.000000 0
non_iid 0.437267 0
class_imbalance 0.146423 0
underperforming_group 0.977223 0
(Note: A lower score indicates a more severe issue across all examples in the dataset.)
Learn about each issue: https://docs.cleanlab.ai/stable/cleanlab/datalab/guide/issue_type_description.html
See which examples in your dataset exhibit each issue via: `datalab.get_issues(<ISSUE_NAME>)`
Data indices corresponding to top examples of each issue are shown below.
----------------------- label issues -----------------------
About this issue:
Examples whose given label is estimated to be potentially incorrect
(e.g. due to annotation error) are flagged as having label issues.
Number of examples with this issue: 175
Overall dataset quality in terms of this issue: 0.7404
Examples representing most severe instances of this issue:
is_label_issue label_score given_label predicted_label
162 True 0.000072 F A
348 True 0.000161 B D
232 True 0.000256 F B
205 True 0.000458 F A
400 True 0.000738 C D
---------------------- outlier issues ----------------------
About this issue:
Examples that are very different from the rest of the dataset
(i.e. potentially out-of-distribution or rare/anomalous instances).
Number of examples with this issue: 72
Overall dataset quality in terms of this issue: 0.3442
Examples representing most severe instances of this issue:
is_outlier_issue outlier_score
588 True 2.358961e-46
336 True 2.490911e-36
269 True 3.122475e-28
321 True 5.374139e-22
311 True 1.358617e-17
------------------ near_duplicate issues -------------------
About this issue:
A (near) duplicate issue refers to two or more examples in
a dataset that are extremely similar to each other, relative
to the rest of the dataset. The examples flagged with this issue
may be exactly duplicated, or lie atypically close together when
represented as vectors (i.e. feature embeddings).
Number of examples with this issue: 71
Overall dataset quality in terms of this issue: 0.5883
Examples representing most severe instances of this issue:
is_near_duplicate_issue near_duplicate_score near_duplicate_sets distance_to_nearest_neighbor
600 True 0.0 [592, 593, 594, 595, 596, 597, 598, 599] 0.000000e+00
221 True 0.0 [500] 0.000000e+00
222 True 0.0 [315, 332] 7.791060e-09
243 True 0.0 [540] 2.379106e-09
599 True 0.0 [592, 593, 594, 595, 596, 597, 598, 600] 0.000000e+00
----------------------- null issues ------------------------
About this issue:
Examples identified with the null issue correspond to rows that have null/missing values across all feature columns (i.e. the entire row is missing values).
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 1.0000
Examples representing most severe instances of this issue:
is_null_issue null_score
0 False 1.0
396 False 1.0
397 False 1.0
398 False 1.0
399 False 1.0
---------------------- non_iid issues ----------------------
About this issue:
Whether the dataset exhibits statistically significant
violations of the IID assumption like:
changepoints or shift, drift, autocorrelation, etc.
The specific violation considered is whether the
examples are ordered such that almost adjacent examples
tend to have more similar feature values.
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 0.4373
Examples representing most severe instances of this issue:
is_non_iid_issue non_iid_score
165 False 0.550374
598 False 0.627357
599 False 0.627496
597 False 0.627502
600 False 0.627919
Additional Information:
p-value: 0.43726734378061227
------------------ class_imbalance issues ------------------
About this issue:
Examples belonging to the most under-represented class in the dataset.
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 0.1464
Examples representing most severe instances of this issue:
is_class_imbalance_issue class_imbalance_score given_label
321 False 0.146423 F
112 False 0.146423 F
506 False 0.146423 F
393 False 0.146423 F
508 False 0.146423 F
Additional Information:
Rarest Class: NA
--------------- underperforming_group issues ---------------
About this issue:
An underperforming group refers to a cluster of similar examples
(i.e. a slice) in the dataset for which the ML model predictions
are particularly poor (loss evaluation over this subpopulation is high).
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 0.9772
Examples representing most severe instances of this issue:
is_underperforming_group_issue underperforming_group_score
0 False 0.977223
402 False 0.977223
401 False 0.977223
400 False 0.977223
399 False 0.977223
Now, instead of manually inspecting the detected issues in our training data, we will automatically filter out of the training set all data points that cleanlab has flagged as likely mislabeled, outliers, or near duplicates. Unlike the test data, which cannot be blindly auto-curated because we must ensure reliable model evaluation, the training data can be modified more aggressively, as long as we can still faithfully evaluate the resulting fitted model.
[25]:
label_issue_results = train_lab.get_issues("label")
label_issues_idx = label_issue_results[label_issue_results["is_label_issue"] == True].index
label_issues_idx
[25]:
Index([ 2, 7, 12, 21, 23, 25, 26, 29, 32, 33,
...
566, 568, 571, 572, 574, 576, 578, 585, 587, 590],
dtype='int64', length=175)
[26]:
near_duplicates = train_lab.get_issues("near_duplicate")
near_duplicates_idx = near_duplicates[near_duplicates["is_near_duplicate_issue"] == True].index
near_duplicates_idx
[26]:
Index([ 19, 29, 41, 43, 71, 83, 85, 88, 101, 106, 117, 122, 146, 155,
156, 173, 187, 196, 221, 222, 224, 243, 252, 272, 277, 279, 288, 292,
300, 315, 329, 332, 342, 352, 363, 365, 366, 384, 388, 393, 394, 397,
404, 431, 436, 474, 480, 494, 500, 506, 508, 515, 516, 536, 537, 539,
540, 542, 559, 575, 576, 582, 592, 593, 594, 595, 596, 597, 598, 599,
600],
dtype='int64')
[27]:
outliers = train_lab.get_issues("outlier")
outliers_idx = outliers[outliers["is_outlier_issue"] == True].index
outliers_idx
[27]:
Index([ 0, 1, 3, 7, 26, 46, 52, 77, 89, 99, 101, 131, 132, 143,
153, 155, 159, 163, 193, 194, 195, 199, 208, 212, 240, 241, 242, 247,
256, 269, 287, 295, 299, 307, 311, 313, 321, 330, 336, 337, 340, 350,
361, 378, 379, 388, 392, 419, 432, 444, 476, 479, 484, 485, 489, 492,
504, 510, 511, 522, 523, 535, 543, 546, 547, 567, 571, 578, 579, 585,
588, 591],
dtype='int64')
[28]:
idx_to_drop = list(set(list(label_issues_idx) + list(near_duplicates_idx) + list(outliers_idx)))
len(idx_to_drop)
[28]:
276
[29]:
df_train_curated = df_train.drop(idx_to_drop, axis=0).reset_index(drop=True)
train_features = train_features.drop(idx_to_drop, axis=0).reset_index(drop=True)
train_labels = train_labels.drop(idx_to_drop, axis=0).reset_index(drop=True)
7. Train model on cleaned training data#
[30]:
clean_clf = XGBClassifier(tree_method="hist", enable_categorical=True, random_state=SEED)
clean_clf.fit(train_features, train_labels)
[30]:
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, device=None, early_stopping_rounds=None, enable_categorical=True, eval_metric=None, feature_types=None, gamma=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=None, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=None, n_jobs=None, num_parallel_tree=None, objective='multi:softprob', ...)
Use clean test data to evaluate the performance of the model trained on cleaned training data#
[31]:
clean_preds = clean_clf.predict(test_features)
acc_clean = accuracy_score(test_labels.astype(int), clean_preds.astype(int))
print(
f"Accuracy of model fit to clean training data, measured on clean test data: {round(acc_clean*100,1)}%"
)
Accuracy of model fit to clean training data, measured on clean test data: 78.9%
Although this simple data filtering may not be the most effective way to curate the training set (particularly if the initial ML model was poor-quality and hence the detected issues are inaccurate), we can at least faithfully assess its effect using our clean test data. In this case, we do see that the resulting ML model has improved, even with this simple training data filtering.
8. Identifying better training data curation strategies via hyperparameter optimization techniques#
Thus far, we’ve seen how to detect issues in the training and test data to improve model training and evaluation. While we should manually curate the test data to ensure faithful evaluation, we are free to algorithmically curate the training data. Since the simple filtering strategy above is not necessarily optimal, here we consider how to identify a better algorithmic curation strategy. Note, however, that the best strategy will be a hybrid of automated and manual data corrections, which you can carry out efficiently via the data correction interface in Cleanlab Studio.
Above we made basic training data edits to improve test performance; each of these data edits can be quantitatively parameterized (e.g. what fraction of each issue type to filter from the dataset). We can use (hyper)parameter-tuning techniques to automatically search for combinations of training data edits that result in particularly accurate models. For brevity, here we apply this hyperparameter optimization to maximize test-set performance, but in practice you should use a separate validation set (which you can curate similarly to the test data in this tutorial, in order to ensure reliable model evaluations).
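For instance, you could hold out part of the curated data for this search (a minimal sketch under the assumption that you are willing to split the cleaned test set; the variable names here are hypothetical):
# Hypothetical split: tune the data-edit parameters against a held-out validation set and
# reserve the remaining cleaned data for a single final evaluation.
from sklearn.model_selection import train_test_split

val_features, holdout_features, val_labels, holdout_labels = train_test_split(
    test_features, test_labels, test_size=0.5, random_state=SEED
)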
We define a dict to parameterize our dataset changes:
[32]:
default_edit_params = {
"drop_label_issue": 0.5,
"drop_outlier": 0.5,
"drop_near_duplicate": 0.2,
}
These example values translate into the following training data edits:
drop_label_issue: We filter out the top 50% of the data points flagged with label issues (those with the most severe label scores).
drop_outlier: We filter out the top 50% most severe outliers, based on outlier score (amongst the set of flagged outliers).
drop_near_duplicate: We drop extra copies of the top 20% of near duplicates (based on near duplicate score), always keeping at least one data point from each near-duplicate set.
We will search over various values for these parameters, fit a model to each corresponding training dataset edited based on the parameter values, and see which combination of values yields the best model.
Note: Datalab detects other issue types that could also be considered in this algorithmic data curation.
To more easily apply candidate training data edits, we first sort our data points flagged with each issue type based on the corresponding severity score:
[33]:
label_issues = train_lab.get_issues("label").query("is_label_issue").sort_values("label_score")
near_duplicates = train_lab.get_issues("near_duplicate").query("is_near_duplicate_issue").sort_values("near_duplicate_score")
outliers = train_lab.get_issues("outlier").query("is_outlier_issue").sort_values("outlier_score")
We introduce an edit_data function that implements candidate training data edits; we then fit a model to each edited training set and evaluate it on our cleaned test data (you can skip these implementation details).
See the implementation of edit_data (click to expand)
# Note: This pulldown content is for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.
def edit_data(train_features, train_labels, label_issues, near_duplicates, outliers,
              drop_label_issue, drop_near_duplicate, drop_outlier):
    """
    Edits the training data by dropping a specified percentage of data points identified as label issues,
    near duplicates, and outliers based on the full datasets provided for each issue type.

    Args:
        train_features (pd.DataFrame): DataFrame containing the training features.
        train_labels (pd.Series): Series containing the training labels.
        label_issues (pd.DataFrame): DataFrame containing data points with label issues.
        near_duplicates (pd.DataFrame): DataFrame containing data points identified as near duplicates.
        outliers (pd.DataFrame): DataFrame containing data points identified as outliers.
        drop_label_issue (float): Percentage of label issue data points to drop.
        drop_near_duplicate (float): Percentage of near duplicate data points to drop.
        drop_outlier (float): Percentage of outlier data points to drop.

    Returns:
        pd.DataFrame: The cleaned training features.
        pd.Series: The cleaned training labels.
    """
    # Extract indices for each type of issue
    label_issues_idx = label_issues.index.tolist()
    near_duplicates_idx = near_duplicates.index.tolist()
    outliers_idx = outliers.index.tolist()

    # Calculate the number of each type of data point to drop except near duplicates, which requires separate logic
    num_label_issues_to_drop = int(len(label_issues_idx) * drop_label_issue)
    num_outliers_to_drop = int(len(outliers_idx) * drop_outlier)

    # Calculate number of near duplicates to drop
    # Assuming the 'near_duplicate_sets' are lists of indices (integers) of near duplicates
    clusters = []
    for i in near_duplicates_idx:
        # Create a set for each cluster, add the current index to its near duplicate set
        cluster = set(near_duplicates.at[i, 'near_duplicate_sets'])
        cluster.add(i)
        clusters.append(cluster)
    # Deduplicate clusters by converting the list of sets to a set of frozensets
    unique_clusters = set(frozenset(cluster) for cluster in clusters)
    # If you need the unique clusters back in list of lists format:
    unique_clusters_list = [list(cluster) for cluster in unique_clusters]

    near_duplicates_idx_to_drop = []
    for cluster in unique_clusters_list:
        # Calculate the number of rows to drop, ensuring at least one datapoint remains
        n_drop = max(math.ceil(len(cluster) * drop_near_duplicate), 1)  # Drop at least k% or 1 row
        if len(cluster) > n_drop:  # Ensure we keep at least one datapoint
            # Randomly select datapoints to drop
            drops = random.sample(cluster, n_drop)
        else:
            # If the cluster is too small, adjust the number to keep at least one datapoint
            drops = random.sample(cluster, len(cluster) - 1)  # Keep at least one
        near_duplicates_idx_to_drop.extend(drops)

    # Determine the specific indices to drop
    label_issues_idx_to_drop = label_issues_idx[:num_label_issues_to_drop]
    outliers_idx_to_drop = outliers_idx[:num_outliers_to_drop]

    # Combine the indices to drop
    idx_to_drop = list(set(label_issues_idx_to_drop + near_duplicates_idx_to_drop + outliers_idx_to_drop))

    # Drop the rows from the training data
    train_features_cleaned = train_features.drop(idx_to_drop).reset_index(drop=True)
    train_labels_cleaned = train_labels.drop(idx_to_drop).reset_index(drop=True)

    return train_features_cleaned, train_labels_cleaned
[35]:
from itertools import product
# List of possible values for each data edit parameter to search over (finer grid will yield better results but longer runtimes)
param_grid = {
'drop_label_issue': [0.2, 0.5, 0.7, 1.0],
'drop_near_duplicate': [0.0, 0.2, 0.5],
'drop_outlier': [0.2, 0.5, 0.7],
}
# Generate all combinations of parameters
param_combinations = list(product(param_grid['drop_label_issue'], param_grid['drop_near_duplicate'], param_grid['drop_outlier']))
[36]:
best_score = 0
best_params = None
for drop_label_issue, drop_near_duplicate, drop_outlier in param_combinations:
    # Preprocess the data for the current combination of parameters
    train_features_preprocessed, train_labels_preprocessed = edit_data(
        train_features_v2, train_labels_v2, label_issues, near_duplicates, outliers,
        drop_label_issue, drop_near_duplicate, drop_outlier)

    # Train and evaluate the model
    model = XGBClassifier(tree_method="hist", enable_categorical=True, random_state=SEED)
    model.fit(train_features_preprocessed, train_labels_preprocessed)
    predictions = model.predict(test_features)
    accuracy = accuracy_score(test_labels.astype(int), predictions.astype(int))

    # Update the best score and parameters if the current model is better
    if accuracy > best_score:
        best_score = accuracy
        best_params = {'drop_label_issue': drop_label_issue, 'drop_near_duplicate': drop_near_duplicate, 'drop_outlier': drop_outlier}

# Print the best parameters and score
print(f"Best parameters found in search: {best_params}")
Best parameters found in search: {'drop_label_issue': 0.5, 'drop_near_duplicate': 0.0, 'drop_outlier': 0.7}
[37]:
print(
f"Accuracy of model fit to optimally cleaned training data, measured on clean test data: {round(best_score*100,1)}%"
)
Accuracy of model fit to optimally cleaned training data, measured on clean test data: 82.1%
9. Conclusion#
This tutorial demonstrated how you can properly use cleanlab to improve your own ML model. When dealing with noisy data, you should first manually curate your test data to ensure reliable model evaluation. After that, you can algorithmically curate your training data. We demonstrated a simple hyperparameter tuning technique to identify effective training data edits that produce an accurate model, as well as how cleanlab can help catch fundamental problems in the overall train/test setup, like duplicates/leakage and data drift.
Note that we never evaluated different models with different test set versions (which does not yield meaningful comparisons). We curated the test data to be as high-quality as possible and then based all model evaluations on this fixed version of the test data.
For brevity, this tutorial focused mostly on label issues and data pruning strategies. For classification tasks where you already have high-quality test data and solely want to handle label errors in your training data, cleanlab’s CleanLearning class offers an alternative convenience method to train a robust ML model. You can achieve better results by considering additional data issues beyond label errors, and curation strategies like fixing incorrect values; this is all streamlined via the intelligent data correction interface of Cleanlab Studio.
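As a rough reference, here is a minimal sketch of that CleanLearning workflow (it assumes the preprocessed, integer-encoded features and labels from this tutorial; consult the CleanLearning documentation for the exact options available):
# Minimal sketch: wrap a scikit-learn-compatible classifier with CleanLearning to train
# a model that detects and handles label errors in its training data.
from cleanlab.classification import CleanLearning
from xgboost import XGBClassifier

cl = CleanLearning(XGBClassifier(tree_method="hist", random_state=SEED))
cl.fit(train_features.to_numpy(), train_labels.to_numpy())  # labels must be integer-encoded
robust_preds = cl.predict(test_features.to_numpy())  # evaluate these on your clean test data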