Find Noisy Labels in Regression Datasets#
This 5-minute quickstart tutorial uses cleanlab to find potentially incorrect numeric values in a dataset column by means of a regression model. Unlike classification models, regression predicts numeric quantities such as price, income, age,… Response values in regression datasets may be corrupted due to: data entry or measurement errors, noise from sensors or other processes, or broken data pipelines. To find corrupted values in a numeric column, we treat it as the target value, i.e. label, to be predicted by a regression model and then use cleanlab to decide when the model predictions are trustworthy while deviating from the observed label value.
In this tutorial, we consider a student grades dataset, which records three exam grades and some optional notes for over 900 students, each being assigned a final score. Combined with any regression model of your choosing, cleanlab automatically identifies examples in this dataset that have incorrect final scores.
Overview of what we’ll do in this tutorial:
Fit a simple Gradient Boosting model (any other model could be used) on the exam-score and notes (covariates) in order to compute out-of-sample predictions of the final grade (the response variable in our regression).
Use cleanlab’s
CleanLearning.find_label_issues()
method to identify potentially incorrect final grade values based on outputs from this regression model.Train a more robust version of the same model after dropping the identified label errors using CleanLearning.
Run an alternative workflow to detect errors via cleanlab’s
Datalab
audit, which can simultaneously estimate many other types of data issues.
Quickstart
Already have an sklearn-compatible regression model
, features/covariates X
, and a label/target variable y
? Run the code below to train your model
and identify potentially incorrect y
values in your dataset.
from cleanlab.regression.learn import CleanLearning
cl = CleanLearning(model)
cl.fit(X, y)
label_issues = cl.get_label_issues()
preds = cl.predict(X_test) # predictions from a version of your model trained on auto-cleaned data
Is your model/data not compatible with CleanLearning
? You can instead run cross-validation on your model to get out-of-sample predictions
. With that, run the code below to find data and label issues in your regression dataset:
from cleanlab import Datalab
# Assuming your dataset has a label column named 'label'
lab = Datalab(dataset, label_name='label', task='regression')
# To detect more data issue types, optionally supply `features` (numeric dataset values or model embeddings of the data)
lab.find_issues(pred_probs=predictions, features=features)
lab.report()
1. Install required dependencies#
You can use pip
to install all packages required for this tutorial as follows:
!pip install scikit-learn matplotlib>=3.6.0
!pip install cleanlab[datalab]
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
# !pip install git+https://github.com/cleanlab/cleanlab.git
[2]:
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score
from cleanlab.regression.learn import CleanLearning
2. Load and process the data#
[4]:
train_data = pd.read_csv("https://s.cleanlab.ai/student_grades_r/train.csv")
test_data = pd.read_csv("https://s.cleanlab.ai/student_grades_r/test.csv")
train_data.head()
[4]:
exam_1 | exam_2 | exam_3 | notes | final_score | true_final_score | |
---|---|---|---|---|---|---|
0 | 72 | 81 | 80 | NaN | 73.3 | 73.3 |
1 | 89 | 62 | 93 | NaN | 83.8 | 83.8 |
2 | 97 | 0 | 94 | NaN | 73.5 | 73.5 |
3 | 80 | 76 | 96 | missed class frequently -10 | 78.6 | 78.6 |
4 | 67 | 87 | 95 | missed homework frequently -10 | 74.1 | 74.1 |
In the DataFrame above, final_score
represents the noisy scores and true_final_score
represents the ground truth. Note that ground truth is usually not available in real-world datasets, and is just added in this tutorial dataset for demonstration purposes.
We show a 3D scatter plot of the exam grades, with the color hue corresponding to the final score for each student. Incorrect datapoints are marked with an X.
See the code to visualize the data. (click to expand)
# Note: This pulldown content is for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
def plot_data(train_data, errors_idx):
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
x, y, z = train_data["exam_1"], train_data["exam_2"], train_data["exam_3"]
labels = train_data["final_score"]
img = ax.scatter(x, y, z, c=labels, cmap="jet")
fig.colorbar(img)
ax.plot(
x.iloc[errors_idx],
y.iloc[errors_idx],
z.iloc[errors_idx],
"x",
markeredgecolor="black",
markersize=10,
markeredgewidth=2.5,
alpha=0.8,
label="Label Errors"
)
ax.legend()
[6]:
errors_mask = train_data["final_score"] != train_data["true_final_score"]
errors_idx = np.where(errors_mask == 1)
plot_data(train_data, errors_idx)
Next we preprocess the data by applying one-hot encoding to features with categorical data (this is optional if your regression model can work directly with categorical features).
[7]:
feature_columns = ["exam_1", "exam_2", "exam_3", "notes"]
predicted_column = "final_score"
X_train_raw, y_train = train_data[feature_columns], train_data[predicted_column]
X_test_raw, y_test = test_data[feature_columns], test_data[predicted_column]
[8]:
categorical_features = ["notes"]
X_train = pd.get_dummies(X_train_raw, columns=categorical_features)
X_test = pd.get_dummies(X_test_raw, columns=categorical_features)
Bringing Your Own Data (BYOD)?
Assign your data’s features to variable X
and the target values to variable y
instead, then continue with the rest of the tutorial.
3. Define a regression model and use cleanlab to find potential label errors#
We’ll first demonstrate regression with noisy labels via the CleanLearning
class that can wrap any scikit-learn compatible regression model you have. CleanLearning
uses your model to estimate label issues (i.e. noisy y
-values) and train a more robust version of the same model when the original data contains noisy labels.
Here we define a CleanLearning
object with a histogram-based gradient boosting model (sklearn version of XGBoost) and use the find_label_issues
method to find potential errors in our dataset’s numeric label column. Any other sklearn-compatible regression model could be used, such as LinearRegression
or RandomForestRegressor
(or you can easily wrap arbitrary custom models to be compatible with the sklearn API).
[9]:
model = HistGradientBoostingRegressor()
cl = CleanLearning(model)
[10]:
label_issues = cl.find_label_issues(X_train, y_train)
CleanLearning
internally fits multiple copies of our regression model via cross-validation and bootstrapping in order to compute predictions and uncertainty estimates for the dataset. These are used to identify label issues (i.e. likely corrupted y
-values).
This method returns a Dataframe containing a label quality score (between 0 and 1) for each example in your dataset. Lower scores indicate examples more likely to be mislabeled with an erroneous y
value. The Dataframe also contains a boolean column specifying whether or not each example is identified to have a label issue (indicating its y
-value appears potentially corrupted).
[11]:
label_issues.head()
[11]:
is_label_issue | label_quality | given_label | predicted_label | |
---|---|---|---|---|
0 | False | 0.385101 | 73.3 | 76.499503 |
1 | False | 0.698255 | 83.8 | 82.776647 |
2 | True | 0.109373 | 73.5 | 63.170547 |
3 | False | 0.481096 | 78.6 | 75.984759 |
4 | False | 0.645270 | 74.1 | 75.795928 |
We can get the subset of examples flagged with label issues, and also sort by label quality score to find the indices of the 10 most likely mislabeled examples in our regression dataset.
[12]:
identified_issues = label_issues[label_issues["is_label_issue"] == True]
lowest_quality_labels = label_issues["label_quality"].argsort()[:10].to_numpy()
[13]:
print(
f"cleanlab found {len(identified_issues)} potential label errors in the dataset.\n"
f"Here are indices of the top 10 most likely errors: \n {lowest_quality_labels}"
)
cleanlab found 141 potential label errors in the dataset.
Here are indices of the top 10 most likely errors:
[659 367 56 318 305 560 657 688 117 160]
Let’s review some of the values most likely to be erroneous. To help us inspect these datapoints, we define a method to print any example from the dataset, together with its given (original) label and the suggested alternative label predicted by your regression model.
[14]:
def view_datapoint(index):
given_labels = label_issues["given_label"]
predicted_labels = label_issues["predicted_label"].round(1)
return pd.concat(
[X_train_raw, given_labels, predicted_labels], axis=1
).iloc[index]
[15]:
view_datapoint(lowest_quality_labels[:5])
[15]:
exam_1 | exam_2 | exam_3 | notes | given_label | predicted_label | |
---|---|---|---|---|---|---|
659 | 67 | 93 | 93 | NaN | 17.4 | 84.1 |
367 | 78 | 0 | 86 | NaN | 0.0 | 56.7 |
56 | 75 | 83 | 69 | NaN | 8.9 | 71.7 |
318 | 41 | 88 | 98 | missed class frequently -10 | 0.0 | 71.9 |
305 | 97 | 0 | 90 | NaN | 19.1 | 61.6 |
These are very clear errors that cleanlab has identified in this data! Note that the given_label
does not correctly reflect the final grade that these student should be getting.
cleanlab has shortlisted the most likely label errors to speed up your data cleaning process. With this list, you can decide whether to fix these label issues or remove erroneous examples from the dataset.
4. Train a more robust model from noisy labels#
Fixing the label issues manually may be time-consuming, but cleanlab can filter these noisy examples and train a model on the remaining clean data for you automatically.
To establish a baseline, let’s first train and evaluate our original Gradient Boosting model.
[17]:
baseline_model = HistGradientBoostingRegressor()
baseline_model.fit(X_train, y_train)
preds_og = baseline_model.predict(X_test)
r2_og = r2_score(y_test, preds_og)
print(f"r-squared score of original model: {r2_og:.3f}")
r-squared score of original model: 0.838
Now that we have a baseline, let’s check if using CleanLearning
improves our test accuracy.
CleanLearning
provides a wrapper that can be applied to any scikit-learn compatible model. The resulting model object can be used in the same manner, but it will now train more robustly if the data has noisy labels.
We can use the same CleanLearning
object defined above, and pass the label issues we already computed into .fit()
via the label_issues
argument. This accelerates things; if we did not provide the label issues, then they would be re-estimated via cross-validation. After the issues are estimated, CleanLearning
simply removes the examples with label issues and retrains your model on the remaining clean data.
[18]:
found_label_issues = cl.get_label_issues()
cl.fit(X_train, y_train, label_issues=found_label_issues)
preds_cl = cl.predict(X_test)
r2_cl = r2_score(y_test, preds_cl)
print(f"r-squared score of cleanlab's model: {r2_cl:.3f}")
r-squared score of cleanlab's model: 0.926
We can see that the coefficient of determination (r-squared score) of the test set improved as a result of the data cleaning. Note that this will not always be the case, especially when we are evaluating on test data that are themselves noisy. The best practice is to run cleanlab to identify potential label issues and then manually review them, before blindly trusting any evaluation metrics. In particular, the most effort should be made to ensure high-quality test data, which is supposed to reflect the expected performance of our model during deployment.
5. Other ways to find noisy labels in regression datasets#
The CleanLearning
workflow above requires a sklearn-compatible model. If your model or data format is not compatible with the requirements for using CleanLearning
, you can instead run cross-validation on your regression model to get out-of-sample predictions, and then use the Datalab
audit to estimate label quality scores for each example in your dataset.
This approach requires two inputs:
labels
: numpy array of given labels in the dataset.predictions
: numpy array of predictions for each example in the dataset from your favorite model (these should be out-of-sample predictions to get the best results).
[19]:
# Get out-of-sample predictions using cross-validation:
model = HistGradientBoostingRegressor()
predictions = cross_val_predict(estimator=model, X=X_train, y=y_train)
[20]:
from cleanlab import Datalab
lab = Datalab(
data=train_data.drop(columns=["true_final_score"]),
label_name="final_score",
task="regression",
)
lab.find_issues(
pred_probs=predictions,
issue_types={"label": {}}, # specify we're only interested in label issues here
)
Finding label issues ...
Audit complete. 50 issues found in the dataset.
[21]:
label_issues = lab.get_issues("label")
label_issues.sort_values("label_score").head()
[21]:
is_label_issue | label_score | given_label | predicted_label | |
---|---|---|---|---|
318 | True | 1.968627e-09 | 0.0 | 78.228799 |
659 | True | 2.646674e-08 | 17.4 | 86.402962 |
56 | True | 4.323818e-08 | 8.9 | 75.952758 |
160 | True | 2.422144e-07 | 0.0 | 60.456908 |
367 | True | 8.465815e-07 | 0.0 | 55.753968 |
As before, these label quality scores are continuous values in the range [0,1] where 1 represents a clean label (given label appears correct) and 0 a represents dirty label (given label appears corrupted, i.e. the numeric value may be incorrect). You can sort examples by their label quality scores to inspect the most-likely corrupted datapoints.
If possible, we recommend you use CleanLearning
to wrap your regression model (over providing its pre-computed predictions) for the most accurate label error detection (that properly accounts for aleatoric/epistemic uncertainty in the regression model). To understand how these approaches work, refer to our paper: Detecting Errors in Numerical Data via any Regression Model
You can alternatively provide features
to Datalab
instead of pre-computed predictions. These are (preprocessed) numeric dataset covariates, aka independent variables to the regression model (such as neural network embeddings of your raw data). Internally, this is equivalent to using CleanLearning
to find label issues if you also possible provide your sklearn-compatible regression model to Datalab.find_issues
. But you can simultaneously detect many more types of issues in your
dataset beyond mislabeling via Datalab (simply drop the issue_types
argument below).
[23]:
lab = Datalab(
data=train_data.drop(columns=["true_final_score"]),
label_name="final_score",
task="regression",
)
lab.find_issues(
features=X_train,
issue_types={ # Optional drop this to simultaneously detect many types of data/label issues
"label": {
# Optional: Specify which type of sklearn-compatible regression model is used to find label errors
"clean_learning_kwargs": {"model": HistGradientBoostingRegressor()}
}
},
)
Finding label issues ...
Audit complete. 141 issues found in the dataset.
[24]:
label_issues = lab.get_issues("label")
label_issues.sort_values("label_score").head()
[24]:
is_label_issue | label_score | given_label | predicted_label | |
---|---|---|---|---|
659 | True | 5.791186e-12 | 17.4 | 84.110719 |
367 | True | 6.485156e-10 | 0.0 | 56.670640 |
56 | True | 1.225300e-09 | 8.9 | 71.749976 |
318 | True | 1.499679e-09 | 0.0 | 71.947007 |
305 | True | 4.067882e-08 | 19.1 | 61.648396 |
While this tutorial focused on label issues, cleanlab’s Datalab
object can automatically detect many other types of issues in your dataset (outliers, near duplicates, etc). Simply remove the issue_types
argument from the above call to Datalab.find_issues()
above and Datalab
will more comprehensively audit your dataset (a default regression model will be used if you don’t specify the model type). Refer to our Datalab quickstart tutorial to
learn how to interpret the results (the interpretation remains mostly the same across different types of ML tasks).
Summary: To detect many types of issues in your regression dataset, we recommend using Datalab
with provided features
plus the best regression model you know for your data. If your goal is to train a robust regression model with noisy data rather than detect data/label issues, then use CleanLearning
. Alternatively, if you don’t have a sklearn-compatible regression model or already have pre-computed predictions from the model you’d like to rely on, you can pass these predictions
into Datalab
directly to find issues based on them instead of providing a regression model.