Classification with Tabular Data using Scikit-Learn and Cleanlab
In this 5-minute quickstart tutorial, we use cleanlab with scikit-learn models to find potential label errors in a classification dataset with tabular (numeric/categorical) features. Tabular (or structured) data are typically organized in a row/column format and stored in a SQL database or in file formats such as CSV, Excel, or Parquet. Here we study the German Credit dataset, which contains 1,000 individuals described by 20 features, each labeled as either “good” or “bad” credit risk. cleanlab automatically shortlists hundreds of examples from this dataset that confuse our ML model, many of which are potential label errors (due to annotator mistakes), edge cases, or otherwise ambiguous examples.
Overview of what we’ll do in this tutorial:
Build a simple credit risk classifier with scikit-learn’s HistGradientBoostingClassifier.
Use this classifier to compute out-of-sample predicted probabilities, pred_probs, via cross-validation.
Identify potential label errors in the data with cleanlab’s find_label_issues method.
Train a robust version of the same histogram-based gradient boosting model via cleanlab’s CleanLearning wrapper.
Quickstart
Already have an sklearn-compatible model, tabular data, and given labels? Run the code below to train your model and get label issues.
from cleanlab.classification import CleanLearning
cl = CleanLearning(model)
_ = cl.fit(train_data, labels)
label_issues = cl.get_label_issues()
preds = cl.predict(test_data) # predictions from a version of your model
# trained on auto-cleaned data
Is your model/data not compatible with CleanLearning? You can instead run cross-validation on your model to get out-of-sample pred_probs. Then run the code below to get label issue indices ranked by their inferred severity.
from cleanlab.filter import find_label_issues
ranked_label_issues = find_label_issues(
    labels,
    pred_probs,
    return_indices_ranked_by="self_confidence",
)
1. Install required dependencies
You can use pip to install all packages required for this tutorial as follows:
!pip install scikit-learn
!pip install cleanlab
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
# !pip install git+https://github.com/cleanlab/cleanlab.git
[2]:
import random
import numpy as np
SEED = 100
np.random.seed(SEED)
random.seed(SEED)
2. Load and process the data
We first load the data features and labels (which are possibly noisy).
[3]:
from sklearn.datasets import fetch_openml
data = fetch_openml("credit-g", version=1) # get the credit data from OpenML
X_raw = data.data # features (pandas DataFrame)
labels_raw = data.target # labels (pandas Series)
Next we preprocess the data. Here we apply one-hot encoding to features with categorical data, and standardize features with numeric data. We also perform label encoding on the labels — “bad” is encoded as 0 and “good” is encoded as 1.
[4]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
cat_features = X_raw.select_dtypes("category").columns
X_encoded = pd.get_dummies(X_raw, columns=cat_features, drop_first=True)
num_features = X_raw.select_dtypes("float64").columns
scaler = StandardScaler()
X_scaled = X_encoded.copy()
X_scaled[num_features] = scaler.fit_transform(X_encoded[num_features])
labels = labels_raw.map({"bad": 0, "good": 1}) # encode labels as integers
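To make the two preprocessing steps concrete, here is a minimal sketch on a tiny hypothetical table (the `toy` DataFrame and its values are invented for illustration and are not part of the credit dataset). It applies the same one-hot encoding and standardization calls used above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy table: one categorical column and one numeric column.
toy = pd.DataFrame(
    {
        "housing": pd.Categorical(["own", "rent", "own", "free"]),
        "credit_amount": [1000.0, 2000.0, 3000.0, 4000.0],
    }
)

# One-hot encode the categorical column. drop_first=True drops the first
# category level (here "free") so the remaining indicator columns are not redundant.
cat_cols = toy.select_dtypes("category").columns
toy_encoded = pd.get_dummies(toy, columns=cat_cols, drop_first=True)

# Standardize the numeric column to zero mean and unit variance.
num_cols = toy.select_dtypes("float64").columns
toy_scaled = toy_encoded.copy()
toy_scaled[num_cols] = StandardScaler().fit_transform(toy_encoded[num_cols])

print(list(toy_scaled.columns))
print(toy_scaled)
```

Note that `select_dtypes("float64")` only picks up columns stored as floats, so with your own data you may need to adjust the dtype selection.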
Bringing Your Own Data (BYOD)?
You can easily replace the above with your own tabular dataset, and continue with the rest of the tutorial.
Your labels (the entries of labels) should be represented as integer indices 0, 1, …, num_classes - 1. For example, labels might look like: np.array([2,0,0,1,2,0,1])
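If your own labels are strings rather than integers, one possible way to convert them is scikit-learn's `LabelEncoder`. The sketch below uses a hypothetical `raw_labels` array; the encoder assigns indices in sorted order of the class names, so here "bad" becomes 0 and "good" becomes 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels standing in for your own dataset's labels.
raw_labels = np.array(["good", "bad", "bad", "good", "bad"])

encoder = LabelEncoder()
int_labels = encoder.fit_transform(raw_labels)  # integers in 0..num_classes-1

num_classes = len(encoder.classes_)
print(encoder.classes_)  # class names, in the order of their integer codes
print(int_labels)

# Sanity check: every class index from 0 to num_classes - 1 is present.
assert set(int_labels) == set(range(num_classes))
```

Keep the fitted `encoder` around so that you can map integer predictions back to the original class names with `encoder.inverse_transform`.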
3. Select a classification model and compute out-of-sample predicted probabilities
Here we use a simple histogram-based gradient boosting model (similar to XGBoost), but you can choose any suitable scikit-learn model for this tutorial.
[5]:
from sklearn.ensemble import HistGradientBoostingClassifier
clf = HistGradientBoostingClassifier()
To find potential labeling errors, cleanlab requires a probabilistic prediction from your model for every datapoint. However, these predictions will be overfitted (and thus unreliable) for examples the model was previously trained on. cleanlab is intended to only be used with out-of-sample predicted probabilities, i.e., on examples held out from the model during the training.
K-fold cross-validation is a straightforward way to produce out-of-sample predicted probabilities for every datapoint in the dataset by training K copies of our model on different data subsets and using each copy to predict on the subset of data it did not see during training. An additional benefit of cross-validation is that it provides a more reliable evaluation of our model than a single training/validation split. We can obtain cross-validated out-of-sample predicted probabilities from any classifier via a simple scikit-learn wrapper:
[6]:
from sklearn.model_selection import cross_val_predict
num_crossval_folds = 3 # for efficiency; values like 5 or 10 will generally work better
pred_probs = cross_val_predict(
    clf,
    X_scaled,
    labels,
    cv=num_crossval_folds,
    method="predict_proba",
)
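For intuition, the K-fold procedure described above can be written out by hand. This is only an illustrative sketch: the synthetic data from `make_classification`, the `LogisticRegression` stand-in model, and all variable names are assumptions chosen to keep the example small, not part of the tutorial's pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (an assumption; not the credit dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

n_classes = len(np.unique(y_demo))
oos_pred_probs = np.zeros((len(y_demo), n_classes))

# Train one fresh model copy per fold, and let each copy predict only on
# the examples it did not see during training.
kfold = StratifiedKFold(n_splits=3)
for train_idx, holdout_idx in kfold.split(X_demo, y_demo):
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    oos_pred_probs[holdout_idx] = fold_model.predict_proba(X_demo[holdout_idx])

print(oos_pred_probs.shape)  # one row of class probabilities per example
```

Every row of `oos_pred_probs` comes from a model that never trained on that row, which is the property cleanlab relies on.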
4. Use cleanlab to find label issues
Based on the given labels and out-of-sample predicted probabilities, cleanlab can quickly help us identify poorly labeled instances in our data table. For a dataset with N examples from K classes, the labels should be a 1D array of length N and predicted probabilities should be a 2D (N x K) array. Here we request that the indices of the identified label issues be sorted by cleanlab’s self-confidence score, which measures the quality of each given label via the probability assigned to it in our model’s prediction.
[7]:
from cleanlab.filter import find_label_issues
ranked_label_issues = find_label_issues(
    labels=labels, pred_probs=pred_probs, return_indices_ranked_by="self_confidence"
)
print(f"Cleanlab found {len(ranked_label_issues)} potential label errors.")
Cleanlab found 195 potential label errors.
Let’s review some of the most likely label errors:
[8]:
X_raw.iloc[ranked_label_issues].assign(label=labels_raw.iloc[ranked_label_issues]).head()
[8]:
index | checking_status | duration | credit_history | purpose | credit_amount | savings_status | employment | installment_commitment | personal_status | other_parties | ... | property_magnitude | age | other_payment_plans | housing | existing_credits | job | num_dependents | own_telephone | foreign_worker | label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
949 | no checking | 24.0 | existing paid | radio/tv | 3621.0 | 100<=X<500 | >=7 | 2.0 | male single | none | ... | car | 31.0 | none | own | 2.0 | skilled | 1.0 | none | yes | bad |
175 | no checking | 30.0 | all paid | used car | 7485.0 | no known savings | unemployed | 4.0 | female div/dep/mar | none | ... | real estate | 53.0 | bank | own | 1.0 | high qualif/self emp/mgmt | 1.0 | yes | yes | bad |
647 | no checking | 12.0 | existing paid | new car | 1386.0 | 500<=X<1000 | 1<=X<4 | 2.0 | female div/dep/mar | none | ... | life insurance | 26.0 | none | own | 1.0 | skilled | 1.0 | none | yes | bad |
278 | no checking | 6.0 | existing paid | furniture/equipment | 4611.0 | <100 | <1 | 1.0 | female div/dep/mar | none | ... | life insurance | 32.0 | none | own | 1.0 | skilled | 1.0 | none | yes | bad |
424 | 0<=X<200 | 12.0 | existing paid | furniture/equipment | 2762.0 | no known savings | >=7 | 1.0 | female div/dep/mar | none | ... | life insurance | 25.0 | bank | own | 1.0 | skilled | 1.0 | yes | yes | bad |
5 rows × 21 columns
These examples appear the most suspicious to our model and should be carefully re-examined. Perhaps the original annotators missed something when deciding on the labels for these individuals. This is a straightforward approach to visualize the rows in a data table that might be mislabeled.
5. Train a more robust model from noisy labels
Following proper ML practice, let’s split our data into train and test sets.
[9]:
from sklearn.model_selection import train_test_split
X_train, X_test, labels_train, labels_test = train_test_split(
    X_encoded,
    labels,
    test_size=0.25,
    random_state=SEED,
)
We again standardize the numeric features, this time fitting the scaling parameters solely on the training set.
[10]:
scaler = StandardScaler()
X_train[num_features] = scaler.fit_transform(X_train[num_features])
X_test[num_features] = scaler.transform(X_test[num_features])
X_train = X_train.to_numpy()
labels_train = labels_train.to_numpy()
X_test = X_test.to_numpy()
labels_test = labels_test.to_numpy()
Let’s now train and evaluate the original gradient boosting model.
[11]:
from sklearn.metrics import accuracy_score
clf.fit(X_train, labels_train)
acc_og = clf.score(X_test, labels_test)
print(f"Test accuracy of original model: {acc_og}")
Test accuracy of original model: 0.748
cleanlab provides a wrapper class that can be easily applied to any scikit-learn compatible model. Once wrapped, the resulting model can still be used in the exact same manner, but it will now train more robustly if the data have noisy labels.
[12]:
from cleanlab.classification import CleanLearning
clf = HistGradientBoostingClassifier() # Note we first re-initialize clf
cl = CleanLearning(clf) # cl has same methods/attributes as clf
The following operations take place when we train the cleanlab-wrapped model: The original model is trained in a cross-validated fashion to produce out-of-sample predicted probabilities. Then, these predicted probabilities are used to identify label issues, which are then removed from the dataset. Finally, the original model is trained on the remaining clean subset of the data once more.
[13]:
_ = cl.fit(X_train, labels_train)
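The three steps described above (cross-validate, identify issues, retrain on the remainder) can be imitated by hand to see the overall flow. The sketch below is a deliberately simplified illustration and not cleanlab's implementation: the synthetic data, the `LogisticRegression` stand-in model, and especially the fixed 0.5 self-confidence threshold are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (an assumption; not the credit dataset).
X_demo, y_clean = make_classification(n_samples=300, n_features=5, random_state=0)

# Simulate annotation mistakes by flipping 10% of the binary labels.
rng = np.random.default_rng(0)
flipped = rng.choice(len(y_clean), size=30, replace=False)
y_noisy = y_clean.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

# Step 1: out-of-sample predicted probabilities via cross-validation.
probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X_demo, y_noisy, cv=3, method="predict_proba"
)

# Step 2: flag suspicious labels. This fixed threshold is a hypothetical,
# simplified rule used only for illustration; it is NOT cleanlab's criterion.
self_conf = probs[np.arange(len(y_noisy)), y_noisy]
keep = self_conf >= 0.5

# Step 3: retrain a fresh model on the examples that were not flagged.
final_model = LogisticRegression(max_iter=1000).fit(X_demo[keep], y_noisy[keep])

print(f"Removed {np.sum(~keep)} of {len(y_noisy)} examples before the final fit.")
```

In practice, CleanLearning handles all of these steps inside a single `fit` call, so a manual loop like this is useful mainly for understanding what the wrapper does.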
We can get predictions from the resulting model and evaluate them, just like how we did it for the original scikit-learn model.
[14]:
preds = cl.predict(X_test)
acc_cl = accuracy_score(labels_test, preds)
print(f"Test accuracy of cleanlab-trained model: {acc_cl}")
Test accuracy of cleanlab-trained model: 0.752
We can see that the test set accuracy slightly improved as a result of the data cleaning. Note that this will not always be the case, especially when we evaluate on test data that are themselves noisy. The best practice is to run cleanlab to identify potential label issues and then manually review them, rather than blindly trusting any accuracy metrics. In particular, the most effort should be made to ensure high-quality test data, which are supposed to reflect the expected performance of our model during deployment.