Classification with Tabular Data using Scikit-Learn and Datalab#

In this 5-minute quickstart tutorial, we use cleanlab with scikit-learn models to find potential label errors in a classification dataset with tabular (numeric/categorical) features. Tabular (or structured) data are typically organized in a row/column format and stored in a SQL database or in file types like CSV, Excel, or Parquet. Here we consider a Student Grades dataset, which contains over 900 individuals who each have three exam grades and some optional notes, and are each assigned a letter grade (their class label). cleanlab automatically identifies hundreds of examples in this dataset that were mislabeled with an incorrect final grade. This tutorial shows how to use this package to detect incorrect information in your own tabular datasets.

Overview of what we’ll do in this tutorial:

  • Train a classifier model (here scikit-learn’s HistGradientBoostingClassifier, although any model could be used) and use this classifier to compute (out-of-sample) predicted class probabilities via cross-validation.

  • Create a K nearest neighbours (KNN) graph between the examples in the dataset.

  • Identify issues in the dataset with cleanlab’s Datalab audit applied to the predictions and KNN graph.

Quickstart

Already have (out-of-sample) pred_probs from a model trained on your original data labels? Have a knn_graph computed between dataset examples (reflecting similarity in their feature values)? Run the code below to find issues in your dataset.

from cleanlab import Datalab

lab = Datalab(data=your_dataset, label_name="column_name_of_labels")
lab.find_issues(pred_probs=your_pred_probs, knn_graph=knn_graph)

lab.get_issues()

1. Install required dependencies#

You can use pip to install all packages required for this tutorial as follows:

!pip install scikit-learn pandas
!pip install "cleanlab[datalab]"
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
#     !pip install git+https://github.com/cleanlab/cleanlab.git
[2]:
import random
import numpy as np
import pandas as pd

from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.neighbors import NearestNeighbors

from cleanlab import Datalab

SEED = 100  # for reproducibility
np.random.seed(SEED)
random.seed(SEED)

2. Load and process the data#

We first load the data features and labels (which are possibly noisy).

[3]:
grades_data = pd.read_csv("https://s.cleanlab.ai/grades-tabular-demo-v2.csv")
grades_data.head()
[3]:
  stud_ID  exam_1  exam_2  exam_3                    notes letter_grade
0  f48f73   53.00   77.00    9.00                        3            C
1  0bd4e7   81.00   64.00   80.00  great participation +10            B
2  0bd4e7   81.00   64.00   80.00  great participation +10            B
3  cb9d7a    0.61    0.94    0.78                      NaN            C
4  9acca4   48.00   90.00    9.00                        1            C
[4]:
X_raw = grades_data[["exam_1", "exam_2", "exam_3", "notes"]]
labels = grades_data["letter_grade"]

Next we preprocess the data. Here we apply one-hot encoding to columns with categorical values and standardize the values in numeric columns.

[5]:
cat_features = ["notes"]
X_encoded = pd.get_dummies(X_raw, columns=cat_features, drop_first=True)

numeric_features = ["exam_1", "exam_2", "exam_3"]
scaler = StandardScaler()
X_processed = X_encoded.copy()
X_processed[numeric_features] = scaler.fit_transform(X_encoded[numeric_features])

Bringing Your Own Data (BYOD)?

Assign your data’s features to variable X and its labels to variable labels instead.
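
For instance, here is a minimal sketch of that setup (the file name student_data.csv and the column name label below are hypothetical placeholders):

import pandas as pd

your_data = pd.read_csv("student_data.csv")  # hypothetical file containing your own tabular data
X = your_data.drop(columns=["label"])        # feature columns
labels = your_data["label"]                  # (possibly noisy) class labels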

3. Select a classification model and compute out-of-sample predicted probabilities#

Here we use a simple histogram-based gradient boosting model (similar to XGBoost), but you can choose any suitable scikit-learn model for this tutorial.

[6]:
clf = HistGradientBoostingClassifier()

To find potential labeling errors, cleanlab requires a probabilistic prediction from your model for every datapoint. However, these predictions will be overfitted (and thus unreliable) for examples the model was previously trained on. For the best results, cleanlab should be applied with out-of-sample predicted class probabilities, i.e., predictions for examples held out from the model during training.

K-fold cross-validation is a straightforward way to produce out-of-sample predicted probabilities for every datapoint in the dataset: we train K copies of our model on different data subsets and use each copy to predict on the subset of data it did not see during training. We can implement this via the cross_val_predict function from scikit-learn:

[7]:
num_crossval_folds = 5
pred_probs = cross_val_predict(
    clf,
    X_processed,
    labels,
    cv=num_crossval_folds,
    method="predict_proba",
)
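
As an optional sanity check, pred_probs should contain one row per example and one column per class, with each row summing to 1:

# Optional sanity check on the out-of-sample predicted probabilities
print(pred_probs.shape)                        # (num_examples, num_classes)
print(np.allclose(pred_probs.sum(axis=1), 1))  # each row should sum to 1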

4. Construct K nearest neighbours graph#

The KNN graph reflects how close each example is when compared to other examples in our dataset (in the numerical space of preprocessed feature values). This similarity information is used by Datalab to identify issues like outliers in our data. For tabular data, think carefully about the most appropriate way to define the similarity between two examples.

Here we use the NearestNeighbors class in sklearn to easily compute this graph (with similarity defined by the Euclidean distance between feature values). The graph should be represented as a sparse matrix with nonzero entries indicating nearest neighbors of each example and their distance.

[8]:
KNN = NearestNeighbors(metric='euclidean')  # uses the default of 5 nearest neighbors
KNN.fit(X_processed.values)

knn_graph = KNN.kneighbors_graph(mode="distance")
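
Optionally, you can inspect the resulting graph. It is a scipy sparse matrix with one row per example, whose stored entries are the distances from that example to its nearest neighbors:

# Optional inspection of the KNN graph (a scipy sparse matrix)
print(knn_graph.shape)                              # (num_examples, num_examples)
print(knn_graph.nnz // knn_graph.shape[0], "neighbors stored per example")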

5. Use cleanlab to find label issues#

Based on the given labels, predicted probabilities, and KNN graph, cleanlab can quickly help us identify suspicious values in our grades table.

We use cleanlab’s Datalab class, which has several ways of loading the data. In this case, we’ll simply wrap the dataset (features and noisy labels) in a dictionary that is used to instantiate a Datalab object, which can then audit our dataset for various types of issues.

[9]:
data = {"X": X_processed.values, "y": labels}

lab = Datalab(data, label_name="y")
lab.find_issues(pred_probs=pred_probs, knn_graph=knn_graph)
Finding label issues ...
Finding outlier issues ...
Finding near_duplicate issues ...
Audit complete. 357 issues found in the dataset.
[10]:
lab.report()
Here is a summary of the different kinds of issues found in the data:

    issue_type  num_issues
         label         294
       outlier          46
near_duplicate          17

Dataset Information: num_examples: 941, num_classes: 5


----------------------- label issues -----------------------

About this issue:
        Examples whose given label is estimated to be potentially incorrect
    (e.g. due to annotation error) are flagged as having label issues.


Number of examples with this issue: 294
Overall dataset quality in terms of this issue: 0.6578

Examples representing most severe instances of this issue:
     is_label_issue  label_score given_label predicted_label
3              True     0.000005           C               F
886            True     0.000059           D               B
709            True     0.000104           F               C
723            True     0.000169           A               C
689            True     0.000181           B               D


---------------------- outlier issues ----------------------

About this issue:
        Examples that are very different from the rest of the dataset
    (i.e. potentially out-of-distribution or rare/anomalous instances).


Number of examples with this issue: 46
Overall dataset quality in terms of this issue: 0.7154

Examples representing most severe instances of this issue:
   is_outlier_issue  outlier_score
3              True       0.012085
7              True       0.061510
0              True       0.115512
4              True       0.124391
8              True       0.214163


------------------ near_duplicate issues -------------------

About this issue:
        A (near) duplicate issue refers to two or more examples in
    a dataset that are extremely similar to each other, relative
    to the rest of the dataset.  The examples flagged with this issue
    may be exactly duplicated, or lie atypically close together when
    represented as vectors (i.e. feature embeddings).


Number of examples with this issue: 17
Overall dataset quality in terms of this issue: 0.2169

Examples representing most severe instances of this issue:
     is_near_duplicate_issue  near_duplicate_score   near_duplicate_sets  distance_to_nearest_neighbor
690                     True                   0.0                 [246]                           0.0
185                     True                   0.0                 [582]                           0.0
691                     True                   0.0  [294, 251, 820, 845]                           0.0
168                     True                   0.0                 [915]                           0.0
187                     True                   0.0        [27, 924, 704]                           0.0

Label issues#

The above report shows that cleanlab identified many label issues in the data. We can see which examples are estimated to be mislabeled (as well as a numeric quality score quantifying how likely each label is to be correct) via the get_issues method.

[11]:
issue_results = lab.get_issues("label")
issue_results.head()
[11]:
   is_label_issue  label_score given_label predicted_label
0            True     0.000842           C               F
1           False     0.555944           B               B
2           False     0.555944           B               B
3            True     0.000005           C               F
4            True     0.004374           C               D

To review the most severe label issues, sort the DataFrame above by the label_score column (a lower score indicates that the label is less likely to be correct).

Let’s review some of the most likely label errors:

[12]:
sorted_issues = issue_results.sort_values("label_score").index

X_raw.iloc[sorted_issues].assign(
    given_label=labels.iloc[sorted_issues],
    predicted_label=issue_results["predicted_label"].iloc[sorted_issues]
).head()
[12]:
     exam_1  exam_2  exam_3 notes given_label predicted_label
3      0.61    0.94    0.78   NaN           C               F
886   89.00   95.00   73.00   NaN           D               B
709   64.00   70.00   86.00   NaN           F               C
723   53.00   89.00   78.00   NaN           A               C
689   77.00   51.00   70.00   NaN           B               D

The dataframe above shows the original label (given_label) for examples that cleanlab finds most likely to be mislabeled, as well as an alternative predicted_label for each example.

These examples have clearly been labeled incorrectly and should be carefully re-examined: a student with grades of 89, 95, and 73 surely does not deserve a D!
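
If you'd like to review every flagged example rather than just the top five, one way to do this (a sketch building on the code above) is:

# All examples flagged as likely mislabeled, sorted from most to least severe
flagged_idx = issue_results[issue_results["is_label_issue"]].sort_values("label_score").index
X_raw.iloc[flagged_idx].assign(
    given_label=labels.iloc[flagged_idx],
    predicted_label=issue_results["predicted_label"].iloc[flagged_idx],
)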

Outlier issues#

According to the report, our dataset contains some outliers. We can see which examples are outliers (and a numeric quality score quantifying how typical each example appears to be) via get_issues. We sort the resulting DataFrame by cleanlab’s outlier quality score to see the most severe outliers in our dataset.

[13]:
outlier_results = lab.get_issues("outlier")
sorted_outliers = outlier_results.sort_values("outlier_score").index

X_raw.iloc[sorted_outliers].head()
[13]:
   exam_1  exam_2  exam_3                                              notes
3    0.61    0.94    0.78                                                NaN
7  100.00  100.00    1.00                                                NaN
0   53.00   77.00    9.00                                                  3
4   48.00   90.00    9.00                                                  1
8    0.00   56.00   96.00  <p style="font-size: 18px; color: #ff00ff; bac...

The student at index 3 has fractional exam scores, which is likely an error. We also see that the students at index 0 and 4 have numerical values in their notes section, which is also probably unintended. Lastly, the student at index 8 has an HTML string in their notes section, definitely a mistake!

Near-duplicate issues#

According to the report, our dataset contains some sets of nearly duplicated examples. We can see which examples are (nearly) duplicated (and a numeric quality score quantifying how dissimilar each example is from its nearest neighbor in the dataset) via get_issues. We sort the resulting DataFrame by cleanlab’s near-duplicate quality score to see the examples in our dataset that are most nearly duplicated.

[14]:
duplicate_results = lab.get_issues("near_duplicate")
duplicate_results.sort_values("near_duplicate_score").head()
[14]:
     is_near_duplicate_issue  near_duplicate_score   near_duplicate_sets  distance_to_nearest_neighbor
690                     True                   0.0                 [246]                           0.0
185                     True                   0.0                 [582]                           0.0
691                     True                   0.0  [294, 251, 820, 845]                           0.0
168                     True                   0.0                 [915]                           0.0
187                     True                   0.0        [27, 924, 704]                           0.0
[15]:
duplicate_results.iloc[[691, 294, 251, 820, 845]]
[15]:
     is_near_duplicate_issue  near_duplicate_score        near_duplicate_sets  distance_to_nearest_neighbor
691                     True              0.000000       [294, 251, 820, 845]                      0.000000
294                     True              0.000000       [691, 251, 820, 845]                      0.000000
251                    False              0.042610       [820, 294, 691, 845]                      0.042636
820                    False              0.042610       [251, 294, 691, 845]                      0.042636
845                    False              0.083468  [820, 251, 294, 691, 800]                      0.083663

The results above show which examples cleanlab considers nearly duplicated (rows where is_near_duplicate_issue == True). Here, we see many examples that cleanlab has flagged as being nearly duplicated. Let’s view these examples to see how similar they are, starting with the top one. We compare this example (student 690) against the example cleanlab has identified in the near_duplicate_sets (student 246).

[16]:
X_raw.iloc[[690, 246]]
[16]:
     exam_1  exam_2  exam_3                         notes
690    78.0    58.0    86.0  great final presentation +10
246    78.0    58.0    86.0  great final presentation +10

These examples are exact duplicates! Perhaps the same information was accidentally recorded twice in this data.

For example 691, cleanlab identified four examples (294, 251, 820, and 845) that are near duplicates of it. Let's check them out:

[17]:
X_raw.iloc[[691, 294, 251, 820, 845]]
[17]:
     exam_1  exam_2  exam_3 notes
691    95.0    94.0    89.0   NaN
294    95.0    94.0    89.0   NaN
251    96.0    94.0    89.0   NaN
820    96.0    95.0    89.0   NaN
845    96.0    96.0    88.0   NaN

These students are indeed very similar to one another! Including near/exact duplicates in a dataset may have unintended effects on models; be wary about splitting them across training/test sets.
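
As a rough sketch of how you might act on this, the snippet below gathers every index that participates in a near-duplicate set, so you could drop redundant copies or keep each set within a single train/test split:

# Collect all examples involved in near-duplicate sets (a sketch, not part of the Datalab API)
flagged = duplicate_results[duplicate_results["is_near_duplicate_issue"]]
dup_indices = set(flagged.index)
for dup_set in flagged["near_duplicate_sets"]:
    dup_indices.update(dup_set)
print(f"{len(dup_indices)} examples participate in near-duplicate sets")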

This tutorial highlighted a straightforward approach to detect potentially incorrect information in any tabular dataset. Just use Datalab with any ML model: the better the model, the more accurately Datalab will detect data errors!