Detecting Issues in Tabular Data (Numeric/Categorical columns) with Datalab#
In this 5-minute quickstart tutorial, we use Datalab to detect various issues in a classification dataset with tabular (numeric/categorical) features. Tabular (or structured) data are typically organized in a row/column format and stored in a SQL database or in file formats like CSV, Excel, or Parquet. Here we consider a Student Grades dataset, which contains over 900 students, each with three exam grades, some optional notes, and an assigned letter grade (their class label). cleanlab automatically identifies hundreds of examples in this dataset that were mislabeled with an incorrect final grade. You can run the same code from this tutorial to detect incorrect information in your own tabular classification datasets.
Overview of what we’ll do in this tutorial:
Train a classifier model (here scikit-learn’s HistGradientBoostingClassifier, although any model could be used) and use this classifier to compute (out-of-sample) predicted class probabilities via cross-validation.
Create a K nearest neighbours (KNN) graph between the examples in the dataset.
Identify issues in the dataset with cleanlab's Datalab audit applied to the predictions and KNN graph.
Quickstart
Already have (out-of-sample) pred_probs from a model trained on your original data labels? Have a knn_graph computed between dataset examples (reflecting similarity in their feature values)? Run the code below to find issues in your dataset.
from cleanlab import Datalab
lab = Datalab(data=your_dataset, label_name="column_name_of_labels")
lab.find_issues(pred_probs=your_pred_probs, knn_graph=knn_graph)
lab.get_issues()
1. Install required dependencies#
You can use pip to install all packages required for this tutorial as follows:
!pip install scikit-learn datasets
!pip install "cleanlab[datalab]"
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
# !pip install git+https://github.com/cleanlab/cleanlab.git
[2]:
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.neighbors import NearestNeighbors
from cleanlab import Datalab
SEED = 100 # for reproducibility
np.random.seed(SEED)
random.seed(SEED)
2. Load and process the data#
We first load the data features and labels (which are possibly noisy).
[3]:
grades_data = pd.read_csv("https://s.cleanlab.ai/grades-tabular-demo-v2.csv")
grades_data.head()
[3]:
|   | stud_ID | exam_1 | exam_2 | exam_3 | notes | letter_grade |
|---|---------|--------|--------|--------|-------|--------------|
| 0 | f48f73 | 53.00 | 77.00 | 9.00 | 3 | C |
| 1 | 0bd4e7 | 81.00 | 64.00 | 80.00 | great participation +10 | B |
| 2 | 0bd4e7 | 81.00 | 64.00 | 80.00 | great participation +10 | B |
| 3 | cb9d7a | 0.61 | 0.94 | 0.78 | NaN | C |
| 4 | 9acca4 | 48.00 | 90.00 | 9.00 | 1 | C |
[4]:
X_raw = grades_data[["exam_1", "exam_2", "exam_3", "notes"]]
labels = grades_data["letter_grade"]
Next we preprocess the data. Here we apply one-hot encoding to columns with categorical values and standardize the values in numeric columns.
[5]:
cat_features = ["notes"]
X_encoded = pd.get_dummies(X_raw, columns=cat_features, drop_first=True)
numeric_features = ["exam_1", "exam_2", "exam_3"]
scaler = StandardScaler()
X_processed = X_encoded.copy()
X_processed[numeric_features] = scaler.fit_transform(X_encoded[numeric_features])
Bringing Your Own Data (BYOD)? Assign your data's features to variable X and its labels to variable labels instead.
3. Select a classification model and compute out-of-sample predicted probabilities#
Here we use a simple histogram-based gradient boosting model (similar to XGBoost), but you can choose any suitable scikit-learn model for this tutorial.
[6]:
clf = HistGradientBoostingClassifier()
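Any scikit-learn compatible classifier could be substituted here. As a hypothetical alternative (shown only for illustration and not used in the rest of this tutorial), a random forest would also work:

from sklearn.ensemble import RandomForestClassifier

# Hypothetical alternative model (any scikit-learn classifier works with cross_val_predict below)
alternative_clf = RandomForestClassifier(n_estimators=100, random_state=SEED)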
To find potential labeling errors, cleanlab requires a probabilistic prediction from your model for every datapoint. However, these predictions will be overfitted (and thus unreliable) for examples the model was previously trained on. For the best results, cleanlab should be applied with out-of-sample predicted class probabilities, i.e., on examples held out from the model during training.
K-fold cross-validation is a straightforward way to produce out-of-sample predicted probabilities for every datapoint in the dataset: we train K copies of our model on different data subsets and use each copy to predict on the subset of data it did not see during training. Make sure that the columns of your pred_probs are properly ordered with respect to the ordering of classes, which for Datalab is: lexicographically sorted by class name. We can implement all of this via the cross_val_predict method from scikit-learn.
[7]:
num_crossval_folds = 5
pred_probs = cross_val_predict(
clf,
X_processed,
labels,
cv=num_crossval_folds,
method="predict_proba",
)
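As an optional sanity check (just a sketch using the variables defined above), you can confirm that pred_probs has one column per class and that its columns follow the lexicographic class ordering Datalab expects:

# cross_val_predict orders columns according to clf.classes_, which scikit-learn
# sorts lexicographically for string labels -- the same ordering Datalab expects.
class_order = np.unique(labels)  # lexicographically sorted class names (here the letter grades)
assert pred_probs.shape == (len(labels), len(class_order))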
4. Construct K nearest neighbours graph#
The KNN graph reflects how close each example is when compared to other examples in our dataset (in the numerical space of preprocessed feature values). This similarity information is used by Datalab to identify issues like outliers in our data. For tabular data, think carefully about the most appropriate way to define the similarity between two examples.
Here we use the NearestNeighbors class in scikit-learn to easily compute this graph (with similarity defined by the Euclidean distance between feature values). The graph should be represented as a sparse matrix with nonzero entries indicating the nearest neighbors of each example and their distances.
[8]:
KNN = NearestNeighbors(metric='euclidean')
KNN.fit(X_processed.values)
knn_graph = KNN.kneighbors_graph(mode="distance")
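If you want to peek at what this sparse matrix contains (optional), each row stores the nearest neighbors of the corresponding example; a minimal sketch:

# For any example, the nonzero entries of its row give its nearest neighbors:
# the column indices are the neighbor indices, the stored values are distances.
first_row = knn_graph[0]
neighbor_indices, neighbor_distances = first_row.indices, first_row.data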
5. Use cleanlab to find label issues#
Based on the given labels, predicted probabilities, and KNN graph, cleanlab can quickly help us identify suspicious values in our grades table.
We use cleanlab's Datalab class, which has several ways of loading the data. In this case, we'll simply wrap the dataset (features and noisy labels) in a dictionary that is used to instantiate a Datalab object so that it can audit our dataset for various types of issues.
[9]:
data = {"X": X_processed.values, "y": labels}
lab = Datalab(data, label_name="y")
lab.find_issues(pred_probs=pred_probs, knn_graph=knn_graph)
Finding label issues ...
Finding outlier issues ...
Finding near_duplicate issues ...
Finding non_iid issues ...
Finding class_imbalance issues ...
Finding underperforming_group issues ...
Audit complete. 358 issues found in the dataset.
[10]:
lab.report()
Here is a summary of the different kinds of issues found in the data:
issue_type num_issues
label 294
outlier 46
near_duplicate 17
non_iid 1
Dataset Information: num_examples: 941, num_classes: 5
----------------------- label issues -----------------------
About this issue:
Examples whose given label is estimated to be potentially incorrect
(e.g. due to annotation error) are flagged as having label issues.
Number of examples with this issue: 294
Overall dataset quality in terms of this issue: 0.7109
Examples representing most severe instances of this issue:
is_label_issue label_score given_label predicted_label
3 True 0.000005 C F
886 True 0.000059 D B
709 True 0.000104 F C
723 True 0.000169 A C
689 True 0.000181 B D
---------------------- outlier issues ----------------------
About this issue:
Examples that are very different from the rest of the dataset
(i.e. potentially out-of-distribution or rare/anomalous instances).
Number of examples with this issue: 46
Overall dataset quality in terms of this issue: 0.3590
Examples representing most severe instances of this issue:
is_outlier_issue outlier_score
3 True 3.051882e-07
7 True 7.683133e-05
0 True 6.536582e-04
4 True 8.406589e-04
8 True 5.324246e-03
------------------ near_duplicate issues -------------------
About this issue:
A (near) duplicate issue refers to two or more examples in
a dataset that are extremely similar to each other, relative
to the rest of the dataset. The examples flagged with this issue
may be exactly duplicated, or lie atypically close together when
represented as vectors (i.e. feature embeddings).
Number of examples with this issue: 17
Overall dataset quality in terms of this issue: 0.6165
Examples representing most severe instances of this issue:
is_near_duplicate_issue near_duplicate_score near_duplicate_sets distance_to_nearest_neighbor
12 True 0.0 [2, 1, 6, 9] 0.0
582 True 0.0 [185] 0.0
185 True 0.0 [582] 0.0
187 True 0.0 [27] 0.0
898 True 0.0 [637] 0.0
---------------------- non_iid issues ----------------------
About this issue:
Whether the dataset exhibits statistically significant
violations of the IID assumption like:
changepoints or shift, drift, autocorrelation, etc.
The specific violation considered is whether the
examples are ordered such that almost adjacent examples
tend to have more similar feature values.
Number of examples with this issue: 1
Overall dataset quality in terms of this issue: 0.0014
Examples representing most severe instances of this issue:
is_non_iid_issue non_iid_score
595 True 0.702427
147 False 0.711186
157 False 0.721394
771 False 0.731979
898 False 0.740335
Additional Information:
p-value: 0.0014153602099278074
Label issues#
The above report shows that cleanlab identified many label issues in the data. We can see which examples are estimated to be mislabeled (as well as a numeric quality score quantifying how likely their label is correct) via the get_issues
method.
[11]:
issue_results = lab.get_issues("label")
issue_results.head()
[11]:
|   | is_label_issue | label_score | given_label | predicted_label |
|---|----------------|-------------|-------------|-----------------|
| 0 | True | 0.000842 | C | F |
| 1 | False | 0.555944 | B | B |
| 2 | False | 0.555944 | B | B |
| 3 | True | 0.000005 | C | F |
| 4 | True | 0.004374 | C | D |
To review the most severe label issues, sort the DataFrame above by the label_score column (a lower score indicates that the label is less likely to be correct).
Let’s review some of the most likely label errors:
[12]:
sorted_issues = issue_results.sort_values("label_score").index
X_raw.iloc[sorted_issues].assign(
given_label=labels.iloc[sorted_issues],
predicted_label=issue_results["predicted_label"].iloc[sorted_issues]
).head()
[12]:
|     | exam_1 | exam_2 | exam_3 | notes | given_label | predicted_label |
|-----|--------|--------|--------|-------|-------------|-----------------|
| 3   | 0.61 | 0.94 | 0.78 | NaN | C | F |
| 886 | 89.00 | 95.00 | 73.00 | NaN | D | B |
| 709 | 64.00 | 70.00 | 86.00 | NaN | F | C |
| 723 | 53.00 | 89.00 | 78.00 | NaN | A | C |
| 689 | 77.00 | 51.00 | 70.00 | NaN | B | D |
The dataframe above shows the original label (given_label) for the examples that cleanlab finds most likely to be mislabeled, as well as an alternative predicted_label for each example. These examples have likely been labeled incorrectly and should be carefully re-examined - a student with grades of 89, 95 and 73 surely does not deserve a D!
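How you act on these flagged examples depends on your application. As one possible (hedged) follow-up using the variables defined above, you could collect the flagged rows for manual review and, after verifying them, exclude them before retraining a model:

flagged = issue_results["is_label_issue"]  # boolean mask of examples flagged as label issues
examples_to_review = X_raw[flagged].assign(given_label=labels[flagged])

# Only drop flagged rows after actually reviewing them:
X_cleaned = X_processed[~flagged.values]
labels_cleaned = labels[~flagged.values]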
Outlier issues#
According to the report, our dataset contains some outliers. We can see which examples are outliers (and a numeric quality score quantifying how typical each example appears to be) via get_issues. We sort the resulting DataFrame by cleanlab's outlier quality score to see the most severe outliers in our dataset.
[13]:
outlier_results = lab.get_issues("outlier")
sorted_outliers = outlier_results.sort_values("outlier_score").index
X_raw.iloc[sorted_outliers].head()
[13]:
|   | exam_1 | exam_2 | exam_3 | notes |
|---|--------|--------|--------|-------|
| 3 | 0.61 | 0.94 | 0.78 | NaN |
| 7 | 100.00 | 100.00 | 1.00 | NaN |
| 0 | 53.00 | 77.00 | 9.00 | 3 |
| 4 | 48.00 | 90.00 | 9.00 | 1 |
| 8 | 0.00 | 56.00 | 96.00 | <p style="font-size: 18px; color: #ff00ff; bac... |
The student at index 3 has fractional exam scores, which is likely an error. We also see that the students at indices 0 and 4 have numerical values in their notes section, which is also probably unintended. Lastly, the student at index 8 has an HTML string in their notes section, definitely a mistake!
Near-duplicate issues#
According to the report, our dataset contains some sets of nearly duplicated examples. We can see which examples are (nearly) duplicated (and a numeric quality score quantifying how dissimilar each example is from its nearest neighbor in the dataset) via get_issues. We sort the resulting DataFrame by cleanlab's near-duplicate quality score to see the examples in our dataset that are most nearly duplicated.
[14]:
duplicate_results = lab.get_issues("near_duplicate")
duplicate_results.sort_values("near_duplicate_score").head()
[14]:
|     | is_near_duplicate_issue | near_duplicate_score | near_duplicate_sets | distance_to_nearest_neighbor |
|-----|-------------------------|----------------------|---------------------|------------------------------|
| 12  | True | 0.0 | [2, 1, 6, 9] | 0.0 |
| 582 | True | 0.0 | [185] | 0.0 |
| 185 | True | 0.0 | [582] | 0.0 |
| 187 | True | 0.0 | [27] | 0.0 |
| 898 | True | 0.0 | [637] | 0.0 |
The results above show which examples cleanlab considers nearly duplicated (rows where is_near_duplicate_issue == True). Let's view some of these flagged examples to see how similar they are. For instance, we can compare example 690 against the example cleanlab identified in its near_duplicate_sets (example 246):
[15]:
X_raw.iloc[[690, 246]]
[15]:
|     | exam_1 | exam_2 | exam_3 | notes |
|-----|--------|--------|--------|-------|
| 690 | 78.0 | 58.0 | 86.0 | great final presentation +10 |
| 246 | 78.0 | 58.0 | 86.0 | great final presentation +10 |
These examples are exact duplicates! Perhaps the same information was accidentally recorded twice in this data.
Similarly, let’s take a look at example 185 and the identified near duplicate, example 582:
[16]:
X_raw.iloc[[185, 582]]
[16]:
|     | exam_1 | exam_2 | exam_3 | notes |
|-----|--------|--------|--------|-------|
| 185 | 90.0 | 67.0 | 77.0 | missed class frequently -10 |
| 582 | 90.0 | 67.0 | 77.0 | missed class frequently -10 |
We identified another exact duplicate in our dataset! Including near/exact duplicates in a dataset may have unintended effects on models; be wary about splitting them across training/test sets. Learn more about handling near duplicates detected in a dataset from the FAQ.
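As a concrete illustration of this advice, here is a minimal sketch (using the duplicate_results DataFrame from above; the simple grouping rule is illustrative, not part of cleanlab's API) that relies on scikit-learn's GroupShuffleSplit so each near-duplicate set lands entirely in either the training or the test split:

from sklearn.model_selection import GroupShuffleSplit

# Assign each example a group id: flagged near-duplicates share the smallest index
# appearing in their duplicate set; all other examples remain singleton groups.
group_ids = np.arange(len(duplicate_results))
for idx, row in duplicate_results.iterrows():
    if row["is_near_duplicate_issue"]:
        group_ids[idx] = min([idx, *row["near_duplicate_sets"]])

# Split so that members of the same near-duplicate set never straddle train/test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=SEED)
train_idx, test_idx = next(splitter.split(X_processed, labels, groups=group_ids))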
This tutorial highlighted a straightforward approach to detect potentially incorrect information in any tabular dataset. Just use Datalab with any ML model: the better the model, the more accurately Datalab can detect data errors!
Easy Mode#
Cleanlab is most effective when you run this code with a good ML model. Try to produce the best ML model you can for your data (instead of the basic model from this tutorial). If you don't know the best ML model for your data, try Cleanlab Studio, which will automatically produce one for you. Super easy to use, Cleanlab Studio is a no-code platform for data-centric AI that automatically detects data issues (more types of issues than this cleanlab package), helps you quickly correct these data issues, confidently labels large subsets of an unlabeled dataset, and provides other smart metadata about each of your data points, all powered by a system that automatically trains and deploys the best ML model for your data. Try it for free!