Detecting Issues in Tabular Data (Numeric/Categorical columns) with Datalab
In this 5-minute quickstart tutorial, we use Datalab to detect various issues in a classification dataset with tabular (numeric/categorical) features. Tabular (or structured) data are typically organized in a row/column format and stored in a SQL database or in file formats such as CSV, Excel, or Parquet. Here we consider a Student Grades dataset containing over 900 students, each with three exam grades, optional notes, and an assigned letter grade (the class label). cleanlab automatically identifies hundreds of examples in this dataset that were mislabeled with an incorrect final grade. You can run the same code from this tutorial to detect incorrect information in your own tabular classification datasets.
Overview of what we’ll do in this tutorial:
- Train a classifier model (here scikit-learn’s HistGradientBoostingClassifier, although any model could be used) and use this classifier to compute (out-of-sample) predicted class probabilities via cross-validation. 
- Create a K nearest neighbours (KNN) graph between the examples in the dataset. 
- Identify issues in the dataset with cleanlab’s Datalab audit applied to the predictions and KNN graph.
Quickstart
Already have (out-of-sample) pred_probs from a model trained on your original data labels? Have a knn_graph computed between dataset examples (reflecting similarity in their feature values)? Run the code below to find issues in your dataset.
from cleanlab import Datalab
lab = Datalab(data=your_dataset, label_name="column_name_of_labels")
lab.find_issues(pred_probs=your_pred_probs, knn_graph=knn_graph)
lab.get_issues()
1. Install required dependencies
You can use pip to install all packages required for this tutorial as follows:
!pip install "cleanlab[datalab]"
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
#     !pip install git+https://github.com/cleanlab/cleanlab.git
[2]:
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.neighbors import NearestNeighbors
from cleanlab import Datalab
SEED = 100  # for reproducibility
np.random.seed(SEED)
random.seed(SEED)
2. Load and process the data
We first load the data features and labels (which are possibly noisy).
[3]:
grades_data = pd.read_csv("https://s.cleanlab.ai/grades-tabular-demo-v2.csv")
grades_data.head()
[3]:
| | stud_ID | exam_1 | exam_2 | exam_3 | notes | letter_grade |
|---|---|---|---|---|---|---|
| 0 | f48f73 | 53.00 | 77.00 | 9.00 | 3 | C | 
| 1 | 0bd4e7 | 81.00 | 64.00 | 80.00 | great participation +10 | B | 
| 2 | 0bd4e7 | 81.00 | 64.00 | 80.00 | great participation +10 | B | 
| 3 | cb9d7a | 0.61 | 0.94 | 0.78 | NaN | C | 
| 4 | 9acca4 | 48.00 | 90.00 | 9.00 | 1 | C | 
[4]:
X_raw = grades_data[["exam_1", "exam_2", "exam_3", "notes"]]
labels = grades_data["letter_grade"]
Next we preprocess the data. Here we apply one-hot encoding to columns with categorical values and standardize the values in numeric columns.
[5]:
cat_features = ["notes"]
X_encoded = pd.get_dummies(X_raw, columns=cat_features, drop_first=True)
numeric_features = ["exam_1", "exam_2", "exam_3"]
scaler = StandardScaler()
X_processed = X_encoded.copy()
X_processed[numeric_features] = scaler.fit_transform(X_encoded[numeric_features])
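To make the two preprocessing steps concrete, here is a minimal sketch on a toy table. The `toy` DataFrame and its values are hypothetical and not taken from the Student Grades data. Note that with `drop_first=True`, a missing note and the dropped first category both end up encoded as all zeros.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (not from the Student Grades dataset).
toy = pd.DataFrame(
    {
        "exam_1": [50.0, 70.0, 90.0, 70.0],
        "notes": ["a", "b", np.nan, "a"],
    }
)

# One-hot encode the categorical column; drop_first=True drops the first category ("a").
toy_encoded = pd.get_dummies(toy, columns=["notes"], drop_first=True)

# Standardize the numeric column to zero mean and unit variance.
toy_processed = toy_encoded.copy()
toy_processed[["exam_1"]] = StandardScaler().fit_transform(toy_encoded[["exam_1"]])

print(toy_processed)
```

If distinguishing missing notes from the dropped category matters for your data, one option is to pass `dummy_na=True` to `pd.get_dummies`.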
Bringing Your Own Data (BYOD)?
Assign your data’s features to the variable X_raw and its labels to the variable labels instead.
3. Select a classification model and compute out-of-sample predicted probabilities
Here we use a simple histogram-based gradient boosting model (similar to XGBoost), but you can choose any suitable scikit-learn model for this tutorial.
[6]:
clf = HistGradientBoostingClassifier()
To find potential labeling errors, cleanlab requires a probabilistic prediction from your model for every datapoint. However, these predictions will be overfitted (and thus unreliable) for examples the model was previously trained on. For the best results, cleanlab should be applied with out-of-sample predicted class probabilities, i.e., predictions for examples that were held out from the model during training.
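The overfitting concern can be illustrated with a small, hedged sketch. It uses a decision tree (a stand-in chosen because it memorizes training data easily, not the model used in this tutorial) on hypothetical random features and random labels, so there is nothing real to learn.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))     # random features
y_toy = rng.integers(0, 2, size=200)  # random labels, unrelated to the features

model = DecisionTreeClassifier(random_state=0)

# In-sample: predict on the same examples the model was trained on.
in_sample_probs = model.fit(X_toy, y_toy).predict_proba(X_toy)
in_sample_acc = (in_sample_probs.argmax(axis=1) == y_toy).mean()

# Out-of-sample: each example is predicted by a model that never saw it.
out_of_sample_probs = cross_val_predict(
    model, X_toy, y_toy, cv=5, method="predict_proba"
)
out_of_sample_acc = (out_of_sample_probs.argmax(axis=1) == y_toy).mean()

print(f"in-sample accuracy:     {in_sample_acc:.2f}")
print(f"out-of-sample accuracy: {out_of_sample_acc:.2f}")
```

The in-sample predictions agree with every given label, so they could never reveal a mislabeled example, whereas the out-of-sample predictions are free to disagree with the given labels.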
K-fold cross-validation is a straightforward way to produce out-of-sample predicted probabilities for every datapoint in the dataset by training K copies of our model on different data subsets and using each copy to predict on the subset of data it did not see during training. We can implement this via the cross_val_predict function from scikit-learn. Make sure that the columns of your pred_probs are properly ordered with respect to the ordering of classes, which for Datalab is: lexicographically sorted by class name.
[7]:
num_crossval_folds = 5
pred_probs = cross_val_predict(
    clf,
    X_processed,
    labels,
    cv=num_crossval_folds,
    method="predict_proba",
)
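As a quick sanity check on the class ordering, the sketch below runs the same kind of cross-validation on hypothetical toy data. LogisticRegression is used only as a fast stand-in model. scikit-learn classifiers order the predict_proba columns by sorted class label, which is the same order `np.unique` returns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
toy_labels = np.array(["C", "A", "B"] * 20)  # hypothetical letter grades
toy_X = rng.normal(size=(len(toy_labels), 2))

toy_pred_probs = cross_val_predict(
    LogisticRegression(), toy_X, toy_labels, cv=5, method="predict_proba"
)

# Column j of toy_pred_probs corresponds to the j-th class in sorted order.
class_order = np.unique(toy_labels)  # np.unique returns sorted unique values
print(class_order)

# Sanity checks: one column per class, and each row is a probability distribution.
assert toy_pred_probs.shape == (len(toy_labels), len(class_order))
assert np.allclose(toy_pred_probs.sum(axis=1), 1.0)
```

The same two assertions can be applied to your own pred_probs and labels before passing them to Datalab.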
4. Construct K nearest neighbours graph
The KNN graph reflects how close each example is when compared to other examples in our dataset (in the numerical space of preprocessed feature values). This similarity information is used by Datalab to identify issues like outliers in our data. For tabular data, think carefully about the most appropriate way to define the similarity between two examples.
Here we use the NearestNeighbors class in sklearn to easily compute this graph (with similarity defined by the Euclidean distance between feature values). The graph should be represented as a sparse matrix with nonzero entries indicating nearest neighbors of each example and their distance.
[8]:
KNN = NearestNeighbors(metric='euclidean')
KNN.fit(X_processed.values)
knn_graph = KNN.kneighbors_graph(mode="distance")
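To see what such a graph looks like, here is a small sketch on hypothetical random features. The `n_neighbors=5` value is written out for clarity (it is also the scikit-learn default).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
toy_features = rng.normal(size=(50, 4))  # hypothetical preprocessed features

toy_knn = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(toy_features)
# Calling kneighbors_graph() without a query array uses the training points
# themselves, and each point is not counted as its own neighbor.
toy_graph = toy_knn.kneighbors_graph(mode="distance")

print(type(toy_graph).__name__, toy_graph.shape)

# Row i stores the distances from example i to its 5 nearest neighbors.
row0 = toy_graph[0]
print("neighbors of example 0:", row0.indices)
print("distances:", row0.data)
```

Each row holds exactly 5 stored distances, so the graph stays sparse even for large datasets.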
5. Use cleanlab to find label issues
Based on the given labels, predicted probabilities, and KNN graph, cleanlab can quickly help us identify suspicious values in our grades table.
We use cleanlab’s Datalab class, which has several ways of loading the data. In this case, we’ll simply wrap the dataset (features and noisy labels) in a dictionary that is used to instantiate a Datalab object, so that it can audit our dataset for various types of issues.
[9]:
data = {"X": X_processed.values, "y": labels}
lab = Datalab(data, label_name="y")
lab.find_issues(pred_probs=pred_probs, knn_graph=knn_graph)
Finding label issues ...
Finding outlier issues ...
Finding near_duplicate issues ...
Finding non_iid issues ...
Finding class_imbalance issues ...
Finding underperforming_group issues ...
Audit complete. 358 issues found in the dataset.
[10]:
lab.report()
Dataset Information: num_examples: 941, num_classes: 5
Here is a summary of various issues found in your data:
    issue_type  num_issues
         label         294
       outlier          46
near_duplicate          17
       non_iid           1
Learn about each issue: https://docs.cleanlab.ai/stable/cleanlab/datalab/guide/issue_type_description.html
See which examples in your dataset exhibit each issue via: `datalab.get_issues(<ISSUE_NAME>)`
Data indices corresponding to top examples of each issue are shown below.
----------------------- label issues -----------------------
About this issue:
        Examples whose given label is estimated to be potentially incorrect
    (e.g. due to annotation error) are flagged as having label issues.
Number of examples with this issue: 294
Overall dataset quality in terms of this issue: 0.7109
Examples representing most severe instances of this issue:
     is_label_issue  label_score given_label predicted_label
3              True     0.000005           C               F
886            True     0.000059           D               B
709            True     0.000104           F               C
723            True     0.000169           A               C
689            True     0.000181           B               D
---------------------- outlier issues ----------------------
About this issue:
        Examples that are very different from the rest of the dataset
    (i.e. potentially out-of-distribution or rare/anomalous instances).
Number of examples with this issue: 46
Overall dataset quality in terms of this issue: 0.3590
Examples representing most severe instances of this issue:
   is_outlier_issue  outlier_score
3              True   3.051882e-07
7              True   7.683133e-05
0              True   6.536582e-04
4              True   8.406589e-04
8              True   5.324246e-03
------------------ near_duplicate issues -------------------
About this issue:
        A (near) duplicate issue refers to two or more examples in
    a dataset that are extremely similar to each other, relative
    to the rest of the dataset.  The examples flagged with this issue
    may be exactly duplicated, or lie atypically close together when
    represented as vectors (i.e. feature embeddings).
Number of examples with this issue: 17
Overall dataset quality in terms of this issue: 0.6165
Examples representing most severe instances of this issue:
     is_near_duplicate_issue  near_duplicate_score near_duplicate_sets  distance_to_nearest_neighbor
12                      True                   0.0        [2, 1, 6, 9]                           0.0
582                     True                   0.0               [185]                           0.0
185                     True                   0.0               [582]                           0.0
187                     True                   0.0                [27]                           0.0
898                     True                   0.0               [637]                           0.0
---------------------- non_iid issues ----------------------
About this issue:
        Whether the dataset exhibits statistically significant
    violations of the IID assumption like:
    changepoints or shift, drift, autocorrelation, etc.
    The specific violation considered is whether the
    examples are ordered such that almost adjacent examples
    tend to have more similar feature values.
Number of examples with this issue: 1
Overall dataset quality in terms of this issue: 0.0000
Examples representing most severe instances of this issue:
     is_non_iid_issue  non_iid_score
865              True       0.515002
837             False       0.556480
622             False       0.593068
329             False       0.593207
920             False       0.618041
Additional Information:
p-value: 1.4386345844794593e-05
Label issues
The above report shows that cleanlab identified many label issues in the data. We can see which examples are estimated to be mislabeled (as well as a numeric quality score quantifying how likely their label is correct) via the get_issues method.
[11]:
issue_results = lab.get_issues("label")
issue_results.head()
[11]:
| | is_label_issue | label_score | given_label | predicted_label |
|---|---|---|---|---|
| 0 | True | 0.000842 | C | F | 
| 1 | False | 0.555944 | B | B | 
| 2 | False | 0.555944 | B | B | 
| 3 | True | 0.000005 | C | F | 
| 4 | True | 0.004374 | C | D | 
To review the most severe label issues, sort the DataFrame above by the label_score column (a lower score represents that the label is less likely to be correct).
Let’s review some of the most likely label errors:
[12]:
sorted_issues = issue_results.sort_values("label_score").index
X_raw.iloc[sorted_issues].assign(
    given_label=labels.iloc[sorted_issues],
    predicted_label=issue_results["predicted_label"].iloc[sorted_issues]
).head()
[12]:
| | exam_1 | exam_2 | exam_3 | notes | given_label | predicted_label |
|---|---|---|---|---|---|---|
| 3 | 0.61 | 0.94 | 0.78 | NaN | C | F | 
| 886 | 89.00 | 95.00 | 73.00 | NaN | D | B | 
| 709 | 64.00 | 70.00 | 86.00 | NaN | F | C | 
| 723 | 53.00 | 89.00 | 78.00 | NaN | A | C | 
| 689 | 77.00 | 51.00 | 70.00 | NaN | B | D | 
The dataframe above shows the original label (given_label) for examples that cleanlab finds most likely to be mislabeled, as well as an alternative predicted_label for each example.
These examples have been labeled incorrectly and should be carefully re-examined - a student with grades of 89, 95 and 73 surely does not deserve a D!
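When reviewing many flagged examples, it can help to keep only the flagged rows and tally which label swaps are suggested. The sketch below uses a hand-built stand-in for the label issues DataFrame, copied from the first five rows of the `lab.get_issues("label")` output shown earlier, so it runs without cleanlab. The name `issue_results_demo` is hypothetical.

```python
import pandas as pd

# Stand-in for lab.get_issues("label"), copied from the first five rows shown earlier.
issue_results_demo = pd.DataFrame(
    {
        "is_label_issue": [True, False, False, True, True],
        "label_score": [0.000842, 0.555944, 0.555944, 0.000005, 0.004374],
        "given_label": ["C", "B", "B", "C", "C"],
        "predicted_label": ["F", "B", "B", "F", "D"],
    }
)

# Keep only the flagged rows, most severe (lowest label_score) first.
flagged = issue_results_demo[issue_results_demo["is_label_issue"]].sort_values(
    "label_score"
)
print(flagged)

# Tally which (given -> predicted) swaps occur among the flagged rows.
swap_counts = flagged.groupby(["given_label", "predicted_label"]).size()
print(swap_counts)
```

On the real results, replacing `issue_results_demo` with `issue_results` should work the same way, since both have the columns used here.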
Outlier issues
According to the report, our dataset contains some outliers. We can see which examples are outliers (and a numeric quality score quantifying how typical each example appears to be) via get_issues. We sort the resulting DataFrame by cleanlab’s outlier quality score to see the most severe outliers in our dataset.
[13]:
outlier_results = lab.get_issues("outlier")
sorted_outliers = outlier_results.sort_values("outlier_score").index
X_raw.iloc[sorted_outliers].head()
[13]:
| | exam_1 | exam_2 | exam_3 | notes |
|---|---|---|---|---|
| 3 | 0.61 | 0.94 | 0.78 | NaN | 
| 7 | 100.00 | 100.00 | 1.00 | NaN | 
| 0 | 53.00 | 77.00 | 9.00 | 3 | 
| 4 | 48.00 | 90.00 | 9.00 | 1 | 
| 8 | 0.00 | 56.00 | 96.00 | <p style="font-size: 18px; color: #ff00ff; bac... | 
The student at index 3 has fractional exam scores, which is likely an error. We also see that the students at index 0 and 4 have numerical values in their notes section, which is also probably unintended. Lastly, we see that the student at index 8 has an HTML string in their notes section, definitely a mistake!
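Once outliers like these have been spotted, simple rule-based checks can confirm how widespread each pattern is. The following is a rough sketch, not part of cleanlab. It rebuilds four of the outlier rows by hand and assumes the notes values are stored as strings. Both heuristics are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Four of the outlier rows shown above (indices 3, 7, 0, 4), rebuilt by hand.
outliers_demo = pd.DataFrame(
    {
        "exam_1": [0.61, 100.0, 53.0, 48.0],
        "exam_2": [0.94, 100.0, 77.0, 90.0],
        "exam_3": [0.78, 1.0, 9.0, 9.0],
        "notes": [np.nan, np.nan, "3", "1"],  # assumed to be stored as strings
    },
    index=[3, 7, 0, 4],
)
exam_cols = ["exam_1", "exam_2", "exam_3"]

# Heuristic 1 (assumption): a score strictly between 0 and 1 suggests a 0-1 scale.
scores = outliers_demo[exam_cols]
fractional = ((scores > 0) & (scores < 1)).any(axis=1)

# Heuristic 2 (assumption): notes that parse as a number are probably misplaced values.
numeric_notes = pd.to_numeric(outliers_demo["notes"], errors="coerce").notna()

print("fractional scores:", outliers_demo[fractional].index.tolist())
print("numeric notes:    ", outliers_demo[numeric_notes].index.tolist())
```

Checks like these only catch the patterns you already know about, which is why the outlier scores are useful for surfacing them in the first place.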
Near-duplicate issues
According to the report, our dataset contains some sets of nearly duplicated examples. We can see which examples are (nearly) duplicated (and a numeric quality score quantifying how dissimilar each example is from its nearest neighbor in the dataset) via get_issues. We sort the resulting DataFrame by cleanlab’s near-duplicate quality score to see the examples in our dataset that are most nearly duplicated.
[14]:
duplicate_results = lab.get_issues("near_duplicate")
duplicate_results.sort_values("near_duplicate_score").head()
[14]:
| | is_near_duplicate_issue | near_duplicate_score | near_duplicate_sets | distance_to_nearest_neighbor |
|---|---|---|---|---|
| 12 | True | 0.0 | [2, 1, 6, 9] | 0.0 | 
| 582 | True | 0.0 | [185] | 0.0 | 
| 185 | True | 0.0 | [582] | 0.0 | 
| 187 | True | 0.0 | [27] | 0.0 | 
| 898 | True | 0.0 | [637] | 0.0 | 
The results above show which examples cleanlab considers nearly duplicated (rows where is_near_duplicate_issue == True). Let’s view some of these examples to see how similar they are.
Using one of the lowest-scoring examples, let’s compare it against its identified near-duplicate set.
[15]:
# Identify the row with the lowest near_duplicate_score
lowest_scoring_duplicate = duplicate_results["near_duplicate_score"].idxmin()
# Extract the indices of the lowest scoring duplicate and its near duplicate sets
indices_to_display = [lowest_scoring_duplicate] + duplicate_results.loc[lowest_scoring_duplicate, "near_duplicate_sets"].tolist()
# Display the relevant rows from the original dataset
X_raw.iloc[indices_to_display]
[15]:
| | exam_1 | exam_2 | exam_3 | notes |
|---|---|---|---|---|
| 1 | 81.0 | 64.0 | 80.0 | great participation +10 | 
| 2 | 81.0 | 64.0 | 80.0 | great participation +10 | 
| 12 | 81.0 | 64.0 | 80.0 | great participation +10 | 
| 6 | 81.0 | 64.0 | 80.0 | great participation +10 | 
| 9 | 81.0 | 64.0 | 80.0 | great participation +10 | 
These examples are exact duplicates! Perhaps the same information was accidentally recorded multiple times in this data.
Similarly, let’s take a look at another example and the identified near-duplicate sets:
[16]:
# Identify the next row not in the previous near duplicate set
second_lowest_scoring_duplicate = duplicate_results["near_duplicate_score"].drop(indices_to_display).idxmin()
# Extract the indices of the second lowest scoring duplicate and its near duplicate sets
next_indices_to_display = [second_lowest_scoring_duplicate] + duplicate_results.loc[second_lowest_scoring_duplicate, "near_duplicate_sets"].tolist()
# Display the relevant rows from the original dataset
X_raw.iloc[next_indices_to_display]
[16]:
| | exam_1 | exam_2 | exam_3 | notes |
|---|---|---|---|---|
| 27 | 86.0 | 80.0 | 89.0 | NaN | 
| 187 | 86.0 | 80.0 | 89.0 | NaN | 
We identified another set of exact duplicates in our dataset! Including near/exact duplicates in a dataset may have unintended effects on models; be wary about splitting them across training/test sets. Learn more about handling near duplicates detected in a dataset from the FAQ.
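One way to avoid splitting duplicates across training and test sets is a group-aware split. The sketch below is one possible approach under stated assumptions, not the method recommended in the FAQ. The `near_duplicate_sets` dictionary is hypothetical (in practice it could be built from the near_duplicate_sets column), and scikit-learn’s GroupShuffleSplit is used to keep each group of duplicates on one side of the split.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical near-duplicate sets for a 10-example dataset:
# example index -> indices of its near duplicates.
near_duplicate_sets = {1: [2, 6], 2: [1, 6], 6: [1, 2], 4: [8], 8: [4]}
n_examples = 10

# Union-find: merge each example with its near duplicates into one group.
parent = list(range(n_examples))


def find(i):
    """Return the representative of the group containing example i."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving keeps lookups fast
        i = parent[i]
    return i


for i, dups in near_duplicate_sets.items():
    for j in dups:
        parent[find(j)] = find(i)

# One group id per example; examples without duplicates form singleton groups.
groups = np.array([find(i) for i in range(n_examples)])

# Split by group so no set of duplicates straddles train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(np.zeros(n_examples), groups=groups))

print("groups:", groups)
print("train:", train_idx, "test:", test_idx)
```

Alternatively, exact duplicates can simply be dropped before training if they were recorded by accident.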
This tutorial highlighted a straightforward approach to detect potentially incorrect information in any tabular dataset. Just use Datalab with any ML model: the better the model, the more accurately Datalab will detect data errors!
Spending too much time on data quality?
Using this open-source package effectively can require significant ML expertise and experimentation, plus handling detected data issues can be cumbersome.
That’s why we built Cleanlab Studio – an automated platform to find and fix issues in your dataset, 100x faster and more accurately. Cleanlab Studio automatically runs optimized data quality algorithms from this package on top of cutting-edge AutoML & Foundation models fit to your data, and helps you fix detected issues via a smart data correction interface. Try it for free!
