Detecting Issues in Tabular Data (Numeric/Categorical columns) with Datalab#
In this 5-minute quickstart tutorial, we use Datalab to detect various issues in a classification dataset with tabular (numeric/categorical) features. Tabular (or structured) data are typically organized in a row/column format and stored in a SQL database or in file formats like CSV, Excel, or Parquet. Here we consider a Student Grades dataset, which contains over 900 students, each with three exam grades, some optional notes, and an assigned letter grade (their class label). cleanlab automatically identifies hundreds of examples in this dataset that were mislabeled with an incorrect final grade. You can run the same code from this tutorial to detect incorrect information in your own tabular classification datasets.
Overview of what we’ll do in this tutorial:
Train a classifier model (here scikit-learn’s HistGradientBoostingClassifier, although any model could be used) and use this classifier to compute (out-of-sample) predicted class probabilities via cross-validation.
Create a K nearest neighbours (KNN) graph between the examples in the dataset.
Identify issues in the dataset with cleanlab's Datalab audit applied to the predictions and KNN graph.
Quickstart
Already have (out-of-sample) pred_probs from a model trained on your original data labels? Have a knn_graph computed between dataset examples (reflecting similarity in their feature values)? Run the code below to find issues in your dataset.
from cleanlab import Datalab
lab = Datalab(data=your_dataset, label_name="column_name_of_labels")
lab.find_issues(pred_probs=your_pred_probs, knn_graph=knn_graph)
lab.get_issues()
1. Install required dependencies#
You can use pip to install all packages required for this tutorial as follows:
!pip install scikit-learn datasets
!pip install "cleanlab[datalab]"
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
# !pip install git+https://github.com/cleanlab/cleanlab.git
[2]:
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.neighbors import NearestNeighbors
from cleanlab import Datalab
SEED = 100 # for reproducibility
np.random.seed(SEED)
random.seed(SEED)
2. Load and process the data#
We first load the data features and labels (which are possibly noisy).
[3]:
grades_data = pd.read_csv("https://s.cleanlab.ai/grades-tabular-demo-v2.csv")
grades_data.head()
[3]:
|   | stud_ID | exam_1 | exam_2 | exam_3 | notes | letter_grade |
|---|---------|--------|--------|--------|-------|--------------|
| 0 | f48f73 | 53.00 | 77.00 | 9.00 | 3 | C |
| 1 | 0bd4e7 | 81.00 | 64.00 | 80.00 | great participation +10 | B |
| 2 | 0bd4e7 | 81.00 | 64.00 | 80.00 | great participation +10 | B |
| 3 | cb9d7a | 0.61 | 0.94 | 0.78 | NaN | C |
| 4 | 9acca4 | 48.00 | 90.00 | 9.00 | 1 | C |
[4]:
X_raw = grades_data[["exam_1", "exam_2", "exam_3", "notes"]]
labels = grades_data["letter_grade"]
Next we preprocess the data. Here we apply one-hot encoding to columns with categorical values and standardize the values in numeric columns.
[5]:
cat_features = ["notes"]
X_encoded = pd.get_dummies(X_raw, columns=cat_features, drop_first=True)
numeric_features = ["exam_1", "exam_2", "exam_3"]
scaler = StandardScaler()
X_processed = X_encoded.copy()
X_processed[numeric_features] = scaler.fit_transform(X_encoded[numeric_features])
Bringing Your Own Data (BYOD)? Assign your data's features to variable X and its labels to variable labels instead.
3. Select a classification model and compute out-of-sample predicted probabilities#
Here we use a simple histogram-based gradient boosting model (similar to XGBoost), but you can choose any suitable scikit-learn model for this tutorial.
[6]:
clf = HistGradientBoostingClassifier()
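Any scikit-learn compatible classifier could be substituted here. As a hypothetical alternative (shown only for illustration and not used in the rest of this tutorial), a random forest would also work:

from sklearn.ensemble import RandomForestClassifier

# Hypothetical alternative model (any scikit-learn classifier works with cross_val_predict below)
alternative_clf = RandomForestClassifier(n_estimators=100, random_state=SEED)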
To find potential labeling errors, cleanlab requires a probabilistic prediction from your model for every datapoint. However, these predictions will be overfitted (and thus unreliable) for examples the model was previously trained on. For the best results, cleanlab should be applied with out-of-sample predicted class probabilities, i.e., on examples held out from the model during training.
K-fold cross-validation is a straightforward way to produce out-of-sample predicted probabilities for every datapoint in the dataset: we train K copies of our model on different data subsets and use each copy to predict on the subset of data it did not see during training. Make sure that the columns of your pred_probs are properly ordered with respect to the ordering of classes, which for Datalab is: lexicographically sorted by class name. We can implement all of this via the cross_val_predict method from scikit-learn.
[7]:
num_crossval_folds = 5
pred_probs = cross_val_predict(
clf,
X_processed,
labels,
cv=num_crossval_folds,
method="predict_proba",
)
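As an optional sanity check (just a sketch using the variables defined above), you can confirm that pred_probs has one column per class and that its columns follow the lexicographic class ordering Datalab expects:

# cross_val_predict orders columns according to clf.classes_, which scikit-learn
# sorts lexicographically for string labels -- the same ordering Datalab expects.
class_order = np.unique(labels)  # lexicographically sorted class names (here the letter grades)
assert pred_probs.shape == (len(labels), len(class_order))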
4. Construct K nearest neighbours graph#
The KNN graph reflects how close each example is when compared to other examples in our dataset (in the numerical space of preprocessed feature values). This similarity information is used by Datalab to identify issues like outliers in our data. For tabular data, think carefully about the most appropriate way to define the similarity between two examples.
Here we use the NearestNeighbors class in scikit-learn to easily compute this graph (with similarity defined by the Euclidean distance between feature values). The graph should be represented as a sparse matrix with nonzero entries indicating the nearest neighbors of each example and their distances.
[8]:
KNN = NearestNeighbors(metric='euclidean')
KNN.fit(X_processed.values)
knn_graph = KNN.kneighbors_graph(mode="distance")
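If you want to peek at what this sparse matrix contains (optional), each row stores the nearest neighbors of the corresponding example; a minimal sketch:

# For any example, the nonzero entries of its row give its nearest neighbors:
# the column indices are the neighbor indices, the stored values are distances.
first_row = knn_graph[0]
neighbor_indices, neighbor_distances = first_row.indices, first_row.data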
5. Use cleanlab to find label issues#
Based on the given labels, predicted probabilities, and KNN graph, cleanlab can quickly help us identify suspicious values in our grades table.
We use cleanlab's Datalab class, which has several ways of loading the data. In this case, we'll simply wrap the dataset (features and noisy labels) in a dictionary that is used to instantiate a Datalab object so that it can audit our dataset for various types of issues.
[9]:
data = {"X": X_processed.values, "y": labels}
lab = Datalab(data, label_name="y")
lab.find_issues(pred_probs=pred_probs, knn_graph=knn_graph)
Finding label issues ...
Finding outlier issues ...
Finding near_duplicate issues ...
Finding non_iid issues ...
Finding class_imbalance issues ...
Finding underperforming_group issues ...
Audit complete. 358 issues found in the dataset.
[10]:
lab.report()
Here is a summary of the different kinds of issues found in the data:
issue_type num_issues
label 294
outlier 46
near_duplicate 17
non_iid 1
Dataset Information: num_examples: 941, num_classes: 5
----------------------- label issues -----------------------
About this issue:
Examples whose given label is estimated to be potentially incorrect
(e.g. due to annotation error) are flagged as having label issues.
Number of examples with this issue: 294
Overall dataset quality in terms of this issue: 0.7109
Examples representing most severe instances of this issue:
is_label_issue label_score given_label predicted_label
3 True 0.000005 C F
886 True 0.000059 D B
709 True 0.000104 F C
723 True 0.000169 A C
689 True 0.000181 B D
---------------------- outlier issues ----------------------
About this issue:
Examples that are very different from the rest of the dataset
(i.e. potentially out-of-distribution or rare/anomalous instances).
Number of examples with this issue: 46
Overall dataset quality in terms of this issue: 0.3590
Examples representing most severe instances of this issue:
is_outlier_issue outlier_score
3 True 3.051882e-07
7 True 7.683133e-05
0 True 6.536582e-04
4 True 8.406589e-04
8 True 5.324246e-03
------------------ near_duplicate issues -------------------
About this issue:
A (near) duplicate issue refers to two or more examples in
a dataset that are extremely similar to each other, relative
to the rest of the dataset. The examples flagged with this issue
may be exactly duplicated, or lie atypically close together when
represented as vectors (i.e. feature embeddings).
Number of examples with this issue: 17
Overall dataset quality in terms of this issue: 0.6165
Examples representing most severe instances of this issue:
is_near_duplicate_issue near_duplicate_score near_duplicate_sets distance_to_nearest_neighbor
12 True 0.0 [2, 1, 6, 9] 0.0
582 True 0.0 [185] 0.0
185 True 0.0 [582] 0.0
187 True 0.0 [27] 0.0
898 True 0.0 [637] 0.0
---------------------- non_iid issues ----------------------
About this issue:
Whether the dataset exhibits statistically significant
violations of the IID assumption like:
changepoints or shift, drift, autocorrelation, etc.
The specific violation considered is whether the
examples are ordered such that almost adjacent examples
tend to have more similar feature values.
Number of examples with this issue: 1
Overall dataset quality in terms of this issue: 0.0014
Examples representing most severe instances of this issue:
is_non_iid_issue non_iid_score
595 True 0.702427
147 False 0.711186
157 False 0.721394
771 False 0.731979
898 False 0.740335
Additional Information:
p-value: 0.0014153602099278074
Label issues#
The above report shows that cleanlab identified many label issues in the data. We can see which examples are estimated to be mislabeled (as well as a numeric quality score quantifying how likely their label is correct) via the get_issues
method.
[11]:
issue_results = lab.get_issues("label")
issue_results.head()
[11]:
|   | is_label_issue | label_score | given_label | predicted_label |
|---|----------------|-------------|-------------|-----------------|
| 0 | True | 0.000842 | C | F |
| 1 | False | 0.555944 | B | B |
| 2 | False | 0.555944 | B | B |
| 3 | True | 0.000005 | C | F |
| 4 | True | 0.004374 | C | D |
To review the most severe label issues, sort the DataFrame above by the label_score column (a lower score indicates that the label is less likely to be correct).
Let’s review some of the most likely label errors:
[12]:
sorted_issues = issue_results.sort_values("label_score").index
X_raw.iloc[sorted_issues].assign(
given_label=labels.iloc[sorted_issues],
predicted_label=issue_results["predicted_label"].iloc[sorted_issues]
).head()
[12]:
|     | exam_1 | exam_2 | exam_3 | notes | given_label | predicted_label |
|-----|--------|--------|--------|-------|-------------|-----------------|
| 3   | 0.61 | 0.94 | 0.78 | NaN | C | F |
| 886 | 89.00 | 95.00 | 73.00 | NaN | D | B |
| 709 | 64.00 | 70.00 | 86.00 | NaN | F | C |
| 723 | 53.00 | 89.00 | 78.00 | NaN | A | C |
| 689 | 77.00 | 51.00 | 70.00 | NaN | B | D |
The dataframe above shows the original label (given_label) for the examples that cleanlab finds most likely to be mislabeled, as well as an alternative predicted_label for each example. These examples have likely been labeled incorrectly and should be carefully re-examined - a student with grades of 89, 95 and 73 surely does not deserve a D!
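How you act on these flagged examples depends on your application. As one possible (hedged) follow-up using the variables defined above, you could collect the flagged rows for manual review and, after verifying them, exclude them before retraining a model:

flagged = issue_results["is_label_issue"]  # boolean mask of examples flagged as label issues
examples_to_review = X_raw[flagged].assign(given_label=labels[flagged])

# Only drop flagged rows after actually reviewing them:
X_cleaned = X_processed[~flagged.values]
labels_cleaned = labels[~flagged.values]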
Outlier issues#
According to the report, our dataset contains some outliers. We can see which examples are outliers (and a numeric quality score quantifying how typical each example appears to be) via get_issues. We sort the resulting DataFrame by cleanlab's outlier quality score to see the most severe outliers in our dataset.
[13]:
outlier_results = lab.get_issues("outlier")
sorted_outliers = outlier_results.sort_values("outlier_score").index
X_raw.iloc[sorted_outliers].head()
[13]:
|   | exam_1 | exam_2 | exam_3 | notes |
|---|--------|--------|--------|-------|
| 3 | 0.61 | 0.94 | 0.78 | NaN |
| 7 | 100.00 | 100.00 | 1.00 | NaN |
| 0 | 53.00 | 77.00 | 9.00 | 3 |
| 4 | 48.00 | 90.00 | 9.00 | 1 |
| 8 | 0.00 | 56.00 | 96.00 | <p style="font-size: 18px; color: #ff00ff; bac... |
The student at index 3 has fractional exam scores, which is likely an error. We also see that the students at indices 0 and 4 have numerical values in their notes section, which is also probably unintended. Lastly, the student at index 8 has an HTML string in their notes section, definitely a mistake!
Near-duplicate issues#
According to the report, our dataset contains some sets of nearly duplicated examples. We can see which examples are (nearly) duplicated (and a numeric quality score quantifying how dissimilar each example is from its nearest neighbor in the dataset) via get_issues. We sort the resulting DataFrame by cleanlab's near-duplicate quality score to see the examples in our dataset that are most nearly duplicated.
[14]:
duplicate_results = lab.get_issues("near_duplicate")
duplicate_results.sort_values("near_duplicate_score").head()
[14]:
|     | is_near_duplicate_issue | near_duplicate_score | near_duplicate_sets | distance_to_nearest_neighbor |
|-----|-------------------------|----------------------|---------------------|------------------------------|
| 12  | True | 0.0 | [2, 1, 6, 9] | 0.0 |
| 582 | True | 0.0 | [185] | 0.0 |
| 185 | True | 0.0 | [582] | 0.0 |
| 187 | True | 0.0 | [27] | 0.0 |
| 898 | True | 0.0 | [637] | 0.0 |
The results above show which examples cleanlab considers nearly duplicated (rows where is_near_duplicate_issue == True). Let's view some of these flagged examples to see how similar they are. For instance, we can compare example 690 against the example cleanlab identified in its near_duplicate_sets (example 246):
[15]:
X_raw.iloc[[690, 246]]
[15]:
|     | exam_1 | exam_2 | exam_3 | notes |
|-----|--------|--------|--------|-------|
| 690 | 78.0 | 58.0 | 86.0 | great final presentation +10 |
| 246 | 78.0 | 58.0 | 86.0 | great final presentation +10 |
These examples are exact duplicates! Perhaps the same information was accidentally recorded twice in this data.
Similarly, let’s take a look at example 185 and the identified near duplicate, example 582:
[16]:
X_raw.iloc[[185, 582]]
[16]:
|     | exam_1 | exam_2 | exam_3 | notes |
|-----|--------|--------|--------|-------|
| 185 | 90.0 | 67.0 | 77.0 | missed class frequently -10 |
| 582 | 90.0 | 67.0 | 77.0 | missed class frequently -10 |
We identified another exact duplicate in our dataset! Including near/exact duplicates in a dataset may have unintended effects on models; be wary about splitting them across training/test sets. Learn more about handling near duplicates detected in a dataset from the FAQ.
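As a concrete illustration of this advice, here is a minimal sketch (using the duplicate_results DataFrame from above; the simple grouping rule is illustrative, not part of cleanlab's API) that relies on scikit-learn's GroupShuffleSplit so each near-duplicate set lands entirely in either the training or the test split:

from sklearn.model_selection import GroupShuffleSplit

# Assign each example a group id: flagged near-duplicates share the smallest index
# appearing in their duplicate set; all other examples remain singleton groups.
group_ids = np.arange(len(duplicate_results))
for idx, row in duplicate_results.iterrows():
    if row["is_near_duplicate_issue"]:
        group_ids[idx] = min([idx, *row["near_duplicate_sets"]])

# Split so that members of the same near-duplicate set never straddle train/test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=SEED)
train_idx, test_idx = next(splitter.split(X_processed, labels, groups=group_ids))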
This tutorial highlighted a straightforward approach to detect potentially incorrect information in any tabular dataset. Just use Datalab with any ML model: the better the model, the more accurately Datalab can detect data errors!
Easy Mode#
Cleanlab is most effective when you run this code with a good ML model. Try to produce the best ML model you can for your data (instead of the basic model from this tutorial). If you don't know the best ML model for your data, try Cleanlab Studio, which will automatically produce one for you. Super easy to use, Cleanlab Studio is a no-code platform for data-centric AI that automatically detects data issues (more types of issues than this cleanlab package), helps you quickly correct these data issues, confidently labels large subsets of an unlabeled dataset, and provides other smart metadata about each of your data points, all powered by a system that automatically trains and deploys the best ML model for your data. Try it for free!