Text Classification with TensorFlow, Keras, and Cleanlab

In this quick-start tutorial, we use cleanlab to find potential label errors in the IMDb movie review text classification dataset. This dataset contains 50,000 text reviews, each labeled with a binary sentiment polarity label indicating whether the review is positive (1) or negative (0). cleanlab will shortlist the examples that confuse our ML model the most, many of which are potential label errors, edge cases, or otherwise ambiguous examples.

Overview of what we’ll do in this tutorial:

  • Build a simple TensorFlow & Keras neural net and wrap it with SciKeras to make it scikit-learn compatible.

  • Use this classifier to compute out-of-sample predicted probabilities, pred_probs, via cross validation.

  • Identify potential label errors in the data with cleanlab’s find_label_issues method.

  • Train a more robust version of the same neural net via cleanlab’s CleanLearning wrapper.

Quickstart

Already have an sklearn-compatible model, text data, and given labels? Run the code below to train your model and get label issues.

from cleanlab.classification import CleanLearning

cl = CleanLearning(model)
_ = cl.fit(train_data, labels)
label_issues = cl.get_label_issues()
preds = cl.predict(test_data) # predictions from a version of your model
                              # trained on auto-cleaned data

Is your model/data not compatible with CleanLearning? You can instead run cross-validation on your model to get out-of-sample pred_probs. Then run the code below to get label issue indices ranked by their inferred severity.

from cleanlab.filter import find_label_issues

ranked_label_issues = find_label_issues(
    labels,
    pred_probs,
    return_indices_ranked_by="self_confidence",
)

1. Install required dependencies

You can use pip to install all packages required for this tutorial as follows:

!pip install scikit-learn tensorflow tensorflow-datasets scikeras
!pip install cleanlab
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
#     !pip install git+https://github.com/cleanlab/cleanlab.git
[2]:
import re
import string
import pandas as pd
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import cross_val_predict
import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_datasets as tfds
from scikeras.wrappers import KerasClassifier

SEED = 123456  # for reproducibility

2. Load and preprocess the IMDb text dataset

This dataset is provided in TensorFlow’s Datasets.

[4]:
%%capture

raw_full_ds = tfds.load(
    name="imdb_reviews", split=("train+test"), batch_size=-1, as_supervised=True
)
raw_full_texts, full_labels = tfds.as_numpy(raw_full_ds)
[5]:
num_classes = len(set(full_labels))
print(f"Classes: {set(full_labels)}")
Classes: {0, 1}

Let’s print the first example.

[6]:
i = 0
print(f"Example Label: {full_labels[i]}")
print(f"Example Text: {raw_full_texts[i]}")
Example Label: 0
Example Text: b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."

The data are stored as two numpy arrays:

  1. raw_full_texts for the movie reviews in text format,

  2. full_labels for the labels.

Bringing Your Own Data (BYOD)?

You can easily replace the above with your own text dataset, and continue with the rest of the tutorial.

Your classes (and entries of full_labels) should be represented as integer indices 0, 1, …, num_classes - 1. For example, if your dataset has 7 examples from 3 classes, full_labels might be: np.array([2,0,0,1,2,0,1])
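
If your own labels are stored as strings, one possible way to obtain integer indices is sketched below. The class names here are made up for illustration; np.unique sorts the distinct class names and returns the index of each label within that sorted list.

```python
import numpy as np

# Hypothetical string labels for 7 examples from 3 classes.
str_labels = np.array(["pos", "neg", "neg", "neutral", "pos", "neg", "neutral"])

# class_names is the sorted array of distinct classes;
# int_labels[i] is the position of str_labels[i] in class_names.
class_names, int_labels = np.unique(str_labels, return_inverse=True)

print(class_names)  # sorted class names
print(int_labels)   # integer labels in 0, 1, ..., num_classes - 1
```

Keeping class_names around lets you map integer predictions back to the original class names later.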

Define a function to preprocess the text data by:

  1. Converting it to lower case

  2. Removing the HTML break tags: <br />

  3. Removing any punctuation marks

[7]:
def preprocess_text(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    return tf.strings.regex_replace(stripped_html, f"[{re.escape(string.punctuation)}]", "")
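
To see what these three steps do to a single review, here is a rough pure-Python counterpart of the function above. It is only meant for inspecting the transformation on a plain string; preprocess_text_py and the sample text are illustrative names, not part of the tutorial's pipeline.

```python
import re
import string


def preprocess_text_py(text: str) -> str:
    """Pure-Python analogue of preprocess_text for a single string."""
    lowercase = text.lower()
    # Replace each HTML break tag with a single space.
    stripped_html = lowercase.replace("<br />", " ")
    # Delete every ASCII punctuation character.
    return re.sub(f"[{re.escape(string.punctuation)}]", "", stripped_html)


sample = "Great movie!<br /><br />Don't miss it."
cleaned = preprocess_text_py(sample)
print(repr(cleaned))
```

Note that removing punctuation also merges contractions ("Don't" becomes "dont"), and the two break tags leave two consecutive spaces behind.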

We use a TextVectorization layer to preprocess, tokenize, and vectorize our text data, thus making it suitable as input for a neural network.

[8]:
max_features = 10000
sequence_length = 250

vectorize_layer = layers.TextVectorization(
    standardize=preprocess_text,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

Adapting vectorize_layer to the text data creates a mapping of each token (i.e. word) to an integer index. Subsequently, we can vectorize our text data by using this mapping. Finally, we’ll also convert our text data into a numpy array as required by cleanlab.

[9]:
%%capture

vectorize_layer.adapt(raw_full_texts)
full_texts = vectorize_layer(raw_full_texts)
full_texts = full_texts.numpy()
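
The token-to-integer mapping can be pictured with the toy sketch below. It is not the Keras implementation: it assumes whitespace tokenization, reserves index 0 for padding and index 1 for out-of-vocabulary tokens (as TextVectorization does in "int" mode), and breaks frequency ties alphabetically, which is an assumption made only to keep the example deterministic. The helper names build_vocab and vectorize are hypothetical.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Map the most frequent tokens to indices 2, 3, ... (0 = padding, 1 = unknown)."""
    counts = Counter(token for text in texts for token in text.split())
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [token for token, _ in by_frequency[: max_tokens - 2]]
    return {token: index + 2 for index, token in enumerate(kept)}


def vectorize(text, vocab, sequence_length):
    """Convert text to a fixed-length list of integer token indices."""
    ids = [vocab.get(token, 1) for token in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))  # right-pad with zeros


toy_texts = ["good movie", "bad movie", "good good plot"]
toy_vocab = build_vocab(toy_texts, max_tokens=5)
toy_vector = vectorize("good plot twist", toy_vocab, sequence_length=5)
print(toy_vocab)
print(toy_vector)
```

Tokens outside the kept vocabulary ("plot" and "twist" here) all share the unknown index, and short texts are padded to the fixed length.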

3. Define a classification model and compute out-of-sample predicted probabilities

Here, we build a simple neural network for classification with TensorFlow and Keras.

[10]:
def get_net():
    net = tf.keras.Sequential(
        [
            tf.keras.Input(shape=(None,), dtype="int64"),
            layers.Embedding(max_features + 1, 16),
            layers.Dropout(0.2),
            layers.GlobalAveragePooling1D(),
            layers.Dropout(0.2),
            layers.Dense(num_classes),
            layers.Softmax()
        ]
    )  # outputs predicted probabilities for every class (one column per class)

    net.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
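        # Caution: CategoricalAccuracy expects one-hot labels. With the integer labels used
        # here, the logged accuracy stays near 0.5; SparseCategoricalAccuracy is the matching metric.
        metrics=tf.keras.metrics.CategoricalAccuracy(),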
    )
    return net

As some of cleanlab’s features require scikit-learn compatibility, we will need to adapt the above TensorFlow & Keras neural net accordingly. SciKeras is a convenient package that makes this really easy.

[11]:
model = KerasClassifier(get_net(), epochs=10)

To identify label issues, cleanlab requires a probabilistic prediction from your model for every datapoint that should be considered. However, these predictions will be overfit (and thus unreliable) for datapoints the model was previously trained on. cleanlab is intended to be used only with out-of-sample predicted probabilities, i.e. on datapoints held out from the model during training.

K-fold cross-validation is a straightforward way to produce out-of-sample predicted probabilities for every datapoint in the dataset, by training K copies of our model on different data subsets and using each copy to predict on the subset of data it did not see during training. We can obtain cross-validated out-of-sample predicted probabilities from any classifier via a simple scikit-learn wrapper:

[12]:
num_crossval_folds = 3  # for efficiency; values like 5 or 10 will generally work better
pred_probs = cross_val_predict(
    model,
    full_texts,
    full_labels,
    cv=num_crossval_folds,
    method="predict_proba",
)
INFO:tensorflow:Assets written to: ram:///tmp/tmplbt_4o_t/assets
Epoch 1/10
1042/1042 [==============================] - 4s 3ms/step - loss: 0.5955 - categorical_accuracy: 0.4451
Epoch 2/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.3968 - categorical_accuracy: 0.4869
Epoch 3/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.3183 - categorical_accuracy: 0.4902
Epoch 4/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.2795 - categorical_accuracy: 0.4928
Epoch 5/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.2533 - categorical_accuracy: 0.4937
Epoch 6/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.2339 - categorical_accuracy: 0.4956
Epoch 7/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.2185 - categorical_accuracy: 0.4958
Epoch 8/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.2075 - categorical_accuracy: 0.4948
Epoch 9/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.1961 - categorical_accuracy: 0.4956
Epoch 10/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.1885 - categorical_accuracy: 0.4957
521/521 [==============================] - 1s 935us/step
INFO:tensorflow:Assets written to: ram:///tmp/tmpifbw_anv/assets
Epoch 1/10
1042/1042 [==============================] - 4s 3ms/step - loss: 0.5966 - categorical_accuracy: 0.4631
Epoch 2/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.3980 - categorical_accuracy: 0.4871
Epoch 3/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.3184 - categorical_accuracy: 0.4897
Epoch 4/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.2799 - categorical_accuracy: 0.4912
Epoch 5/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.2532 - categorical_accuracy: 0.4935
Epoch 6/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.2333 - categorical_accuracy: 0.4929
Epoch 7/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.2187 - categorical_accuracy: 0.4943
Epoch 8/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.2064 - categorical_accuracy: 0.4944
Epoch 9/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.1955 - categorical_accuracy: 0.4957
Epoch 10/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.1872 - categorical_accuracy: 0.4955
521/521 [==============================] - 1s 920us/step
INFO:tensorflow:Assets written to: ram:///tmp/tmp1gtftt__/assets
Epoch 1/10
1042/1042 [==============================] - 4s 3ms/step - loss: 0.5977 - categorical_accuracy: 0.3976
Epoch 2/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.3960 - categorical_accuracy: 0.4897
Epoch 3/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.3171 - categorical_accuracy: 0.4923
Epoch 4/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.2767 - categorical_accuracy: 0.4929
Epoch 5/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.2519 - categorical_accuracy: 0.4932
Epoch 6/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.2333 - categorical_accuracy: 0.4939
Epoch 7/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.2174 - categorical_accuracy: 0.4956
Epoch 8/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.2051 - categorical_accuracy: 0.4948
Epoch 9/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.1944 - categorical_accuracy: 0.4945
Epoch 10/10
1042/1042 [==============================] - 3s 3ms/step - loss: 0.1852 - categorical_accuracy: 0.4944
521/521 [==============================] - 1s 925us/step
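
Conceptually, the cross_val_predict call does something like the sketch below. This is a simplified illustration rather than scikit-learn's implementation: it uses shuffled, unstratified folds and a toy nearest-centroid classifier on synthetic 2D data, and the function names are hypothetical. The key point is that every row of pred_probs is filled in by a model that never saw that datapoint during training.

```python
import numpy as np


def nearest_centroid_proba(X_train, y_train, X_test, num_classes):
    """Toy classifier: softmax over negative distances to the class centroids."""
    centroids = np.stack([X_train[y_train == k].mean(axis=0) for k in range(num_classes)])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    logits = -dists
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def manual_cross_val_pred_probs(X, y, num_classes, num_folds=3, seed=0):
    """Out-of-sample predicted probabilities via K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), num_folds)
    pred_probs = np.zeros((len(y), num_classes))
    for held_out in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[held_out] = False  # train only on the other folds
        pred_probs[held_out] = nearest_centroid_proba(
            X[train_mask], y[train_mask], X[held_out], num_classes
        )
    return pred_probs


# Synthetic data: two well-separated Gaussian blobs, one per class.
data_rng = np.random.default_rng(0)
X_toy = np.vstack([data_rng.normal(-2, 1, (30, 2)), data_rng.normal(2, 1, (30, 2))])
y_toy = np.array([0] * 30 + [1] * 30)

toy_pred_probs = manual_cross_val_pred_probs(X_toy, y_toy, num_classes=2)
print(toy_pred_probs.shape)
```

The result has one row per datapoint and one column per class, which is the format cleanlab expects.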

An additional benefit of cross-validation is that it facilitates more reliable evaluation of our model than a single training/validation split.

[13]:
loss = log_loss(full_labels, pred_probs)  # score to evaluate probabilistic predictions, lower values are better
print(f"Cross-validated estimate of log loss: {loss:.3f}")
Cross-validated estimate of log loss: 0.289

4. Use cleanlab to find potential label errors

Based on the given labels and out-of-sample predicted probabilities, cleanlab can quickly help us identify label issues in our dataset. For a dataset with N examples from K classes, the labels should be a 1D array of length N and the predicted probabilities should be a 2D (N x K) array. Here we request that the indices of the identified label issues be sorted by cleanlab’s self-confidence score, which measures the quality of each given label via the probability assigned to it in our model’s prediction.

[14]:
from cleanlab.filter import find_label_issues

ranked_label_issues = find_label_issues(
    labels=full_labels, pred_probs=pred_probs, return_indices_ranked_by="self_confidence"
)
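
The self-confidence score used for ranking is simply the predicted probability of each example's given label. The numpy sketch below computes it for a made-up set of four examples. It only illustrates the ranking score; deciding which examples count as label issues in the first place is done inside find_label_issues and is not reproduced here.

```python
import numpy as np

# Made-up inputs: N = 4 examples, K = 2 classes.
toy_labels = np.array([0, 1, 1, 0])
toy_pred_probs = np.array(
    [
        [0.9, 0.1],
        [0.2, 0.8],
        [0.7, 0.3],
        [0.4, 0.6],
    ]
)

# Probability the model assigns to the given label of each example.
self_confidence = toy_pred_probs[np.arange(len(toy_labels)), toy_labels]

# Lowest self-confidence first, i.e. most suspicious labels first.
ranking = np.argsort(self_confidence)

print(self_confidence)
print(ranking)
```

Example 2 is ranked first because the model assigns only 0.3 probability to its given label.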

Let’s review some of the most likely label errors:

[15]:
print(
    f"cleanlab found {len(ranked_label_issues)} potential label errors.\n"
    f"Here are indices of the top 10 most likely errors: \n {ranked_label_issues[:10]}"
)
cleanlab found 2588 potential label errors.
Here are indices of the top 10 most likely errors:
 [10404 44582 30151 43777 16633 13853 21165 21348 22370 13912]

To help us inspect these datapoints, we define a function to print any example from the dataset. We then display some of the top-ranked label issues identified by cleanlab:

[16]:
def print_as_df(index):
    return pd.DataFrame(
        {"texts": raw_full_texts[index], "labels": full_labels[index]},
        [index]
    )

Here’s a review labeled as positive (1), but it should be negative (0). Some noteworthy snippets extracted from the review text:

  • “…incredibly awful score…”

  • “…worst Foley work ever done.”

  • “…script is incomprehensible…”

  • “…editing is just bizarre.”

  • “…atrocious pan and scan…”

  • “…incoherent mess…”

  • “…amateur directing there.”

[17]:
print_as_df(44582)
[17]:
texts labels
44582 b'This movie is stuffed full of stock Horror movie goodies: chained lunatics, pre-meditated murder, a mad (vaguely lesbian) female scientist with an even madder father who wears a mask because of his horrible disfigurement, poisoning, spooky castles, werewolves (male and female), adultery, slain lovers, Tibetan mystics, the half-man/half-plant victim of some unnamed experiment, grave robbing, mind control, walled up bodies, a car crash on a lonely road, electrocution, knights in armour - the lot, all topped off with an incredibly awful score and some of the worst Foley work ever done.<br /><br />The script is incomprehensible (even by badly dubbed Spanish Horror movie standards) and some of the editing is just bizarre. In one scene where the lead female evil scientist goes to visit our heroine in her bedroom for one of the badly dubbed: "That is fantastical. I do not understand. Explain to me again how this is..." exposition scenes that litter this movie, there is a sudden hand held cutaway of the girl\'s thighs as she gets out of bed for no apparent reason at all other than to cover a cut in the bad scientist\'s "Mwahaha! All your werewolfs belong mine!" speech. Though why they went to the bother I don\'t know because there are plenty of other jarring jump cuts all over the place - even allowing for the atrocious pan and scan of the print I saw.<br /><br />The Director was, according to one interview with the star, drunk for most of the shoot and the film looks like it. It is an incoherent mess. It\'s made even more incoherent by the inclusion of werewolf rampage footage from a different film The Mark of the Wolf Man (made 4 years earlier, featuring the same actor but playing the part with more aggression and with a different shirt and make up - IS there a word in Spanish for "Continuity"?) 
and more padding of another actor in the wolfman get-up ambling about in long shot.<br /><br />The music is incredibly bad varying almost at random from full orchestral creepy house music, to bosannova, to the longest piano and gong duet ever recorded. (Thinking about it, it might not have been a duet. It might have been a solo. The piano part was so simple it could have been picked out with one hand while the player whacked away at the gong with the other.) <br /><br />This is one of the most bewilderedly trance-state inducing bad movies of the year so far for me. Enjoy.<br /><br />Favourite line: "Ilona! This madness and perversity will turn against you!" How true.<br /><br />Favourite shot: The lover, discovering his girlfriend slain, dropping the candle in a cartoon-like demonstration of surprise. Rank amateur directing there.' 1

Here’s a review labeled as positive (1), but it should be negative (0). Some noteworthy snippets extracted from the review text:

  • “…film seems cheap.”

  • “…unbelievably bad…”

  • “…cinematography is badly lit…”

  • “…everything looking grainy and ugly.”

  • “…sound is so terrible…”

[18]:
print_as_df(10404)
[18]:
texts labels
10404 b'This low-budget erotic thriller that has some good points, but a lot more bad one. The plot revolves around a female lawyer trying to clear her lover who is accused of murdering his wife. Being a soft-core film, that entails her going undercover at a strip club and having sex with possible suspects. As plots go for this type of genre, not to bad. The script is okay, and the story makes enough sense for someone up at 2 AM watching this not to notice too many plot holes. But everything else in the film seems cheap. The lead actors aren\'t that bad, but pretty much all the supporting ones are unbelievably bad (one girl seems like she is drunk and/or high). The cinematography is badly lit, with everything looking grainy and ugly. The sound is so terrible that you can barely hear what people are saying. The worst thing in this movie is the reason you\'re watching it-the sex. The reason people watch these things is for hot sex scenes featuring really hot girls in Red Shoe Diary situations. The sex scenes aren\'t hot they\'re sleazy, shot in that porno style where everything is just a master shot of two people going at it. The woman also look like they are refuges from a porn shoot. I\'m not trying to be rude or mean here, but they all have that breast implants and a burned out/weathered look. Even the title, "Deviant Obsession", sounds like a Hardcore flick. Not that I don\'t have anything against porn - in fact I love it. But I want my soft-core and my hard-core separate. What ever happened to actresses like Shannon Tweed, Jacqueline Lovell, Shannon Whirry and Kim Dawson? Women that could act and who would totally arouse you? And what happened to B erotic thrillers like Body Chemistry, Nighteyes and even Stripped to Kill. Sure, none of these where masterpieces, but at least they felt like movies. Plus, they were pushing the envelope, going beyond Hollywood\'s relatively prude stance on sex, sexual obsessions and perversions. 
Now they just make hard-core films without the hard-core sex.' 1

Here’s a review labeled as positive (1), but it should be negative (0). Some noteworthy snippets extracted from the review text:

  • “…hard to imagine a boring shark movie…”

  • “…Poor focus in some scenes made the production seems amateurish.”

  • “…do nothing to take advantage of…”

  • “…far too few scenes of any depth or variety.”

  • “…just look flat…no contrast of depth…”

  • “…introspective and dull…constant disappointment.”

[19]:
print_as_df(30151)
[19]:
texts labels
30151 b'Like the gentle giants that make up the latter half of this film\'s title, Michael Oblowitz\'s latest production has grace, but it\'s also slow and ponderous. The producer\'s last outing, "Mosquitoman-3D" had the same problem. It\'s hard to imagine a boring shark movie, but they somehow managed it. The only draw for Hammerhead: Shark Frenzy was it\'s passable animatronix, which is always fun when dealing with wondrous worlds beneath the ocean\'s surface. But even that was only passable. Poor focus in some scenes made the production seems amateurish. With Dolphins and Whales, the technology is all but wasted. Cloudy scenes and too many close-ups of the film\'s giant subjects do nothing to take advantage of IMAX\'s stunning 3D capabilities. There are far too few scenes of any depth or variety. Close-ups of these awesome creatures just look flat and there is often only one creature in the cameras field, so there is no contrast of depth. Michael Oblowitz is trying to follow in his father\'s footsteps, but when you\'ve got Shark-Week on cable, his introspective and dull treatment of his subjects is a constant disappointment.' 1

cleanlab has shortlisted the most likely label errors to speed up your data cleaning process. With this list, you can decide whether to fix these label issues or remove ambiguous examples from the dataset.
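
Both options amount to a few lines of array indexing. The sketch below uses made-up labels and a hypothetical list of reviewed indices; in practice the indices would come from your manual review of the ranked label issues.

```python
import numpy as np

toy_labels = np.array([1, 0, 1, 1, 0])

# Option 1: remove examples judged too ambiguous to keep.
indices_to_remove = np.array([3, 0])  # hypothetical result of a manual review
keep_mask = np.ones(len(toy_labels), dtype=bool)
keep_mask[indices_to_remove] = False
pruned_labels = toy_labels[keep_mask]  # apply the same mask to the texts

# Option 2: overwrite labels that the review confirmed to be wrong.
corrections = {3: 0}  # hypothetical mapping: example index -> corrected label
corrected_labels = toy_labels.copy()
for index, new_label in corrections.items():
    corrected_labels[index] = new_label

print(pruned_labels)
print(corrected_labels)
```

Working on a copy keeps the original labels available for comparison.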

5. Train a more robust model from noisy labels

Fixing label issues manually can be time-consuming, but cleanlab can automatically filter out these noisy examples and train a model on the remaining clean data for you. To demonstrate this, we first reload the dataset, this time with separate train and test splits.

[20]:
raw_train_ds = tfds.load(name="imdb_reviews", split="train", batch_size=-1, as_supervised=True)
raw_test_ds = tfds.load(name="imdb_reviews", split="test", batch_size=-1, as_supervised=True)

raw_train_texts, train_labels = tfds.as_numpy(raw_train_ds)
raw_test_texts, test_labels = tfds.as_numpy(raw_test_ds)

We featurize the raw text using the same vectorize_layer as before, but first, reset its state and adapt it only on the train set (as is proper ML practice). We finally convert the vectorized text data in the train/test sets into numpy arrays.

[21]:
vectorize_layer.reset_state()
vectorize_layer.adapt(raw_train_texts)

train_texts = vectorize_layer(raw_train_texts)
test_texts = vectorize_layer(raw_test_texts)

train_texts = train_texts.numpy()
test_texts = test_texts.numpy()

Let’s now train and evaluate our original neural network model.

[22]:
model = KerasClassifier(get_net(), epochs=10)
model.fit(train_texts, train_labels)

preds = model.predict(test_texts)
acc_og = accuracy_score(test_labels, preds)
print(f"\n Test accuracy of original neural net: {acc_og}")
Epoch 1/10
782/782 [==============================] - 3s 3ms/step - loss: 0.6233 - categorical_accuracy: 0.4679
Epoch 2/10
782/782 [==============================] - 2s 3ms/step - loss: 0.4392 - categorical_accuracy: 0.4874
Epoch 3/10
782/782 [==============================] - 2s 3ms/step - loss: 0.3453 - categorical_accuracy: 0.4928
Epoch 4/10
782/782 [==============================] - 2s 3ms/step - loss: 0.2984 - categorical_accuracy: 0.4941
Epoch 5/10
782/782 [==============================] - 2s 3ms/step - loss: 0.2669 - categorical_accuracy: 0.4940
Epoch 6/10
782/782 [==============================] - 2s 3ms/step - loss: 0.2440 - categorical_accuracy: 0.4956
Epoch 7/10
782/782 [==============================] - 2s 3ms/step - loss: 0.2256 - categorical_accuracy: 0.4938
Epoch 8/10
782/782 [==============================] - 2s 3ms/step - loss: 0.2092 - categorical_accuracy: 0.4962
Epoch 9/10
782/782 [==============================] - 2s 3ms/step - loss: 0.1968 - categorical_accuracy: 0.4963
Epoch 10/10
782/782 [==============================] - 2s 3ms/step - loss: 0.1833 - categorical_accuracy: 0.4968
782/782 [==============================] - 1s 893us/step

 Test accuracy of original neural net: 0.8738

cleanlab provides a wrapper class that can easily be applied to any scikit-learn compatible model. Once wrapped, the resulting model can still be used in the exact same manner, but it will now train more robustly if the data have noisy labels.

[23]:
from cleanlab.classification import CleanLearning

model = KerasClassifier(get_net(), epochs=10)  # Note we first re-instantiate the model
cl = CleanLearning(clf=model, seed=SEED)  # cl has same methods/attributes as model

When we train the cleanlab-wrapped model, the following operations take place: The original model is trained in a cross-validated fashion to produce out-of-sample predicted probabilities. Then, these predicted probabilities are used to identify label issues, which are then removed from the dataset. Finally, the original model is trained once more on the remaining clean subset of the data.

[24]:
_ = cl.fit(train_texts, train_labels)
INFO:tensorflow:Assets written to: ram:///tmp/tmpk9louz4g/assets
Epoch 1/10
625/625 [==============================] - 2s 3ms/step - loss: 0.6444 - categorical_accuracy: 0.4897
Epoch 2/10
625/625 [==============================] - 2s 3ms/step - loss: 0.4863 - categorical_accuracy: 0.4859
Epoch 3/10
625/625 [==============================] - 2s 3ms/step - loss: 0.3807 - categorical_accuracy: 0.4920
Epoch 4/10
625/625 [==============================] - 2s 3ms/step - loss: 0.3243 - categorical_accuracy: 0.4929
Epoch 5/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2878 - categorical_accuracy: 0.4927
Epoch 6/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2611 - categorical_accuracy: 0.4960
Epoch 7/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2391 - categorical_accuracy: 0.4945
Epoch 8/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2226 - categorical_accuracy: 0.4961
Epoch 9/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2062 - categorical_accuracy: 0.4965
Epoch 10/10
625/625 [==============================] - 2s 3ms/step - loss: 0.1920 - categorical_accuracy: 0.4963
157/157 [==============================] - 0s 934us/step
INFO:tensorflow:Assets written to: ram:///tmp/tmp1_0xjlr_/assets
Epoch 1/10
625/625 [==============================] - 2s 3ms/step - loss: 0.6442 - categorical_accuracy: 0.4123
Epoch 2/10
625/625 [==============================] - 2s 3ms/step - loss: 0.4836 - categorical_accuracy: 0.4827
Epoch 3/10
625/625 [==============================] - 2s 3ms/step - loss: 0.3773 - categorical_accuracy: 0.4904
Epoch 4/10
625/625 [==============================] - 2s 3ms/step - loss: 0.3216 - categorical_accuracy: 0.4933
Epoch 5/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2846 - categorical_accuracy: 0.4938
Epoch 6/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2583 - categorical_accuracy: 0.4938
Epoch 7/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2368 - categorical_accuracy: 0.4956
Epoch 8/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2187 - categorical_accuracy: 0.4956
Epoch 9/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2043 - categorical_accuracy: 0.4965
Epoch 10/10
625/625 [==============================] - 2s 3ms/step - loss: 0.1917 - categorical_accuracy: 0.4953
157/157 [==============================] - 0s 1ms/step
INFO:tensorflow:Assets written to: ram:///tmp/tmpy1uu8i75/assets
Epoch 1/10
625/625 [==============================] - 2s 3ms/step - loss: 0.6448 - categorical_accuracy: 0.4358
Epoch 2/10
625/625 [==============================] - 2s 3ms/step - loss: 0.4857 - categorical_accuracy: 0.4897
Epoch 3/10
625/625 [==============================] - 2s 3ms/step - loss: 0.3797 - categorical_accuracy: 0.4917
Epoch 4/10
625/625 [==============================] - 2s 3ms/step - loss: 0.3237 - categorical_accuracy: 0.4940
Epoch 5/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2872 - categorical_accuracy: 0.4953
Epoch 6/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2610 - categorical_accuracy: 0.4967
Epoch 7/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2402 - categorical_accuracy: 0.4956
Epoch 8/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2215 - categorical_accuracy: 0.4970
Epoch 9/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2059 - categorical_accuracy: 0.4972
Epoch 10/10
625/625 [==============================] - 2s 3ms/step - loss: 0.1940 - categorical_accuracy: 0.4966
157/157 [==============================] - 0s 942us/step
INFO:tensorflow:Assets written to: ram:///tmp/tmpxdxujm99/assets
Epoch 1/10
625/625 [==============================] - 2s 3ms/step - loss: 0.6455 - categorical_accuracy: 0.4367
Epoch 2/10
625/625 [==============================] - 2s 3ms/step - loss: 0.4855 - categorical_accuracy: 0.4849
Epoch 3/10
625/625 [==============================] - 2s 3ms/step - loss: 0.3778 - categorical_accuracy: 0.4904
Epoch 4/10
625/625 [==============================] - 2s 3ms/step - loss: 0.3234 - categorical_accuracy: 0.4934
Epoch 5/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2863 - categorical_accuracy: 0.4938
Epoch 6/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2588 - categorical_accuracy: 0.4958
Epoch 7/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2391 - categorical_accuracy: 0.4959
Epoch 8/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2219 - categorical_accuracy: 0.4965
Epoch 9/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2081 - categorical_accuracy: 0.4960
Epoch 10/10
625/625 [==============================] - 2s 3ms/step - loss: 0.1925 - categorical_accuracy: 0.4956
157/157 [==============================] - 0s 941us/step
INFO:tensorflow:Assets written to: ram:///tmp/tmp5kuxn0c6/assets
Epoch 1/10
625/625 [==============================] - 2s 3ms/step - loss: 0.6451 - categorical_accuracy: 0.4056
Epoch 2/10
625/625 [==============================] - 2s 3ms/step - loss: 0.4852 - categorical_accuracy: 0.4873
Epoch 3/10
625/625 [==============================] - 2s 3ms/step - loss: 0.3770 - categorical_accuracy: 0.4908
Epoch 4/10
625/625 [==============================] - 2s 3ms/step - loss: 0.3214 - categorical_accuracy: 0.4929
Epoch 5/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2848 - categorical_accuracy: 0.4969
Epoch 6/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2575 - categorical_accuracy: 0.4963
Epoch 7/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2356 - categorical_accuracy: 0.4956
Epoch 8/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2191 - categorical_accuracy: 0.4970
Epoch 9/10
625/625 [==============================] - 2s 3ms/step - loss: 0.2026 - categorical_accuracy: 0.4976
Epoch 10/10
625/625 [==============================] - 2s 3ms/step - loss: 0.1898 - categorical_accuracy: 0.4952
157/157 [==============================] - 0s 973us/step
Epoch 1/10
744/744 [==============================] - 3s 3ms/step - loss: 0.6462 - categorical_accuracy: 0.4647
Epoch 2/10
744/744 [==============================] - 2s 3ms/step - loss: 0.4202 - categorical_accuracy: 0.4854
Epoch 3/10
744/744 [==============================] - 2s 3ms/step - loss: 0.2948 - categorical_accuracy: 0.4926
Epoch 4/10
744/744 [==============================] - 2s 3ms/step - loss: 0.2294 - categorical_accuracy: 0.4940
Epoch 5/10
744/744 [==============================] - 2s 3ms/step - loss: 0.1864 - categorical_accuracy: 0.4939
Epoch 6/10
744/744 [==============================] - 2s 3ms/step - loss: 0.1568 - categorical_accuracy: 0.4955
Epoch 7/10
744/744 [==============================] - 2s 3ms/step - loss: 0.1333 - categorical_accuracy: 0.4976
Epoch 8/10
744/744 [==============================] - 2s 3ms/step - loss: 0.1150 - categorical_accuracy: 0.4975
Epoch 9/10
744/744 [==============================] - 2s 3ms/step - loss: 0.1003 - categorical_accuracy: 0.4964
Epoch 10/10
744/744 [==============================] - 2s 3ms/step - loss: 0.0868 - categorical_accuracy: 0.4974

We can get predictions from the resulting cleanlab model and evaluate them, just like we did for our original neural network.

[25]:
pred_labels = cl.predict(test_texts)
acc_cl = accuracy_score(test_labels, pred_labels)
print(f"Test accuracy of cleanlab's neural net: {acc_cl}")
782/782 [==============================] - 1s 861us/step
Test accuracy of cleanlab's neural net: 0.87548

We can see that the test set accuracy slightly improved as a result of the data cleaning. Note that this will not always be the case, especially when we are evaluating on test data that are themselves noisy. The best practice is to run cleanlab to identify potential label issues and then manually review them, before blindly trusting any accuracy metrics. In particular, the most effort should be made to ensure high-quality test data, which is supposed to reflect the expected performance of our model during deployment.
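
To put the size of this improvement in perspective, the short calculation below converts the two reported accuracies into a count of test examples and compares the gap with a rough binomial standard error. It assumes the 25,000-example IMDb test split and treats test examples as independent, so it is a back-of-the-envelope check rather than a formal significance test.

```python
import math

n_test = 25_000  # size of the IMDb test split
acc_og = 0.8738  # original neural net
acc_cl = 0.87548  # cleanlab-trained neural net

# Number of additional test examples classified correctly.
extra_correct = round((acc_cl - acc_og) * n_test)

# Approximate standard error of a single accuracy estimate.
standard_error = math.sqrt(acc_og * (1 - acc_og) / n_test)

print(f"Additional correct predictions: {extra_correct}")
print(f"Accuracy gap: {acc_cl - acc_og:.5f}, standard error: {standard_error:.5f}")
```

The gap corresponds to a few dozen test examples and is smaller than one standard error, which is one more reason to review the flagged examples rather than rely on the accuracy numbers alone.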