Audio Classification with SpeechBrain and Cleanlab

In this 5-minute quickstart tutorial, we use cleanlab to find label issues in the Spoken Digit dataset (it’s like MNIST for audio). The dataset contains 2,500 audio clips with English pronunciations of the digits 0 to 9 (these are the class labels to predict from the audio).

Overview of what we’ll do in this tutorial:

  • Extract features from audio clips (.wav files) using a pre-trained PyTorch model from Hugging Face that was previously fit to the VoxCeleb speech dataset.

  • Train a cross-validated linear model using the extracted features and generate out-of-sample predicted probabilities.

  • Use cleanlab to identify a list of audio clips with potential label errors.


Already have a model? Run cross-validation to get out-of-sample pred_probs and then the code below to get label issue indices ranked by their inferred severity.

from cleanlab.filter import find_label_issues

ranked_label_issues = find_label_issues(
    labels, pred_probs, return_indices_ranked_by="self_confidence"
)

1. Install dependencies and import them

You can use pip to install all packages required for this tutorial as follows:

!pip install speechbrain tensorflow scikit-learn tensorflow_io
!pip install cleanlab
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
#     !pip install git+

Let’s import some of the packages needed throughout this tutorial.

import os
import pandas as pd
import numpy as np
import random
import tensorflow as tf
import torch

SEED = 456  # ensure reproducibility

2. Load the data

We must first fetch the dataset. To run the commands below, you’ll need to have wget installed; alternatively, you can manually navigate to the link in your browser and download the archive from there.


!mkdir spoken_digits
!tar -xf v1.0.9.tar.gz -C spoken_digits

The audio data are .wav files in the recordings/ folder. Note that the label for each audio clip (i.e. digit from 0 to 9) is indicated in the prefix of the file name (e.g. 6_nicolas_32.wav has the label 6). If instead applying cleanlab to your own dataset, its classes should be represented as integer indices 0, 1, …, num_classes - 1.
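
If your own dataset stores class names as strings rather than integers, they need to be mapped to indices first. The following is a minimal sketch of one way to do that with NumPy. The class names and the variables `raw_labels`, `class_names`, and `labels` are hypothetical and are not part of this tutorial's code.

```python
import numpy as np

# Hypothetical string labels for a small dataset with 3 classes
raw_labels = np.array(["dog", "cat", "cat", "bird", "dog"])

# np.unique returns the sorted unique class names and, with return_inverse=True,
# the position of each example's class within that sorted array
class_names, labels = np.unique(raw_labels, return_inverse=True)

print(class_names)  # ['bird' 'cat' 'dog']
print(labels)       # [2 1 1 0 2]

# Sanity check: the integer labels cover exactly 0, 1, ..., num_classes - 1
num_classes = len(class_names)
assert set(labels.tolist()) == set(range(num_classes))
```

Keeping `class_names` around lets you translate an index back to its original name later, e.g. `class_names[labels[0]]`.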

DATA_PATH = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/"

# Get list of .wav file names
# os.listdir order is nondeterministic, so for reproducibility,
# we sort first and then do a deterministic shuffle
file_names = sorted(i for i in os.listdir(DATA_PATH) if i.endswith(".wav"))
random.Random(SEED).shuffle(file_names)

file_paths = [os.path.join(DATA_PATH, name) for name in file_names]

# Check out first 3 files
print(file_paths[:3])
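
The sort-then-shuffle idea mentioned in the comments above can be illustrated in isolation. This is only a sketch using the standard library: the helper name `deterministic_shuffle` is hypothetical, and the short file lists stand in for a real directory listing.

```python
import random

SEED = 456  # plays the same role as the tutorial's SEED

def deterministic_shuffle(names, seed):
    """Return a shuffled copy of `names` that does not depend on the input order."""
    ordered = sorted(names)               # remove dependence on directory listing order
    random.Random(seed).shuffle(ordered)  # private RNG, unaffected by global random state
    return ordered

# Two listings of the same example files, returned in different orders
listing_a = ["3_theo_1.wav", "0_george_2.wav", "7_jackson_5.wav", "6_nicolas_8.wav"]
listing_b = list(reversed(listing_a))

shuffled_a = deterministic_shuffle(listing_a, SEED)
shuffled_b = deterministic_shuffle(listing_b, SEED)
print(shuffled_a == shuffled_b)  # True
```

Sorting first matters because shuffling with a fixed seed is only reproducible when the input order is itself fixed.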

Let’s listen to some example audio clips from the dataset. We introduce a display_example function to process the .wav file so we can listen to it in this notebook (can skip these details).

Implementation of display_example:


import tensorflow_io as tfio
from pathlib import Path
from IPython import display

# Utility function for loading audio files and making sure the sample rate is correct.
def load_wav_16k_mono(filename):
    """Load a WAV file, convert it to a float tensor, resample to 16 kHz single-channel audio."""
    file_contents = tf.io.read_file(filename)
    wav, sample_rate = tf.audio.decode_wav(file_contents, desired_channels=1)
    wav = tf.squeeze(wav, axis=-1)
    sample_rate = tf.cast(sample_rate, dtype=tf.int64)
    wav = tfio.audio.resample(wav, rate_in=sample_rate, rate_out=16000)
    return wav

def display_example(wav_file_name, audio_rate=16000):
    """Allows us to listen to any wav file and displays its given label in the dataset."""
    wav_file_example = load_wav_16k_mono(wav_file_name)
    label = Path(wav_file_name).parts[-1].split("_")[0]
    print(f"Given label for this example: {label}")
    display.display(display.Audio(wav_file_example, rate=audio_rate))

Click the play button below to listen to this example .wav file. Feel free to change the wav_file_name_example variable below to listen to other audio clips in the dataset.

wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_jackson_43.wav"  # change this to hear other examples
display_example(wav_file_name_example)
Given label for this example: 7

3. Use pre-trained SpeechBrain model to featurize audio

The SpeechBrain package offers many PyTorch neural networks that have been pretrained for speech recognition tasks. Here we instantiate an audio feature extractor using SpeechBrain’s EncoderClassifier. We’ll use the “spkrec-xvect-voxceleb” network, which has been pre-trained on the VoxCeleb speech dataset.


from speechbrain.pretrained import EncoderClassifier

feature_extractor = EncoderClassifier.from_hparams(
    "speechbrain/spkrec-xvect-voxceleb",
    # run_opts={"device":"cuda"}  # Uncomment this to run on GPU if you have one (optional)
)

Next, we run the audio clips through the pre-trained model to extract vector features (aka embeddings).

# Create dataframe with .wav file names
df = pd.DataFrame(file_paths, columns=["wav_audio_file_path"])
# Parse the label from the file name prefix, e.g. "6_nicolas_32.wav" -> 6
df["label"] = df.wav_audio_file_path.map(lambda x: int(Path(x).parts[-1].split("_")[0]))
# Note: Classes must be represented as integer indices 0, 1, ..., num_classes - 1
# E.g. for a dataset with 7 examples from 3 classes, labels might be: np.array([2,0,0,1,2,0,1])
df.head(3)
   wav_audio_file_path                                                        label
0  spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_george_26.wav   7
1  spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_24.wav  0
2  spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_6.wav   0

import torchaudio

def extract_audio_embeddings(model, wav_audio_file_path: str) -> torch.Tensor:
    """Feature extractor that embeds audio into a vector."""
    signal, fs = torchaudio.load(wav_audio_file_path)  # Reformat audio signal into a tensor
    embeddings = model.encode_batch(
        signal
    )  # Pass tensor through pretrained neural net and extract representation
    return embeddings

# Extract audio embeddings
embeddings_list = []
for i, file_name in enumerate(df.wav_audio_file_path):  # for each .wav file name
    embeddings = extract_audio_embeddings(feature_extractor, file_name)
    embeddings_list.append(embeddings.cpu().numpy())  # move to CPU and convert to a numpy array

embeddings_array = np.squeeze(np.array(embeddings_list))

Now we have our features in a 2D numpy array. Each row in the array corresponds to an audio clip. We’re now able to represent each audio clip as a 512-dimensional feature vector!

print(embeddings_array)
print("Shape of array: ", embeddings_array.shape)
[[-14.196308     7.319454    12.47899    ...   2.289091     2.817013
  -10.892642  ]
 [-24.898056     5.2561927   12.559636   ...  -3.5597174    9.6206665
  -10.285249  ]
 [-21.709625     7.5033684    7.913807   ...  -6.819826     3.1831462
  -17.208761  ]
 [-16.08425      6.321053    12.005463   ...   1.216175     9.478231
  -10.682177  ]
 [-15.053815     5.2424726    1.091422   ...  -0.78335106   9.039538
  -23.569181  ]
 [-19.76109      1.1258249   16.75323    ...   3.3508852   11.598273
  -16.237118  ]]
Shape of array:  (2500, 512)
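
To see why np.squeeze is applied before using the embeddings as features, here is a small sketch with random stand-in data. It assumes each call to the feature extractor yields an array of shape (1, 1, 512) for a single clip; that per-clip shape is an assumption about the model's output, and `fake_embeddings_list` is a hypothetical placeholder rather than real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
num_clips, embedding_dim = 4, 512

# Stand-ins for per-clip model outputs, each assumed to have shape (1, 1, 512)
fake_embeddings_list = [rng.normal(size=(1, 1, embedding_dim)) for _ in range(num_clips)]

# Stacking the list adds a leading "clip" axis
stacked = np.array(fake_embeddings_list)
print(stacked.shape)   # (4, 1, 1, 512)

# Squeezing drops the length-1 axes, leaving one row per clip
features = np.squeeze(stacked)
print(features.shape)  # (4, 512)
```

Note that np.squeeze removes every length-1 axis, so with a single clip the result would be 1D; passing `axis=(1, 2)` restricts the squeeze to the two inner axes.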

4. Fit linear model and compute out-of-sample predicted probabilities

A typical way to leverage pretrained networks for a particular classification task is to add a linear output layer and fine-tune the network parameters on the new data. However, this can be computationally intensive. Alternatively, we can freeze the pretrained weights of the network and train only the output layer, which does not require GPU(s). Here we do this conveniently by fitting a scikit-learn linear model on top of the extracted network embeddings.

To identify label issues, cleanlab requires a probabilistic prediction from your model for every datapoint that should be considered. However, these predictions will be overfit (and thus unreliable) for datapoints the model was previously trained on. cleanlab is intended to be used only with out-of-sample predicted probabilities, i.e. predictions on datapoints held out from the model during training.

K-fold cross-validation is a straightforward way to produce out-of-sample predicted probabilities for every datapoint in the dataset: we train K copies of our model on different data subsets and use each copy to predict on the subset of data it did not see during training. We can obtain cross-validated out-of-sample predicted probabilities from any classifier via the cross_val_predict wrapper provided in scikit-learn.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

model = LogisticRegression(C=0.01, max_iter=1000, tol=1e-1, random_state=SEED)

num_crossval_folds = 5  # can decrease this value to reduce runtime, or increase it to get better results
pred_probs = cross_val_predict(
    estimator=model, X=embeddings_array, y=df.label.values, cv=num_crossval_folds, method="predict_proba"
)

For each audio clip, the corresponding predicted probabilities in pred_probs are produced by a copy of our LogisticRegression model that has never been trained on this audio clip. Hence we call these predictions out-of-sample. An additional benefit of cross-validation is that it provides more reliable evaluation of our model than a single training/validation split.

from sklearn.metrics import accuracy_score

predicted_labels = pred_probs.argmax(axis=1)
cv_accuracy = accuracy_score(df.label.values, predicted_labels)
print(f"Cross-validated estimate of accuracy on held-out data: {cv_accuracy}")
Cross-validated estimate of accuracy on held-out data: 0.9772
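
To make the cross-validation procedure described above more concrete, the sketch below spells out the loop by hand. It is an illustration of the idea, not scikit-learn's actual implementation: synthetic data from make_classification stands in for the audio embeddings, and the names `fold_model`, `train_idx`, and `holdout_idx` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for (embeddings_array, labels): 200 examples, 3 classes
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=10, n_classes=3, random_state=0
)
num_classes = len(np.unique(y))

# Each row is filled exactly once, by the model copy that did not train on it
pred_probs = np.zeros((len(y), num_classes))

skf = StratifiedKFold(n_splits=5)
for train_idx, holdout_idx in skf.split(X, y):
    fold_model = LogisticRegression(max_iter=1000)  # fresh copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    pred_probs[holdout_idx] = fold_model.predict_proba(X[holdout_idx])

print(pred_probs.shape)  # (200, 3)
```

Stratified folds keep every class present in each training subset, so each fold's predict_proba output has one column per class and the rows line up across folds.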

5. Use cleanlab to find label issues

Based on the given labels and out-of-sample predicted probabilities, cleanlab can quickly help us identify label issues in our dataset. For a dataset with N examples from K classes, the labels should be a 1D array of length N and the predicted probabilities should be a 2D (N x K) array. Here we request that the indices of the identified label issues be sorted by cleanlab’s self-confidence score, which measures the quality of each given label via the probability assigned to it in our model’s prediction.

import cleanlab

label_issues_indices = cleanlab.filter.find_label_issues(
    labels=df.label.values,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",  # ranks the label issues
)

print(label_issues_indices)

[1946  469  516 1871 1955 2132]
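
The self-confidence score used for ranking can be computed by hand: for each example, it is the predicted probability of that example's given label. The sketch below uses a tiny made-up `pred_probs` matrix and made-up `labels`. It only illustrates the score and the resulting ordering, and does not reproduce how cleanlab decides which examples to flag in the first place.

```python
import numpy as np

# Made-up predicted probabilities for 4 examples and 3 classes
pred_probs = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
])
labels = np.array([0, 1, 2, 2])  # made-up given labels

# Self-confidence: probability the model assigns to each example's given label
self_confidence = pred_probs[np.arange(len(labels)), labels]
print(self_confidence)  # [0.9 0.7 0.1 0.8]

# Lower self-confidence means a more suspicious label, so sort ascending
ranked = np.argsort(self_confidence)
print(ranked)  # [2 1 3 0]
```

Example 2 ranks first because the model gives its given label (class 2) only 0.1 probability while strongly preferring class 0.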

The examples flagged by cleanlab are those worth inspecting more closely.

df.iloc[label_issues_indices]
      wav_audio_file_path                                                         label
1946  spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_14.wav  6
469   spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_35.wav  6
516   spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_36.wav  6
1871  spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_theo_27.wav      6
1955  spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/4_george_31.wav    4
2132  spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_nicolas_8.wav    6

Let’s listen to some audio clips below of label issues that were identified in this list.

In this example, the given label is 6 but it sounds like 8.

wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_14.wav"
display_example(wav_file_name_example)
Given label for this example: 6

In the three examples below, the given label is 6 but they sound quite ambiguous.

wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_36.wav"
display_example(wav_file_name_example)
Given label for this example: 6

wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_35.wav"
display_example(wav_file_name_example)
Given label for this example: 6

wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_nicolas_8.wav"
display_example(wav_file_name_example)
Given label for this example: 6

You can see that even widely-used datasets like Spoken Digit contain problematic labels. Never blindly trust your data! You should always check it for potential issues, many of which can be easily identified by cleanlab.