Detecting Issues in an Audio Dataset with Datalab

In this 5-minute quickstart tutorial, we use cleanlab to find label issues in the Spoken Digit dataset (it’s like MNIST for audio). The dataset contains 2,500 audio clips with English pronunciations of the digits 0 to 9 (these are the class labels to predict from the audio).

Overview of what we’ll do in this tutorial:

Extract features from audio clips (.wav files) using a pre-trained PyTorch model from HuggingFace that was previously fit to the VoxCeleb speech dataset.
Train a cross-validated linear model using the extracted features and generate out-of-sample predicted probabilities.
Apply cleanlab’s Datalab audit to these predictions in order to identify which audio clips in the dataset are likely mislabeled.

Quickstart

Already have a model? Run cross-validation to get out-of-sample pred_probs, and then run the code below to audit your dataset and identify any potential issues.

from cleanlab import Datalab

lab = Datalab(data=your_dataset, label_name="column_name_of_labels")
lab.find_issues(pred_probs=your_pred_probs, issue_types={"label":{}})

lab.get_issues("label")

1. Install dependencies and import them

You can use pip to install all packages required for this tutorial as follows:

!pip install huggingface_hub==0.17.0 speechbrain==0.5.13
!pip install "cleanlab[datalab]"
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
#     !pip install git+https://github.com/cleanlab/cleanlab.git

Let’s import some of the packages needed throughout this tutorial.

[2]:

import os
import pandas as pd
import numpy as np
import random
import torch
import torchaudio
from pathlib import Path  # used below to parse labels from file names

from cleanlab import Datalab

SEED = 456  # ensure reproducibility

2. Load the data

We must first fetch the dataset. To run the command below, you’ll need to have wget installed; alternatively, you can open the link in your browser and download the archive manually.

[4]:

%%capture

!wget https://github.com/Jakobovski/free-spoken-digit-dataset/archive/v1.0.9.tar.gz
!mkdir spoken_digits
!tar -xf v1.0.9.tar.gz -C spoken_digits

The audio data are .wav files in the recordings/ folder. Note that the label for each audio clip (i.e. digit from 0 to 9) is indicated in the prefix of the file name (e.g. 6_nicolas_32.wav has the label 6). If instead applying cleanlab to your own dataset, its classes should be represented as integer indices 0, 1, …, num_classes - 1.
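If your own labels are class names rather than integers, one way to obtain the integer indices described above is to sort the distinct names and enumerate them. The following is a minimal sketch under that assumption; `label_from_file_name` and `encode_labels` are hypothetical helper names introduced for illustration and are not part of cleanlab.

```python
from pathlib import Path


def label_from_file_name(wav_path: str) -> int:
    """Parse the digit label from a file name such as '6_nicolas_32.wav'."""
    return int(Path(wav_path).name.split("_")[0])


def encode_labels(raw_labels):
    """Map arbitrary class names to integer indices 0, 1, ..., num_classes - 1.

    Class names are sorted first so that the mapping is deterministic.
    """
    classes = sorted(set(raw_labels))
    class_to_index = {name: idx for idx, name in enumerate(classes)}
    return [class_to_index[name] for name in raw_labels], class_to_index


print(label_from_file_name("recordings/6_nicolas_32.wav"))  # 6

encoded, mapping = encode_labels(["dog", "cat", "bird", "cat"])
print(encoded)  # [2, 1, 0, 1]
print(mapping)  # {'bird': 0, 'cat': 1, 'dog': 2}
```

Keeping the returned mapping around makes it possible to translate integer predictions back to the original class names later.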

[5]:

DATA_PATH = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/"

# Get list of .wav file names
# os.listdir order is nondeterministic, so for reproducibility,
# we sort first and then do a deterministic shuffle
file_names = sorted(i for i in os.listdir(DATA_PATH) if i.endswith(".wav"))
random.Random(SEED).shuffle(file_names)

file_paths = [os.path.join(DATA_PATH, name) for name in file_names]

# Check out first 3 files
file_paths[:3]

[5]:

['spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_george_26.wav',
 'spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_24.wav',
 'spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_6.wav']
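The sort-then-shuffle pattern used above can be checked in isolation. The sketch below uses only the standard library; `reproducible_order` is a hypothetical helper and the short file lists are stand-ins for a real directory listing. It shows that two listings of the same files in different orders end up in the same shuffled order.

```python
import random

SEED = 456


def reproducible_order(names, seed=SEED):
    """Sort first, then shuffle with a dedicated seeded generator."""
    ordered = sorted(names)
    random.Random(seed).shuffle(ordered)
    return ordered


# Two listings of the same files, as the file system might return them
listing_a = ["7_george_26.wav", "0_nicolas_24.wav", "0_nicolas_6.wav", "6_theo_27.wav"]
listing_b = list(reversed(listing_a))

# The shuffled result does not depend on the original listing order
print(reproducible_order(listing_a) == reproducible_order(listing_b))  # True
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions also means the result is unaffected by other code that draws from the global random state.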

Let’s listen to some example audio clips from the dataset. We introduce a display_example function that processes a .wav file so we can listen to it in this notebook (you can skip these details).

Implementation of display_example:

# Note: This pulldown content is for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.

import torch
import torchaudio
from pathlib import Path
from IPython import display

# Utility function for loading audio files and making sure the sample rate is correct.
def load_wav_16k_mono(filename):
    """Load a WAV file, convert it to a float tensor, resample to 16 kHz single-channel audio."""
    # Load audio file with torchaudio
    waveform, sample_rate = torchaudio.load(filename)

    # Convert to mono if stereo
    if waveform.shape[0] > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)

    # Resample to 16kHz if needed
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        waveform = resampler(waveform)

    return waveform.squeeze()


def display_example(wav_file_name, audio_rate=16000):
    """Allows us to listen to any wav file and displays its given label in the dataset."""
    wav_file_example = load_wav_16k_mono(wav_file_name)
    label = Path(wav_file_name).parts[-1].split("_")[0]
    print(f"Given label for this example: {label}")
    display.display(display.Audio(wav_file_example.numpy(), rate=audio_rate))

Click the play button below to listen to this example .wav file. Feel free to change the wav_file_name_example variable below to listen to other audio clips in the dataset.

[7]:

wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_jackson_43.wav"  # change this to hear other examples
display_example(wav_file_name_example)

Given label for this example: 7

3. Use pre-trained SpeechBrain model to featurize audio

The SpeechBrain package offers many PyTorch neural networks that have been pre-trained for speech recognition tasks. Here we instantiate an audio feature extractor using SpeechBrain’s EncoderClassifier. We’ll use the “spkrec-xvect-voxceleb” network, which has been pre-trained on the VoxCeleb speech dataset.

[8]:

%%capture

from speechbrain.pretrained import EncoderClassifier

feature_extractor = EncoderClassifier.from_hparams(
  "speechbrain/spkrec-xvect-voxceleb",
  # run_opts={"device":"cuda"}  # Uncomment this to run on GPU if you have one (optional)
)

Next, we run the audio clips through the pre-trained model to extract vector features (aka embeddings).

[9]:

# Create dataframe with .wav file names
df = pd.DataFrame(file_paths, columns=["wav_audio_file_path"])
df["label"] = df.wav_audio_file_path.map(lambda x: int(Path(x).parts[-1].split("_")[0]))
df.head(3)

[9]:

	wav_audio_file_path	label
0	spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_george_26.wav	7
1	spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_24.wav	0
2	spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_6.wav	0

[10]:

def extract_audio_embeddings(model, wav_audio_file_path: str) -> torch.Tensor:
    """Feature extractor that embeds audio into a vector."""
    # Load audio signal as tensor (the sample rate is not needed here)
    signal, _ = torchaudio.load(wav_audio_file_path)
    # Pass tensor through pretrained neural net and extract representation
    embeddings = model.encode_batch(signal)
    return embeddings

[11]:

# Extract audio embeddings
embeddings_list = []
for file_name in df.wav_audio_file_path:  # for each .wav file name
    embeddings = extract_audio_embeddings(feature_extractor, file_name)
    embeddings_list.append(embeddings.cpu().numpy())

embeddings_array = np.squeeze(np.array(embeddings_list))

Now we have our features in a 2D numpy array. Each row in the array corresponds to an audio clip. We’re now able to represent each audio clip as a 512-dimensional feature vector!

[12]:

print(embeddings_array)
print("Shape of array: ", embeddings_array.shape)

[[-14.196311     7.319459    12.478975   ...   2.2890875    2.8170238
  -10.89265   ]
 [-24.898056     5.256195    12.559641   ...  -3.559721     9.62067
  -10.285245  ]
 [-21.709627     7.5033693    7.913803   ...  -6.819831     3.1831515
  -17.208763  ]
 ...
 [-16.084257     6.3210397   12.005453   ...   1.216152     9.478235
  -10.6821785 ]
 [-15.053807     5.242471     1.091424   ...  -0.78334856   9.03954
  -23.569176  ]
 [-19.761097     1.1258295   16.753237   ...   3.3508866   11.598274
  -16.23712   ]]
Shape of array:  (2500, 512)
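To see why np.squeeze yields a 2D array here, it helps to trace the shapes with placeholder data. The sketch below assumes each per-clip embedding has shape (1, 1, 512), which is consistent with the final (2500, 512) result but is an assumption about the model output; the zero arrays are stand-ins, not real embeddings.

```python
import numpy as np

num_clips, embedding_dim = 4, 512

# Stand-ins for per-clip model outputs, each assumed to have shape (1, 1, 512)
embeddings_list = [
    np.zeros((1, 1, embedding_dim), dtype=np.float32) for _ in range(num_clips)
]

stacked = np.array(embeddings_list)
print(stacked.shape)  # (4, 1, 1, 512)

# np.squeeze removes every axis of length 1
squeezed = np.squeeze(stacked)
print(squeezed.shape)  # (4, 512)

# An explicit reshape gives the same result and also keeps the row axis
# when there is only a single clip, where np.squeeze would drop it
explicit = stacked.reshape(len(embeddings_list), -1)
print(explicit.shape)  # (4, 512)
print(np.squeeze(stacked[:1]).shape)  # (512,)
```

For a dataset with many clips both approaches agree; the explicit reshape is simply more robust if the same code is reused on a single file.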

4. Fit linear model and compute out-of-sample predicted probabilities

A typical way to leverage pretrained networks for a particular classification task is to add a linear output layer and fine-tune the network parameters on the new data. However, this can be computationally intensive. Alternatively, we can freeze the pretrained weights of the network and train only the output layer, which does not require GPU(s). Here we do this conveniently by fitting a scikit-learn linear model on top of the extracted network embeddings.

To identify label issues, cleanlab requires a probabilistic prediction from your model for every datapoint that should be considered. However, these predictions will be overfit (and thus unreliable) for datapoints the model was previously trained on. cleanlab is intended to be used only with out-of-sample predicted probabilities, i.e. predictions for datapoints that were held out from the model during training.

K-fold cross-validation is a straightforward way to produce out-of-sample predicted probabilities for every datapoint in the dataset: we train K copies of our model on different data subsets and use each copy to predict on the subset of data it did not see during training. We can obtain cross-validated out-of-sample predicted probabilities from any scikit-learn-compatible classifier via the cross_val_predict wrapper provided in scikit-learn. Make sure that the columns of your pred_probs are properly ordered with respect to the ordering of classes, which for Datalab is: lexicographically sorted by class name.

[13]:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

model = LogisticRegression(C=0.01, max_iter=1000, tol=1e-2, random_state=SEED)

num_crossval_folds = 5  # can decrease this value to reduce runtime, or increase it to get better results
pred_probs = cross_val_predict(
    estimator=model, X=embeddings_array, y=df.label.values, cv=num_crossval_folds, method="predict_proba"
)

For each audio clip, the corresponding predicted probabilities in pred_probs are produced by a copy of our LogisticRegression model that has never been trained on this audio clip. Hence we call these predictions out-of-sample. An additional benefit of cross-validation is that it provides more reliable evaluation of our model than a single training/validation split.

[14]:

from sklearn.metrics import accuracy_score

predicted_labels = pred_probs.argmax(axis=1)
cv_accuracy = accuracy_score(df.label.values, predicted_labels)
print(f"Cross-validated estimate of accuracy on held-out data: {cv_accuracy}")

Cross-validated estimate of accuracy on held-out data: 0.9708
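To make the out-of-sample idea from this section concrete, here is a hand-written K-fold loop on synthetic data. It is only an illustrative sketch: the nearest-centroid "model", the helper names, and the toy data are all assumptions introduced here, and the loop is not how cross_val_predict is implemented internally. The point it demonstrates is that each row of pred_probs is filled in by a model that was fit without that row.

```python
import numpy as np


def fit_centroids(X, y, num_classes):
    """Toy 'model': one mean feature vector per class."""
    return np.stack([X[y == k].mean(axis=0) for k in range(num_classes)])


def predict_proba(centroids, X):
    """Softmax over negative squared distances to each class centroid."""
    sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -sq_dist
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def out_of_sample_pred_probs(X, y, num_classes, num_folds=5, seed=0):
    """Fill each row using a model fit only on the other folds."""
    order = np.random.default_rng(seed).permutation(len(X))
    pred_probs = np.full((len(X), num_classes), np.nan)
    for held_out in np.array_split(order, num_folds):
        train = np.setdiff1d(order, held_out)  # indices not in this fold
        centroids = fit_centroids(X[train], y[train], num_classes)
        pred_probs[held_out] = predict_proba(centroids, X[held_out])
    return pred_probs


# Synthetic data: 3 well-separated classes with 40 points each
rng = np.random.default_rng(0)
num_classes = 3
y = np.repeat(np.arange(num_classes), 40)
X = rng.normal(size=(len(y), 2)) + 5.0 * y[:, None]

pred_probs = out_of_sample_pred_probs(X, y, num_classes)
print(pred_probs.shape)  # (120, 3)
print("held-out accuracy:", (pred_probs.argmax(axis=1) == y).mean())
```

In practice cross_val_predict is preferable because it handles stratification and works with any scikit-learn estimator; the loop above is only meant to show where the "never trained on this datapoint" guarantee comes from.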

5. Use cleanlab to find label issues

Based on the given labels, out-of-sample predicted probabilities and features, cleanlab can quickly help us identify label issues in our dataset. For a dataset with N examples from K classes, the labels should be a 1D array of length N and predicted probabilities should be a 2D (N x K) array.
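Before running an audit on your own data, it can be worth checking these shape requirements explicitly. The helper below is a hypothetical sanity check written for this tutorial, not a cleanlab function; it verifies only the conditions stated above plus the integer class encoding and row normalization.

```python
import numpy as np


def check_label_inputs(labels, pred_probs):
    """Validate labels (length N) against pred_probs (N x K); return (N, K)."""
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)

    if labels.ndim != 1:
        raise ValueError("labels must be a 1D array")
    if pred_probs.ndim != 2:
        raise ValueError("pred_probs must be a 2D array")
    if pred_probs.shape[0] != labels.shape[0]:
        raise ValueError("labels and pred_probs must have the same number of rows")
    if not np.issubdtype(labels.dtype, np.integer):
        raise ValueError("labels must be integer class indices")

    num_classes = pred_probs.shape[1]
    if labels.min() < 0 or labels.max() >= num_classes:
        raise ValueError("labels must lie in 0, 1, ..., num_classes - 1")
    if not np.allclose(pred_probs.sum(axis=1), 1.0, atol=1e-6):
        raise ValueError("each row of pred_probs must sum to 1")

    return labels.shape[0], num_classes


example_labels = [0, 1, 1]
example_pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
print(check_label_inputs(example_labels, example_pred_probs))  # (3, 2)
```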

Here, we use cleanlab to find potential label errors in our data. Datalab has several ways of loading the data. In this case, we can just pass the DataFrame created above to instantiate the object. We will then pass in the predicted probabilities to the find_issues() method so that Datalab can use them to find potential label errors in our data.

[15]:

lab = Datalab(df, label_name="label")
lab.find_issues(pred_probs=pred_probs, issue_types={"label":{}})

Finding label issues ...

Audit complete. 7 issues found in the dataset.

We can view the results of running Datalab by calling the report method:

[16]:

lab.report()

Dataset Information: num_examples: 2500, num_classes: 10

Here is a summary of various issues found in your data:

issue_type  num_issues
     label           7

Learn about each issue: https://docs.cleanlab.ai/stable/cleanlab/datalab/guide/issue_type_description.html
See which examples in your dataset exhibit each issue via: `datalab.get_issues(<ISSUE_NAME>)`

Data indices corresponding to top examples of each issue are shown below.


----------------------- label issues -----------------------

About this issue:
        Examples whose given label is estimated to be potentially incorrect
    (e.g. due to annotation error) are flagged as having label issues.


Number of examples with this issue: 7
Overall dataset quality in terms of this issue: 0.9976

Examples representing most severe instances of this issue:
      is_label_issue  label_score  given_label  predicted_label
986             True     0.002161            6                3
176             True     0.002483            7                8
2318           False     0.004411            3                6
1005           False     0.004857            0                9
1871            True     0.007494            6                8

We observe from the report that cleanlab has found some label issues in our dataset. Let us investigate these examples further.

We can view more details about the label quality of each example using the get_issues method, specifying label as the issue type.

[17]:

label_issues = lab.get_issues("label")
label_issues.head()

[17]:

	is_label_issue	label_score	given_label	predicted_label
0	False	0.040587	7	6
1	False	0.999207	0	0
2	False	0.999377	0	0
3	False	0.975220	8	8
4	False	0.999367	5	5

This method returns a dataframe containing a label quality score for each example. These numeric scores lie between 0 and 1, where lower scores indicate examples more likely to be mislabeled. The dataframe also contains a boolean column specifying whether or not each example is identified to have a label issue (indicating it is likely mislabeled).
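One simple score with these properties is self-confidence: the out-of-sample probability that the model assigns to the given label. The sketch below illustrates that idea on toy values; `self_confidence_scores` is a hypothetical helper, and whether Datalab computes its label_score in exactly this way is an assumption that may depend on its configuration.

```python
import numpy as np


def self_confidence_scores(labels, pred_probs):
    """Return, for each example, the predicted probability of its given label."""
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    return pred_probs[np.arange(len(labels)), labels]


toy_labels = np.array([0, 1, 1])
toy_pred_probs = np.array(
    [
        [0.90, 0.10],  # given label 0, model agrees
        [0.20, 0.80],  # given label 1, model agrees
        [0.95, 0.05],  # given label 1, model strongly prefers class 0
    ]
)

scores = self_confidence_scores(toy_labels, toy_pred_probs)
print(scores)  # [0.9  0.8  0.05]

# Sort ascending so the most suspicious examples come first
ranking = np.argsort(scores)
print(ranking)  # [2 1 0]
```

The third toy example receives the lowest score because the model assigns almost no probability to its given label, which is the pattern one would expect for a mislabeled clip.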

We can then filter for the examples that have been identified as label errors:

[18]:

identified_label_issues = label_issues[label_issues["is_label_issue"]]  # boolean column used directly as a mask
lowest_quality_labels = identified_label_issues.sort_values("label_score").index

print(f"Here are indices of the most likely errors: \n {lowest_quality_labels.values}")

Here are indices of the most likely errors:
 [ 986  176 1871  516 1946  469 2132]

These examples flagged by cleanlab are those worth inspecting more closely.

[19]:

df.iloc[lowest_quality_labels]

[19]:

	wav_audio_file_path	label
986	spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_25.wav	6
176	spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_nicolas_43.wav	7
1871	spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_theo_27.wav	6
516	spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_36.wav	6
1946	spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_14.wav	6
469	spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_35.wav	6
2132	spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_nicolas_8.wav	6

Let’s listen to some audio clips below of label issues that were identified in this list.

In this example, the given label is 6 but it sounds like 8.

[20]:

wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_14.wav"
display_example(wav_file_name_example)

Given label for this example: 6

In the three examples below, the given label is 6 but they sound quite ambiguous.

[21]:

wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_36.wav"
display_example(wav_file_name_example)

Given label for this example: 6

[22]:

wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_35.wav"
display_example(wav_file_name_example)

Given label for this example: 6

[23]:

wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_nicolas_8.wav"
display_example(wav_file_name_example)

Given label for this example: 6

You can see that even widely-used datasets like Spoken Digit contain problematic labels. Never blindly trust your data! You should always check it for potential issues, many of which can be easily identified by cleanlab.