Detecting Issues in an Audio Dataset with Datalab#
In this 5-minute quickstart tutorial, we use cleanlab to find label issues in the Spoken Digit dataset (it’s like MNIST for audio). The dataset contains 2,500 audio clips with English pronunciations of the digits 0 to 9 (these are the class labels to predict from the audio).
Overview of what we’ll do in this tutorial:
Extract features from audio clips (.wav files) using a PyTorch model from Hugging Face that was pretrained on the VoxCeleb speech dataset.
Train a cross-validated linear model using the extracted features and generate out-of-sample predicted probabilities.
Apply cleanlab's Datalab audit to these predictions in order to identify which audio clips in the dataset are likely mislabeled.
Quickstart
Already have a model? Run cross-validation to get out-of-sample pred_probs, and then run the code below to audit your dataset and identify any potential issues.
from cleanlab import Datalab
lab = Datalab(data=your_dataset, label_name="column_name_of_labels")
lab.find_issues(pred_probs=your_pred_probs, issue_types={"label":{}})
lab.get_issues("label")
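If you still need those out-of-sample pred_probs, one common route is scikit-learn's cross_val_predict, shown in Section 4 below. A minimal sketch (your_model, your_features, and your_labels are placeholders for your own objects; your_model is assumed to follow the scikit-learn estimator API):

from sklearn.model_selection import cross_val_predict

# Each prediction comes from a model copy that never saw that datapoint during training
your_pred_probs = cross_val_predict(
    your_model, your_features, your_labels, cv=5, method="predict_proba"
)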
1. Install dependencies and import them#
You can use pip to install all packages required for this tutorial as follows:
!pip install tensorflow==2.12.1 tensorflow_io==0.32.0 huggingface_hub==0.17.0 speechbrain==0.5.13
!pip install "cleanlab[datalab]"
# Make sure to install the version corresponding to this tutorial
# E.g. if viewing master branch documentation:
# !pip install git+https://github.com/cleanlab/cleanlab.git
Let’s import some of the packages needed throughout this tutorial.
[2]:
import os
import pandas as pd
import numpy as np
import random
import tensorflow as tf
import torch
from cleanlab import Datalab
SEED = 456 # ensure reproducibility
2. Load the data#
We must first fetch the dataset. To run the command below, you'll need to have wget installed; alternatively, you can manually navigate to the link in your browser and download from there.
[4]:
%%capture
!wget https://github.com/Jakobovski/free-spoken-digit-dataset/archive/v1.0.9.tar.gz
!mkdir spoken_digits
!tar -xf v1.0.9.tar.gz -C spoken_digits
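If you don't have wget, a rough equivalent using only Python's standard library looks like this (a sketch performing the same download and extraction):

import os
import tarfile
import urllib.request

url = "https://github.com/Jakobovski/free-spoken-digit-dataset/archive/v1.0.9.tar.gz"
urllib.request.urlretrieve(url, "v1.0.9.tar.gz")  # download the archive
os.makedirs("spoken_digits", exist_ok=True)
with tarfile.open("v1.0.9.tar.gz") as tar:
    tar.extractall("spoken_digits")  # extract into the spoken_digits/ folder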
The audio data are .wav files in the recordings/ folder. Note that the label for each audio clip (i.e. the digit from 0 to 9) is indicated in the prefix of its file name (e.g. 6_nicolas_32.wav has the label 6). If instead applying cleanlab to your own dataset, its classes should be represented as integer indices 0, 1, …, num_classes - 1.
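For instance, if your labels are strings, a minimal remapping sketch (the labels here are hypothetical placeholders, not from this dataset) could look like:

# Hypothetical string labels, stand-ins for your own dataset's classes
your_labels = ["cat", "dog", "cat", "bird"]
class_names = sorted(set(your_labels))  # lexicographically sorted class names
label_map = {name: idx for idx, name in enumerate(class_names)}
int_labels = [label_map[name] for name in your_labels]  # [1, 2, 1, 0]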
[5]:
DATA_PATH = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/"
# Get list of .wav file names
# os.listdir order is nondeterministic, so for reproducibility,
# we sort first and then do a deterministic shuffle
file_names = sorted(i for i in os.listdir(DATA_PATH) if i.endswith(".wav"))
random.Random(SEED).shuffle(file_names)
file_paths = [os.path.join(DATA_PATH, name) for name in file_names]
# Check out first 3 files
file_paths[:3]
[5]:
['spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_george_26.wav',
'spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_24.wav',
'spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_6.wav']
Let's listen to some example audio clips from the dataset. We introduce a display_example function to process each .wav file so we can listen to it in this notebook (you can skip these details).
See the implementation of display_example (click to expand)

# Note: This pulldown content is for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.
import tensorflow_io as tfio
from pathlib import Path
from IPython import display
# Utility function for loading audio files and making sure the sample rate is correct.
@tf.function
def load_wav_16k_mono(filename):
"""Load a WAV file, convert it to a float tensor, resample to 16 kHz single-channel audio."""
file_contents = tf.io.read_file(filename)
wav, sample_rate = tf.audio.decode_wav(file_contents, desired_channels=1)
wav = tf.squeeze(wav, axis=-1)
sample_rate = tf.cast(sample_rate, dtype=tf.int64)
wav = tfio.audio.resample(wav, rate_in=sample_rate, rate_out=16000)
return wav
def display_example(wav_file_name, audio_rate=16000):
"""Allows us to listen to any wav file and displays its given label in the dataset."""
wav_file_example = load_wav_16k_mono(wav_file_name)
label = Path(wav_file_name).parts[-1].split("_")[0]
print(f"Given label for this example: {label}")
display.display(display.Audio(wav_file_example, rate=audio_rate))
Click the play button below to listen to this example .wav file. Feel free to change the wav_file_name_example variable below to listen to other audio clips in the dataset.
[7]:
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_jackson_43.wav" # change this to hear other examples
display_example(wav_file_name_example)
Given label for this example: 7
3. Use pre-trained SpeechBrain model to featurize audio#
The SpeechBrain package offers many PyTorch neural networks that have been pretrained for speech recognition tasks. Here we instantiate an audio feature extractor using SpeechBrain's EncoderClassifier. We'll use the "spkrec-xvect-voxceleb" network, which has been pre-trained on the VoxCeleb speech dataset.
[8]:
%%capture
from speechbrain.pretrained import EncoderClassifier
feature_extractor = EncoderClassifier.from_hparams(
"speechbrain/spkrec-xvect-voxceleb",
# run_opts={"device":"cuda"} # Uncomment this to run on GPU if you have one (optional)
)
Next, we run the audio clips through the pre-trained model to extract vector features (aka embeddings).
For this tutorial, ensure that you have ffmpeg installed on your system; this is the backend used for loading the audio files.
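To confirm that torchaudio can see this backend, one quick check (assuming a torchaudio version recent enough to dispatch backends by name, i.e. >= 2.1):

import torchaudio
print(torchaudio.list_audio_backends())  # "ffmpeg" should appear in this list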
[9]:
# Create dataframe with .wav file names
df = pd.DataFrame(file_paths, columns=["wav_audio_file_path"])
df["label"] = df.wav_audio_file_path.map(lambda x: int(Path(x).parts[-1].split("_")[0]))
df.head(3)
[9]:
   wav_audio_file_path                                                         label
0  spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_george_26.wav   7
1  spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_24.wav  0
2  spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/0_nicolas_6.wav   0
[10]:
import torchaudio
def extract_audio_embeddings(model, wav_audio_file_path: str) -> torch.Tensor:
    """Feature extractor that embeds audio into a vector."""
    signal, sample_rate = torchaudio.load(wav_audio_file_path, backend="ffmpeg")  # Load audio signal into a tensor
    embeddings = model.encode_batch(signal)  # Pass tensor through pretrained neural net and extract representation
    return embeddings
[11]:
# Extract audio embeddings
embeddings_list = []
for i, file_name in enumerate(df.wav_audio_file_path): # for each .wav file name
embeddings = extract_audio_embeddings(feature_extractor, file_name)
embeddings_list.append(embeddings.cpu().numpy())
embeddings_array = np.squeeze(np.array(embeddings_list))
Now we have our features in a 2D numpy array. Each row in the array corresponds to an audio clip. We’re now able to represent each audio clip as a 512-dimensional feature vector!
[12]:
print(embeddings_array)
print("Shape of array: ", embeddings_array.shape)
[[-14.196311 7.319459 12.478975 ... 2.2890875 2.8170238
-10.89265 ]
[-24.898056 5.256195 12.559641 ... -3.559721 9.62067
-10.285245 ]
[-21.709627 7.5033693 7.913803 ... -6.819831 3.1831515
-17.208763 ]
...
[-16.084257 6.3210397 12.005453 ... 1.216152 9.478235
-10.6821785 ]
[-15.053807 5.242471 1.091424 ... -0.78334856 9.03954
-23.569176 ]
[-19.761097 1.1258295 16.753237 ... 3.3508866 11.598274
-16.23712 ]]
Shape of array: (2500, 512)
4. Fit linear model and compute out-of-sample predicted probabilities#
A typical way to leverage pretrained networks for a particular classification task is to add a linear output layer and fine-tune the network parameters on the new data. However, this can be computationally intensive. Alternatively, we can freeze the pretrained weights of the network and train only the output layer, without having to rely on GPU(s). Here we do this conveniently by fitting a scikit-learn linear model on top of the extracted network embeddings.
To identify label issues, cleanlab requires a probabilistic prediction from your model for every datapoint under consideration. However, these predictions will be overfit (and thus unreliable) for datapoints the model was previously trained on. cleanlab is intended to only be used with out-of-sample predicted probabilities, i.e. predictions for datapoints that were held out from the model during training.
K-fold cross-validation is a straightforward way to produce out-of-sample predicted probabilities for every datapoint in the dataset: we train K copies of our model on different data subsets and use each copy to predict on the subset of data it did not see during training. We can obtain cross-validated out-of-sample predicted probabilities from any classifier via the cross_val_predict wrapper provided in scikit-learn. Make sure that the columns of your pred_probs are properly ordered with respect to the ordering of classes, which for Datalab is: lexicographically sorted by class name.
[13]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
model = LogisticRegression(C=0.01, max_iter=1000, tol=1e-2, random_state=SEED)
num_crossval_folds = 5 # can decrease this value to reduce runtime, or increase it to get better results
pred_probs = cross_val_predict(
estimator=model, X=embeddings_array, y=df.label.values, cv=num_crossval_folds, method="predict_proba"
)
For each audio clip, the corresponding predicted probabilities in pred_probs are produced by a copy of our LogisticRegression model that was never trained on that audio clip; hence we call these predictions out-of-sample. An additional benefit of cross-validation is that it provides a more reliable evaluation of our model than a single training/validation split.
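Before handing pred_probs to cleanlab, it can be worth sanity-checking their shape and normalization; a minimal sketch:

# Each of the 2500 rows should hold one probability per class (the 10 digits) and sum to 1
assert pred_probs.shape == (len(df), 10)
assert np.allclose(pred_probs.sum(axis=1), 1.0)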
[14]:
from sklearn.metrics import accuracy_score
predicted_labels = pred_probs.argmax(axis=1)
cv_accuracy = accuracy_score(df.label.values, predicted_labels)
print(f"Cross-validated estimate of accuracy on held-out data: {cv_accuracy}")
Cross-validated estimate of accuracy on held-out data: 0.9708
5. Use cleanlab to find label issues#
Based on the given labels and out-of-sample predicted probabilities, cleanlab can quickly help us identify label issues in our dataset. For a dataset with N examples from K classes, the labels should be a 1D array of length N and the predicted probabilities should be a 2D (N x K) array.
Here, we use cleanlab to find potential label errors in our data. Datalab has several ways of loading the data; in this case, we can simply pass the DataFrame created above to instantiate the object. We then pass the predicted probabilities to the find_issues() method so that Datalab can use them to find potential label errors in our data.
[15]:
lab = Datalab(df, label_name="label")
lab.find_issues(pred_probs=pred_probs, issue_types={"label":{}})
Finding label issues ...
Audit complete. 7 issues found in the dataset.
We can view the results of running Datalab by calling the report method:
[16]:
lab.report()
Dataset Information: num_examples: 2500, num_classes: 10
Here is a summary of various issues found in your data:
issue_type num_issues
label 7
Learn about each issue: https://docs.cleanlab.ai/stable/cleanlab/datalab/guide/issue_type_description.html
See which examples in your dataset exhibit each issue via: `datalab.get_issues(<ISSUE_NAME>)`
Data indices corresponding to top examples of each issue are shown below.
----------------------- label issues -----------------------
About this issue:
Examples whose given label is estimated to be potentially incorrect
(e.g. due to annotation error) are flagged as having label issues.
Number of examples with this issue: 7
Overall dataset quality in terms of this issue: 0.9976
Examples representing most severe instances of this issue:
is_label_issue label_score given_label predicted_label
986 True 0.002161 6 3
176 True 0.002483 7 8
2318 False 0.004411 3 6
1005 False 0.004857 0 9
1871 True 0.007494 6 8
We observe from the report that cleanlab has found some label issues in our dataset. Let us investigate these examples further.
We can view more details about the label quality of each example using the get_issues method, specifying label as the issue type.
[17]:
label_issues = lab.get_issues("label")
label_issues.head()
[17]:
   is_label_issue  label_score  given_label  predicted_label
0  False           0.040587     7            6
1  False           0.999207     0            0
2  False           0.999377     0            0
3  False           0.975220     8            8
4  False           0.999367     5            5
This method returns a dataframe containing a label quality score for each example. These numeric scores lie between 0 and 1, where lower scores indicate examples more likely to be mislabeled. The dataframe also contains a boolean column specifying whether or not each example is identified as having a label issue (indicating it is likely mislabeled).
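For instance, to get a quick feel for how these scores are distributed across the dataset, you could summarize the score column (an optional aside):

label_issues["label_score"].describe()  # summary statistics of the quality scores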
We can then filter for the examples that have been identified as label errors:
[18]:
identified_label_issues = label_issues[label_issues["is_label_issue"]]
lowest_quality_labels = identified_label_issues.sort_values("label_score").index
print(f"Here are indices of the most likely errors: \n {lowest_quality_labels.values}")
Here are indices of the most likely errors:
[ 986 176 1871 516 1946 469 2132]
These examples flagged by cleanlab are those worth inspecting more closely.
[19]:
df.iloc[lowest_quality_labels]
[19]:
      wav_audio_file_path                                                          label
986   spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_25.wav  6
176   spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/7_nicolas_43.wav   7
1871  spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_theo_27.wav      6
516   spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_36.wav  6
1946  spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_14.wav  6
469   spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_35.wav  6
2132  spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_nicolas_8.wav    6
Let's listen below to some of the audio clips that were identified as label issues.
In this example, the given label is 6 but it sounds like 8.
[20]:
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_14.wav"
display_example(wav_file_name_example)
Given label for this example: 6
In the three examples below, the given label is 6 but they sound quite ambiguous.
[21]:
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_36.wav"
display_example(wav_file_name_example)
Given label for this example: 6
[22]:
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_yweweler_35.wav"
display_example(wav_file_name_example)
Given label for this example: 6
[23]:
wav_file_name_example = "spoken_digits/free-spoken-digit-dataset-1.0.9/recordings/6_nicolas_8.wav"
display_example(wav_file_name_example)
Given label for this example: 6
You can see that even widely-used datasets like Spoken Digit contain problematic labels. Never blindly trust your data! You should always check it for potential issues, many of which can be easily identified by cleanlab.