data#

Classes and methods for datasets that are loaded into Datalab.

Exceptions:

DataFormatError(data)

Exception raised when the data is not in a supported format.

DatasetDictError()

Exception raised when a DatasetDict is passed to Datalab.

DatasetLoadError(dataset_type)

Exception raised when a dataset cannot be loaded.

Classes:

Data(data, task[, label_name])

Class that holds and validates datasets for Datalab.

Label(*, data[, label_name, map_to_int])

Class to represent labels in a dataset.

MultiLabel(data, label_name, map_to_int)

MultiClass(data, label_name, map_to_int)

exception cleanlab.datalab.internal.data.DataFormatError(data)[source]#

Bases: ValueError

Exception raised when the data is not in a supported format.

add_note()#

Exception.add_note(note) – add a note to the exception

args#
with_traceback()#

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception cleanlab.datalab.internal.data.DatasetDictError[source]#

Bases: ValueError

Exception raised when a DatasetDict is passed to Datalab.

Usually, this means that a dataset identifier was passed to Datalab, but the identifier refers to a DatasetDict, which contains multiple splits of the dataset rather than a single dataset.
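The usual remedy is to pick one split before handing the data to Datalab. The following is a minimal sketch of that step, not cleanlab code: a plain dict stands in for a DatasetDict so the snippet runs without the datasets package, and the helper `select_split`, the split names, and the records are all made up for illustration.

```python
# Stand-in for a DatasetDict: a mapping from split name to that split's records.
# (Hypothetical data; a real DatasetDict maps split names to datasets.Dataset objects.)
dataset_splits = {
    "train": [{"text": "good", "label": "pos"}, {"text": "bad", "label": "neg"}],
    "test": [{"text": "fine", "label": "pos"}],
}


def select_split(splits, name):
    """Hypothetical helper: return one split, with a clear error for unknown names."""
    if name not in splits:
        raise KeyError(f"Unknown split {name!r}; available: {sorted(splits)}")
    return splits[name]


# A single split is a list of dictionaries with the same keys,
# which is one of the formats Data accepts.
train_split = select_split(dataset_splits, "train")
print(len(train_split))
```

With the datasets package installed, the corresponding step is to index the DatasetDict by split name, or to pass `split="train"` to `load_dataset` so that a single Dataset is returned.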

add_note()#

Exception.add_note(note) – add a note to the exception

args#
with_traceback()#

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception cleanlab.datalab.internal.data.DatasetLoadError(dataset_type)[source]#

Bases: ValueError

Exception raised when a dataset cannot be loaded.

Parameters:

dataset_type (type) – The type of dataset that failed to load.

add_note()#

Exception.add_note(note) – add a note to the exception

args#
with_traceback()#

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class cleanlab.datalab.internal.data.Data(data, task, label_name=None)[source]#

Bases: object

Class that holds and validates datasets for Datalab.

Internally, the data is stored as a datasets.Dataset object and the labels are integers (ranging from 0 to K-1, where K is the number of classes) stored in a numpy array.

Parameters:
  • data (Union[Dataset, DataFrame, Dict[str, Any], List[Dict[str, Any]], str]) –

    Dataset to be audited by Datalab. Several formats are supported; each is converted internally to a Dataset object.

    Supported formats:
    • datasets.Dataset

    • pandas.DataFrame

    • dict
      • keys are strings

      • values are arrays or lists of equal length

    • list
      • list of dictionaries with the same keys

    • str
      • path to a local file
        • Text (.txt)

        • CSV (.csv)

        • JSON (.json)

      • or a dataset identifier on the Hugging Face Hub

    When a string is given, it is first checked against the local filesystem: if it is a path to a file that exists locally, that file is loaded; otherwise the string is assumed to be a dataset identifier on the Hugging Face Hub.

  • label_name (Union[str, List[str]]) – Name of the label column in the dataset. Optional; defaults to None.

  • task (Task) –

    The task associated with the dataset. This is used to determine how to format the labels.

    Note:

    • If the task is a classification task, the labels will be mapped to integers, e.g. [0, 1, …, K-1] where K is the number of classes. If the task is a regression task, the labels will not be mapped to integers.

    • If the task is a multilabel task, the labels will be formatted as a list of lists, e.g. [[0, 1], [1, 2], [0, 2]] where each sublist contains the labels for a single example. If the task is not a multilabel task, the labels will be formatted as a 1D numpy array.
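The string handling described for the data parameter can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `classify_data_string` is a hypothetical helper, and its handling of other file extensions is an assumption, since the documentation lists only .txt, .csv and .json.

```python
import os
import tempfile


def classify_data_string(data: str) -> str:
    """Hypothetical helper: report how a string `data` argument would be interpreted."""
    if os.path.isfile(data):
        # Local file: the documented formats are .txt, .csv and .json.
        extension = os.path.splitext(data)[1].lower()
        if extension in {".txt", ".csv", ".json"}:
            return f"local {extension} file"
        # Assumption: anything else is simply flagged as unsupported here.
        return "local file with an unsupported extension"
    # Not an existing local file, so it is assumed to be a Hub dataset identifier.
    return "Hugging Face Hub identifier"


# Create a throwaway CSV file so the local-path branch can be exercised.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "reviews.csv")
    with open(csv_path, "w") as handle:
        handle.write("text,label\ngood,pos\n")
    local_result = classify_data_string(csv_path)

# Made-up identifier that does not exist as a local file.
hub_result = classify_data_string("some-user/some-dataset-that-is-not-a-local-path")
print(local_result)
print(hub_result)
```

One consequence of this order of checks is that a mistyped local path is treated as a Hub identifier, so a load failure for a string argument is worth checking against the filesystem first.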

Warning

Optional dependencies:

  • datasets :

    Dataset, DatasetDict and load_dataset are imported from datasets. This is an optional dependency of cleanlab, but is required for Datalab to work.
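Because datasets is optional for cleanlab but required here, it can help to check for it before constructing a Datalab. A minimal sketch using only the standard library follows; `is_installed` is a hypothetical helper, not part of cleanlab.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("datasets"):
    message = "datasets is available"
else:
    message = "datasets is missing; install it before using Datalab"
print(message)
```

If the package is missing, `pip install datasets` adds it.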

Attributes:

class_names

has_labels

Check if labels are available.

property class_names: List[str]#
property has_labels: bool#

Check if labels are available.

class cleanlab.datalab.internal.data.Label(*, data, label_name=None, map_to_int=True)[source]#

Bases: ABC

Class to represent labels in a dataset.

It stores the labels as a numpy array and maps them to integers if necessary. If a mapping is not necessary, e.g. for regression tasks, the mapping will be an empty dictionary.

Parameters:
  • data (Dataset) – A Hugging Face Dataset object.

  • label_name (str) – Name of the label column in the dataset.

  • map_to_int (bool) – Whether to map the labels to integers, e.g. [0, 1, …, K-1] where K is the number of classes. If False, the labels are not mapped to integers, e.g. for regression tasks.
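The effect of map_to_int can be illustrated with a small pure-Python sketch. It is not the library's implementation: `build_label_map` is a hypothetical helper, and both the sorted ordering of class names and the integer-to-name direction of the mapping are assumptions made for the example.

```python
def build_label_map(labels, map_to_int=True):
    """Hypothetical helper: return (formatted_labels, integer-to-class-name mapping)."""
    if not map_to_int:
        # e.g. regression: labels are kept as given and the mapping stays empty.
        return list(labels), {}
    # Assumption: classes are numbered 0..K-1 in sorted order.
    class_names = sorted(set(labels))
    name_to_int = {name: index for index, name in enumerate(class_names)}
    int_labels = [name_to_int[label] for label in labels]
    int_to_name = {index: name for name, index in name_to_int.items()}
    return int_labels, int_to_name


class_labels, class_map = build_label_map(["cat", "dog", "cat", "bird"])
print(class_labels)  # [1, 2, 1, 0]
print(class_map)     # {0: 'bird', 1: 'cat', 2: 'dog'}

regression_labels, regression_map = build_label_map([0.5, 1.25, 3.0], map_to_int=False)
print(regression_map)  # {}
```

Here K is 3, so the mapped labels fall in the range 0 to 2, while the regression labels pass through unchanged with an empty mapping.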

Attributes:

class_names

A list of class names that are present in the dataset.

is_available

Check if labels are available.

property class_names: List[str]#

A list of class names that are present in the dataset.

Without labels, this will return an empty list.

property is_available: bool#

Check if labels are available.

class cleanlab.datalab.internal.data.MultiLabel(data, label_name, map_to_int)[source]#

Bases: Label

Attributes:

class_names

A list of class names that are present in the dataset.

is_available

Check if labels are available.

property class_names: List[str]#

A list of class names that are present in the dataset.

Without labels, this will return an empty list.

property is_available: bool#

Check if labels are available.

class cleanlab.datalab.internal.data.MultiClass(data, label_name, map_to_int)[source]#

Bases: Label

Attributes:

class_names

A list of class names that are present in the dataset.

is_available

Check if labels are available.

property class_names: List[str]#

A list of class names that are present in the dataset.

Without labels, this will return an empty list.

property is_available: bool#

Check if labels are available.