data

Classes and methods for datasets that are loaded into Datalab.

Exceptions:

`DataFormatError`(data)	Exception raised when the data is not in a supported format.
`DatasetDictError`()	Exception raised when a DatasetDict is passed to Datalab.
`DatasetLoadError`(dataset_type)	Exception raised when a dataset cannot be loaded.

Classes:

`Data`(data, task[, label_name])	Class that holds and validates datasets for Datalab.
`Label`(*, data[, label_name, map_to_int])	Class to represent labels in a dataset.
`MultiLabel`(data, label_name, map_to_int)
`MultiClass`(data, label_name, map_to_int)

exception cleanlab.datalab.internal.data.DataFormatError(data)

Bases: ValueError

Exception raised when the data is not in a supported format.

add_note(): Exception.add_note(note) – add a note to the exception

args

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception cleanlab.datalab.internal.data.DatasetDictError

Bases: ValueError

Exception raised when a DatasetDict is passed to Datalab.

Usually, this means that a dataset identifier was passed to Datalab, but the dataset is a DatasetDict, which contains multiple splits of the dataset.
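As a hedged illustration of the situation described above, the sketch below uses a plain dict of split names as a stand-in for a DatasetDict (an assumption, so that it runs without the datasets library). The helper `select_split` is hypothetical; it picks one split, which is the kind of single-table object Datalab expects.

```python
from typing import Any, Dict, List

# Stand-in for a DatasetDict: split name -> rows. A real DatasetDict comes
# from the `datasets` library; a plain dict is used here for illustration.
Rows = List[Dict[str, Any]]


def select_split(splits: Dict[str, Rows], name: str) -> Rows:
    """Return one split, failing loudly if the split name is unknown."""
    if name not in splits:
        available = ", ".join(sorted(splits))
        raise ValueError(f"Unknown split {name!r}; available splits: {available}")
    return splits[name]


splits = {
    "train": [{"text": "good", "label": 1}, {"text": "bad", "label": 0}],
    "test": [{"text": "fine", "label": 1}],
}
train = select_split(splits, "train")
print(len(train))
```

With the datasets library itself, indexing a DatasetDict by split name, or passing `split=` to `load_dataset`, yields a single Dataset.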

add_note(): Exception.add_note(note) – add a note to the exception

args

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception cleanlab.datalab.internal.data.DatasetLoadError(dataset_type)

Bases: ValueError

Exception raised when a dataset cannot be loaded.

Parameters: dataset_type (type) – The type of dataset that failed to load.

add_note(): Exception.add_note(note) – add a note to the exception

args

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
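All three exceptions above derive from ValueError, so a caller can handle them with a single `except ValueError` clause. The sketch below shows that pattern with stand-in classes; the names `SketchFormatError` and `SketchLoadError` and the message wording are hypothetical and are not the library's definitions.

```python
class SketchFormatError(ValueError):
    """Stand-in for an exception such as DataFormatError (hypothetical)."""


class SketchLoadError(ValueError):
    """Stand-in for an exception such as DatasetLoadError (hypothetical)."""

    def __init__(self, dataset_type: type):
        # The message wording is an assumption made for this sketch.
        super().__init__(f"Failed to load dataset from {dataset_type.__name__}.")
        self.dataset_type = dataset_type


def describe_failure(exc: Exception) -> str:
    """Raise exc and report how a generic ValueError handler sees it."""
    try:
        raise exc
    except ValueError as err:
        return f"{type(err).__name__}: {err}"


load_report = describe_failure(SketchLoadError(str))
format_report = describe_failure(SketchFormatError("unsupported input"))
print(load_report)
print(format_report)
```

Catching the specific subclass first and ValueError second lets a caller give a tailored message for known failure modes while still handling the rest.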

class cleanlab.datalab.internal.data.Data(data, task, label_name=None)

Bases: object

Class that holds and validates datasets for Datalab.

Internally, the data is stored as a datasets.Dataset object and the labels are integers (ranging from 0 to K-1, where K is the number of classes) stored in a numpy array.

Parameters:

data (Union[Dataset, DataFrame, Dict[str, Any], List[Dict[str, Any]], str]) –
Dataset to be audited by Datalab. Several formats are supported; each is converted internally to a Dataset object.
Supported formats:
- datasets.Dataset
- pandas.DataFrame
- dict: keys are strings; values are arrays or lists of equal length
- list: list of dictionaries with the same keys
- str: either
  - a path to a local file: Text (.txt), CSV (.csv), or JSON (.json)
  - a dataset identifier on the Hugging Face Hub
A string is first checked against the local filesystem: if it is a path to a file that exists locally, that file is loaded; otherwise the string is assumed to be a dataset identifier on the Hugging Face Hub.
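The resolution rule for string inputs can be sketched as follows. This is a minimal illustration, not the library's code: `classify_source` is a hypothetical helper, and raising ValueError for an unlisted extension is an assumption of the sketch.

```python
import os
import tempfile

# Extensions listed above as supported local file types.
LOCAL_EXTENSIONS = {".txt": "text", ".csv": "csv", ".json": "json"}


def classify_source(source: str) -> str:
    """Hypothetical helper: decide how a string `data` argument would be read."""
    if os.path.isfile(source):
        ext = os.path.splitext(source)[1].lower()
        if ext not in LOCAL_EXTENSIONS:
            # Assumption of this sketch: unlisted extensions are rejected.
            raise ValueError(f"Unsupported file extension: {ext!r}")
        return f"local {LOCAL_EXTENSIONS[ext]} file"
    # Not an existing local file, so treat it as a Hub dataset identifier.
    return "hub identifier"


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "reviews.csv")
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write("text,label\ngood,1\n")
    local_kind = classify_source(csv_path)

hub_kind = classify_source("some-user/some-dataset-that-is-not-a-local-file")
print(local_kind, "|", hub_kind)
```

One consequence of this rule is that a mistyped local path is not reported as a missing file; it falls through and is treated as a Hub identifier.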
label_name (Union[str, List[str]], optional) – Name of the label column in the dataset. Defaults to None.
task (Task) –
The task associated with the dataset. This is used to determine how to format the labels.

Note:
- If the task is a classification task, the labels will be mapped to integers, e.g. [0, 1, …, K-1] where K is the number of classes. If the task is a regression task, the labels will not be mapped to integers.
- If the task is a multilabel task, the labels will be formatted as a list of lists, e.g. [[0, 1], [1, 2], [0, 2]] where each sublist contains the labels for a single example. If the task is not a multilabel task, the labels will be formatted as a 1D numpy array.
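The label formats in the note can be sketched in plain Python. This is only an illustration under stated assumptions: `map_to_integers` is a hypothetical helper, assigning integers in sorted order of the unique values is an assumption, and plain lists are used where the library stores numpy arrays.

```python
from typing import Any, Dict, List, Tuple


def map_to_integers(labels: List[Any]) -> Tuple[List[int], Dict[int, Any]]:
    """Map class labels to 0..K-1; assumes sorted order of unique values."""
    classes = sorted(set(labels))
    to_int = {label: i for i, label in enumerate(classes)}
    int_to_class = {i: label for label, i in to_int.items()}
    return [to_int[label] for label in labels], int_to_class


# Classification: string labels become integers in [0, K-1].
encoded, mapping = map_to_integers(["cat", "dog", "cat", "bird"])
print(encoded)  # [1, 2, 1, 0]
print(mapping)  # {0: 'bird', 1: 'cat', 2: 'dog'}

# Regression: labels are kept as they are, with no integer mapping.
regression_labels = [0.5, 1.25, 3.0]

# Multilabel: one list of integer class indices per example.
multilabel_labels = [[0, 1], [1, 2], [0, 2]]
num_classes = 1 + max(i for row in multilabel_labels for i in row)
print(num_classes)  # 3
```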

Warning

Optional dependencies:

datasets :
Dataset, DatasetDict and load_dataset are imported from datasets. This is an optional dependency of cleanlab, but is required for Datalab to work.

Attributes:

`class_names`
`has_labels`	Check if labels are available.

property class_names: List[str]

property has_labels: bool: Check if labels are available.

class cleanlab.datalab.internal.data.Label(*, data, label_name=None, map_to_int=True)

Bases: ABC

Class to represent labels in a dataset.

It stores the labels as a numpy array and maps them to integers if necessary. If a mapping is not necessary, e.g. for regression tasks, the mapping will be an empty dictionary.

Parameters:

data (Dataset) – A Hugging Face Dataset object.
label_name (str) – Name of the label column in the dataset.
map_to_int (bool) – Whether to map the labels to integers, e.g. [0, 1, …, K-1] where K is the number of classes. If False, the labels are not mapped to integers, e.g. for regression tasks.

Attributes:

`class_names`	A list of class names that are present in the dataset.
`is_available`	Check if labels are available.

property class_names: List[str]

A list of class names that are present in the dataset.

Without labels, this will return an empty list.

property is_available: bool: Check if labels are available.
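The documented behaviour of `class_names`, `is_available`, and the mapping can be mimicked with a small stand-in class. `SketchLabel` and its `label_map` attribute are hypothetical names chosen for this sketch; it uses plain lists instead of a numpy array and assumes sorted class order, so it should be read as an illustration rather than the library's implementation.

```python
from typing import Any, Dict, List, Optional


class SketchLabel:
    """Hypothetical stand-in mirroring the documented Label behaviour."""

    def __init__(self, values: Optional[List[Any]], map_to_int: bool = True):
        self.values: List[Any] = list(values) if values is not None else []
        self.label_map: Dict[int, Any] = {}  # stays empty when no mapping is needed
        if self.values and map_to_int:
            classes = sorted(set(self.values))  # ordering is an assumption
            index = {c: i for i, c in enumerate(classes)}
            self.label_map = dict(enumerate(classes))
            self.values = [index[v] for v in self.values]

    @property
    def is_available(self) -> bool:
        return len(self.values) > 0

    @property
    def class_names(self) -> List[str]:
        return [str(name) for name in self.label_map.values()]


with_labels = SketchLabel(["spam", "ham", "spam"])
regression = SketchLabel([0.1, 2.5], map_to_int=False)
no_labels = SketchLabel(None)

print(with_labels.class_names, with_labels.is_available)
print(regression.label_map, regression.is_available)
print(no_labels.class_names, no_labels.is_available)
```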

class cleanlab.datalab.internal.data.MultiLabel(data, label_name, map_to_int)

Bases: Label

Attributes:

`class_names`	A list of class names that are present in the dataset.
`is_available`	Check if labels are available.

property class_names: List[str]

A list of class names that are present in the dataset.

Without labels, this will return an empty list.

property is_available: bool: Check if labels are available.

class cleanlab.datalab.internal.data.MultiClass(data, label_name, map_to_int)

Bases: Label

Attributes:

`class_names`	A list of class names that are present in the dataset.
`is_available`	Check if labels are available.

property class_names: List[str]

A list of class names that are present in the dataset.

Without labels, this will return an empty list.

property is_available: bool: Check if labels are available.