data#
Classes and methods for datasets that are loaded into Datalab.
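Exceptions:
DataFormatError(data) | Exception raised when the data is not in a supported format.
DatasetDictError | Exception raised when a DatasetDict is passed to Datalab.
DatasetLoadError(dataset_type) | Exception raised when a dataset cannot be loaded.
Classes:
Data(data, task, label_name=None) | Class that holds and validates datasets for Datalab.
Label(*, data, label_name=None, map_to_int=True) | Class to represent labels in a dataset.
MultiLabel(data, label_name, map_to_int) |
MultiClass(data, label_name, map_to_int) |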
- exception cleanlab.datalab.internal.data.DataFormatError(data)[source]#
 Bases: ValueError
Exception raised when the data is not in a supported format.
- add_note()#
 Exception.add_note(note) – add a note to the exception
- args#
 
- with_traceback()#
 Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception cleanlab.datalab.internal.data.DatasetDictError[source]#
 Bases: ValueError
Exception raised when a DatasetDict is passed to Datalab.
Usually, this means that a dataset identifier was passed to Datalab, but the dataset is a DatasetDict, which contains multiple splits of the dataset.
- add_note()#
 Exception.add_note(note) – add a note to the exception
- args#
 
- with_traceback()#
 Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception cleanlab.datalab.internal.data.DatasetLoadError(dataset_type)[source]#
 Bases: ValueError
Exception raised when a dataset cannot be loaded.
- Parameters:
 dataset_type (type) – The type of dataset that failed to load.
- add_note()#
 Exception.add_note(note) – add a note to the exception
- args#
 
- with_traceback()#
 Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
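Because all three exceptions derive from ValueError, a caller can handle them individually or together with a single `except ValueError` clause. The sketch below illustrates this with local stand-in classes that only mirror the documented names and base class, so that it runs without cleanlab installed. The stand-ins, `describe_failure`, and `bad_format` are hypothetical and are not part of cleanlab.

```python
class DataFormatError(ValueError):
    """Local stand-in mirroring the documented name and base class."""


class DatasetDictError(ValueError):
    """Local stand-in mirroring the documented name and base class."""


class DatasetLoadError(ValueError):
    """Local stand-in mirroring the documented name and base class."""


def describe_failure(loader):
    """Call a hypothetical zero-argument loader and report how it failed."""
    try:
        loader()
    except DatasetDictError:
        # Handle one specific error first ...
        return "got a DatasetDict: select a single split first"
    except ValueError as err:
        # ... then fall back to the shared base class for the others.
        return f"could not use the data: {type(err).__name__}"
    return "ok"


def bad_format():
    """Hypothetical loader that always fails with a format error."""
    raise DataFormatError("unsupported data format")
```

With the real classes, the same `try`/`except` structure would wrap the code that constructs the dataset.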
- class cleanlab.datalab.internal.data.Data(data, task, label_name=None)[source]#
 Bases: object
Class that holds and validates datasets for Datalab.
Internally, the data is stored as a datasets.Dataset object and the labels are integers (ranging from 0 to K-1, where K is the number of classes) stored in a numpy array.
- Parameters:
 data (Union[Dataset, DataFrame, Dict[str, Any], List[Dict[str, Any]], str]) – Dataset to be audited by Datalab. Several formats are supported, each of which is converted internally to a Dataset object.
- Supported formats:
 - datasets.Dataset
 - pandas.DataFrame
 - dict
  - keys are strings
  - values are arrays or lists of equal length
 - list
  - list of dictionaries with the same keys
 - str
  - path to a local file
   - Text (.txt)
   - CSV (.csv)
   - JSON (.json)
  - or a dataset identifier on the Hugging Face Hub
A string is first checked as a path to a file that exists locally; if no such file exists, it is assumed to be a dataset identifier on the Hugging Face Hub.
label_name (Union[str, List[str]]) – Name of the label column in the dataset.
task (Task) – The task associated with the dataset. This is used to determine how to format the labels.
Note:
If the task is a classification task, the labels will be mapped to integers, e.g. [0, 1, …, K-1] where K is the number of classes. If the task is a regression task, the labels will not be mapped to integers.
If the task is a multilabel task, the labels will be formatted as a list of lists, e.g. [[0, 1], [1, 2], [0, 2]] where each sublist contains the labels for a single example. If the task is not a multilabel task, the labels will be formatted as a 1D numpy array.
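The supported formats for `data` and the rule for resolving a string can be sketched in plain Python. This is a minimal illustration of the documented rules, not cleanlab's implementation: `describe_source` is a hypothetical helper, the error messages are invented, and the `datasets.Dataset` and `pandas.DataFrame` cases are left out to keep the sketch dependency-free.

```python
import os
from typing import Any


def describe_source(data: Any) -> str:
    """Classify an input according to the documented formats (hypothetical helper)."""
    if isinstance(data, dict):
        # dict: keys are strings, values are arrays or lists of equal length
        if not all(isinstance(key, str) for key in data):
            raise ValueError("dict keys must be strings")
        if len({len(column) for column in data.values()}) > 1:
            raise ValueError("dict values must have equal length")
        return "dict of columns"
    if isinstance(data, list):
        # list: dictionaries that all share the same keys
        if not all(isinstance(row, dict) for row in data):
            raise ValueError("list items must be dictionaries")
        if len({frozenset(row) for row in data}) > 1:
            raise ValueError("all dictionaries must have the same keys")
        return "list of rows"
    if isinstance(data, str):
        # str: a local file if one exists at that path, otherwise a Hub identifier
        if os.path.isfile(data):
            extension = os.path.splitext(data)[1].lower()
            if extension not in {".txt", ".csv", ".json"}:
                raise ValueError(f"unsupported file extension: {extension}")
            return f"local file ({extension})"
        return "Hugging Face Hub identifier"
    raise ValueError(f"unsupported type: {type(data).__name__}")
```

For example, `describe_source({"text": ["a", "b"], "label": [0, 1]})` classifies the input as a dict of columns, while a string that does not name an existing local file is treated as a Hub identifier.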
Warning
Optional dependencies:
- datasets :
 Dataset, DatasetDict and load_dataset are imported from datasets. This is an optional dependency of cleanlab, but is required for Datalab to work.
Attributes:
class_names |
has_labels | Check if labels are available.
- property class_names: List[str]#
 
- property has_labels: bool#
 Check if labels are available.
- class cleanlab.datalab.internal.data.Label(*, data, label_name=None, map_to_int=True)[source]#
 Bases: ABC
Class to represent labels in a dataset.
It stores the labels as a numpy array and maps them to integers if necessary. If a mapping is not necessary, e.g. for regression tasks, the mapping will be an empty dictionary.
- Parameters:
 data (Dataset) – A Hugging Face Dataset object.
label_name (str) – Name of the label column in the dataset.
map_to_int (bool) – Whether to map the labels to integers, e.g. [0, 1, …, K-1] where K is the number of classes. If False, the labels are not mapped to integers, e.g. for regression tasks.
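The effect of `map_to_int` can be illustrated with a small dependency-free sketch. It is not cleanlab's code: `map_labels` is a hypothetical helper, plain lists stand in for the numpy array, and assigning integers in sorted order of the distinct label values is an assumption made for this example.

```python
from typing import Any, Dict, List, Tuple


def map_labels(labels: List[Any], map_to_int: bool = True) -> Tuple[List[Any], Dict[int, Any]]:
    """Return (formatted labels, integer-to-class-name mapping). Hypothetical helper."""
    if not map_to_int:
        # No mapping needed (e.g. regression): labels unchanged, empty mapping.
        return list(labels), {}
    # Assumption: integers 0..K-1 are assigned in sorted order of the distinct labels.
    class_names = sorted(set(labels))
    to_int = {name: index for index, name in enumerate(class_names)}
    int_labels = [to_int[label] for label in labels]
    return int_labels, {index: name for name, index in to_int.items()}


int_labels, mapping = map_labels(["cat", "dog", "cat", "bird"])
# int_labels == [1, 2, 1, 0]
# mapping == {0: "bird", 1: "cat", 2: "dog"}
```

With `map_to_int=False`, the labels come back unchanged and the mapping is an empty dictionary, matching the behavior described for regression tasks.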
Attributes:
class_names | A list of class names that are present in the dataset.
is_available | Check if labels are available.
- property class_names: List[str]#
 A list of class names that are present in the dataset.
Without labels, this will return an empty list.
- property is_available: bool#
 Check if labels are available.
- class cleanlab.datalab.internal.data.MultiLabel(data, label_name, map_to_int)[source]#
 Bases: Label
Attributes:
class_names | A list of class names that are present in the dataset.
is_available | Check if labels are available.
- property class_names: List[str]#
 A list of class names that are present in the dataset.
Without labels, this will return an empty list.
- property is_available: bool#
 Check if labels are available.
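The multilabel format described for the `task` parameter of Data (a list of lists, one sublist per example) can be sketched as follows. This page does not specify how MultiLabel builds that structure, so the helper below is purely illustrative: `map_multilabels` is a hypothetical name, and assigning integers in sorted order of the class names is an assumption.

```python
from typing import List, Tuple


def map_multilabels(labels: List[List[str]]) -> Tuple[List[List[int]], List[str]]:
    """Map each example's class names to integers (hypothetical helper)."""
    # Collect every class name that appears in any example.
    class_names = sorted({name for example in labels for name in example})
    to_int = {name: index for index, name in enumerate(class_names)}
    # Keep one sublist per example, replacing names with their integers.
    formatted = [[to_int[name] for name in example] for example in labels]
    return formatted, class_names


formatted, class_names = map_multilabels([["a", "b"], ["b", "c"], ["a", "c"]])
# formatted == [[0, 1], [1, 2], [0, 2]]
# class_names == ["a", "b", "c"]
```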
- class cleanlab.datalab.internal.data.MultiClass(data, label_name, map_to_int)[source]#
 Bases: Label
Attributes:
class_names | A list of class names that are present in the dataset.
is_available | Check if labels are available.
- property class_names: List[str]#
 A list of class names that are present in the dataset.
Without labels, this will return an empty list.
- property is_available: bool#
 Check if labels are available.