data

Classes and methods for datasets that are loaded into Datalab.

Exceptions:

`DataFormatError`(data)	Exception raised when the data is not in a supported format.
`DatasetDictError`()	Exception raised when a DatasetDict is passed to Datalab.
`DatasetLoadError`(dataset_type)	Exception raised when a dataset cannot be loaded.

Classes:

`Data`(data[, label_name])	Class that holds and validates datasets for Datalab.
`Label`(*, data[, label_name])	Class to represent labels in a dataset.

exception cleanlab.datalab.internal.data.DataFormatError(data)

Bases: ValueError

Exception raised when the data is not in a supported format.

args

with_traceback(tb): Set self.__traceback__ to tb and return self.

exception cleanlab.datalab.internal.data.DatasetDictError

Bases: ValueError

Exception raised when a DatasetDict is passed to Datalab.

Usually, this means that a dataset identifier was passed to Datalab, but the dataset is a DatasetDict, which contains multiple splits of the dataset.

args

with_traceback(tb): Set self.__traceback__ to tb and return self.

exception cleanlab.datalab.internal.data.DatasetLoadError(dataset_type)

Bases: ValueError

Exception raised when a dataset cannot be loaded.

Parameters: dataset_type (type) – The type of dataset that failed to load.

args

with_traceback(tb): Set self.__traceback__ to tb and return self.

class cleanlab.datalab.internal.data.Data(data, label_name=None)

Bases: object

Class that holds and validates datasets for Datalab.

Internally, the data is stored as a datasets.Dataset object and the labels are integers (ranging from 0 to K-1, where K is the number of classes) stored in a numpy array.
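The integer encoding described above can be sketched in plain Python. This is only an illustration, not cleanlab's implementation: `encode_labels` is a hypothetical helper, it returns lists rather than a numpy array, and it assumes classes are numbered in sorted order of their string form.

```python
from typing import Hashable, List, Sequence, Tuple


def encode_labels(raw_labels: Sequence[Hashable]) -> Tuple[List[int], List[str]]:
    """Map raw labels to integers 0..K-1 (hypothetical helper, not cleanlab API)."""
    # Assumption: class indices follow the sorted order of the class names.
    class_names = sorted({str(label) for label in raw_labels})
    index_of = {name: i for i, name in enumerate(class_names)}
    # Each example's label becomes the index of its class name.
    encoded = [index_of[str(label)] for label in raw_labels]
    return encoded, class_names


encoded, class_names = encode_labels(["dog", "cat", "dog", "bird"])
print(encoded)      # integers in the range 0..K-1, here K = 3
print(class_names)  # the K distinct class names
```

With no labels at all, the helper returns two empty lists, which mirrors the documented behavior of `class_names` returning an empty list when labels are unavailable.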

Parameters:

data (Union[Dataset, DataFrame, Dict[str, Any], List[Dict[str, Any]], str]) –
Dataset to be audited by Datalab. Several formats are supported, which will internally be converted to a Dataset object.
Supported formats:
- datasets.Dataset
- pandas.DataFrame
- dict
  - keys are strings
  - values are arrays or lists of equal length
- list
  - list of dictionaries with the same keys
- str
  - path to a local file: Text (.txt), CSV (.csv), or JSON (.json)
  - or a dataset identifier on the Hugging Face Hub
For a string, Datalab first checks whether it is a path to a file that exists locally; if not, it treats the string as a dataset identifier on the Hugging Face Hub.
label_name (Union[str, List[str]]) – Name of the label column in the dataset.
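The way a string argument is resolved (local file first, Hub identifier otherwise) could look roughly like the sketch below. It uses only the standard library and does not load any data. The function name `classify_string_source`, the `LOCAL_FILE_TYPES` mapping, and the returned tuples are assumptions made for illustration, not part of cleanlab.

```python
import os
from typing import Tuple

# Assumption: file types are recognized by extension, per the list above.
LOCAL_FILE_TYPES = {".txt": "text", ".csv": "csv", ".json": "json"}


def classify_string_source(source: str) -> Tuple[str, str]:
    """Decide how a string dataset argument would be interpreted (hypothetical)."""
    if os.path.exists(source):
        # The string names something on disk: treat it as a local file.
        extension = os.path.splitext(source)[1].lower()
        if extension not in LOCAL_FILE_TYPES:
            raise ValueError(f"Unsupported file extension: {extension!r}")
        return ("local_file", LOCAL_FILE_TYPES[extension])
    # Nothing on disk with that name: assume a Hugging Face Hub identifier.
    return ("hub_identifier", source)
```

One consequence of this order of checks is that a mistyped local path is silently treated as a Hub identifier, so a load failure for a dataset you expected to find on disk is worth checking against the path first.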

Warning

Optional dependencies:

datasets:
Dataset, DatasetDict and load_dataset are imported from datasets. This is an optional dependency of cleanlab, but is required for Datalab to work.

Attributes:

`class_names`	Return type: List[str]
`has_labels`	Check if labels are available.

property class_names: List[str]

Return type: List[str]

property has_labels: bool

Check if labels are available.

Return type: bool

class cleanlab.datalab.internal.data.Label(*, data, label_name=None)

Bases: object

Class to represent labels in a dataset.

Attributes:

`class_names`	A list of class names that are present in the dataset.
`is_available`	Check if labels are available.

property class_names: List[str]

A list of class names that are present in the dataset.

Without labels, this will return an empty list.

Return type: List[str]

property is_available: bool

Check if labels are available.

Return type: bool