data
Classes and methods for datasets that are loaded into Datalab.
Exceptions:
DataFormatError | Exception raised when the data is not in a supported format. |
DatasetDictError | Exception raised when a DatasetDict is passed to Datalab. |
DatasetLoadError | Exception raised when a dataset cannot be loaded. |
Classes:
Data | Class that holds and validates datasets for Datalab. |
Label | Class to represent labels in a dataset. |
MultiLabel | |
MultiClass | |
- exception cleanlab.datalab.internal.data.DataFormatError(data)
Bases: ValueError
Exception raised when the data is not in a supported format.
- add_note()
Exception.add_note(note) – add a note to the exception
- args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception cleanlab.datalab.internal.data.DatasetDictError
Bases: ValueError
Exception raised when a DatasetDict is passed to Datalab.
Usually, this means that a dataset identifier was passed to Datalab, but the dataset is a DatasetDict, which contains multiple splits of the dataset.
- add_note()
Exception.add_note(note) – add a note to the exception
- args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception cleanlab.datalab.internal.data.DatasetLoadError(dataset_type)
Bases: ValueError
Exception raised when a dataset cannot be loaded.
- Parameters:
dataset_type (type) – The type of dataset that failed to load.
- add_note()
Exception.add_note(note) – add a note to the exception
- args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
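Because all three exceptions above list ValueError as their base, a caller can presumably handle them with a single `except ValueError` clause. The sketch below illustrates that pattern with hypothetical stand-in classes that mirror only the documented base class; `try_load` and `describe_failure` are illustrative names, not part of cleanlab. With cleanlab installed, the real classes would be imported from `cleanlab.datalab.internal.data` instead.

```python
# Hypothetical stand-ins: they copy only the documented inheritance (ValueError),
# not the real constructors or messages of the cleanlab exceptions.
class DataFormatError(ValueError):
    """Stand-in for: data is not in a supported format."""


class DatasetDictError(ValueError):
    """Stand-in for: a DatasetDict was passed to Datalab."""


class DatasetLoadError(ValueError):
    """Stand-in for: a dataset cannot be loaded."""


def describe_failure(exc: ValueError) -> str:
    # Report which concrete exception type was raised.
    return f"{type(exc).__name__}: could not prepare dataset"


def try_load(loader):
    """Call `loader()` and turn any ValueError subclass into a short message."""
    try:
        return loader()
    except ValueError as exc:  # also catches all three subclasses above
        return describe_failure(exc)
```

Catching the shared base keeps calling code short; catching a specific subclass first is still possible when one failure mode needs separate handling.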
- class cleanlab.datalab.internal.data.Data(data, task, label_name=None)
Bases: object
Class that holds and validates datasets for Datalab.
Internally, the data is stored as a datasets.Dataset object and the labels are integers (ranging from 0 to K-1, where K is the number of classes) stored in a numpy array.
- Parameters:
data (Union[Dataset, DataFrame, Dict[str, Any], List[Dict[str, Any]], str]) – Dataset to be audited by Datalab. Several formats are supported, which will internally be converted to a Dataset object.
- Supported formats:
  - datasets.Dataset
  - pandas.DataFrame
  - dict: keys are strings, values are arrays or lists of equal length
  - list: list of dictionaries with the same keys
  - str: path to a local file (Text (.txt), CSV (.csv), or JSON (.json)), or a dataset identifier on the Hugging Face Hub
If a string is given and it is a path to a file that exists locally, it is treated as a local file; otherwise it is assumed to be a dataset identifier on the Hugging Face Hub.
label_name (Union[str, List[str]]) – Name of the label column in the dataset.
task (Task) – The task associated with the dataset. This is used to determine how to format the labels.
Note:
If the task is a classification task, the labels will be mapped to integers, e.g. [0, 1, …, K-1] where K is the number of classes. If the task is a regression task, the labels will not be mapped to integers.
If the task is a multilabel task, the labels will be formatted as a list of lists, e.g. [[0, 1], [1, 2], [0, 2]] where each sublist contains the labels for a single example. If the task is not a multilabel task, the labels will be formatted as a 1D numpy array.
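The task-dependent label formatting described in the note can be sketched roughly as follows. This is a minimal illustration, not cleanlab's implementation: `format_labels` is a hypothetical helper, the task is passed as a plain string rather than a Task object, plain lists stand in for the numpy array, and assigning integers in sorted class order is an assumption.

```python
from typing import Any, Dict, List, Sequence, Tuple


def format_labels(labels: Sequence[Any], task: str) -> Tuple[List[Any], Dict[Any, int]]:
    """Hypothetical helper: return (formatted_labels, class_to_int_mapping)."""
    if task == "regression":
        # Regression targets are kept as given; no integer mapping is built.
        return list(labels), {}
    if task == "multilabel":
        # Each example carries a list of labels, so the result is a list of lists.
        classes = sorted({c for row in labels for c in row})
        mapping = {c: i for i, c in enumerate(classes)}
        return [[mapping[c] for c in row] for row in labels], mapping
    if task == "classification":
        # One label per example, mapped to integers 0..K-1 (sorted order assumed).
        classes = sorted(set(labels))
        mapping = {c: i for i, c in enumerate(classes)}
        return [mapping[c] for c in labels], mapping
    raise ValueError(f"Unknown task: {task!r}")
```

For example, `format_labels([["a", "b"], ["b", "c"], ["a", "c"]], "multilabel")` yields the list-of-lists shape used in the note, while the classification branch yields a flat sequence of integers.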
Warning
Optional dependencies:
- datasets: Dataset, DatasetDict and load_dataset are imported from datasets. This is an optional dependency of cleanlab, but is required for Datalab to work.
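The string handling described for the data parameter (local file first, Hugging Face Hub identifier otherwise) can be sketched as below. This is only an illustration under stated assumptions: `classify_source` and `SUPPORTED_EXTENSIONS` are hypothetical names, and raising a plain ValueError for an unsupported extension is an assumption about behavior the text does not spell out.

```python
import os
from typing import Tuple

# File extensions listed as supported for local files, mapped to a format name.
SUPPORTED_EXTENSIONS = {".txt": "text", ".csv": "csv", ".json": "json"}


def classify_source(source: str) -> Tuple[str, str]:
    """Hypothetical helper: decide how a string passed as `data` would be read."""
    if os.path.exists(source):
        # The string points at something on disk: treat it as a local file.
        ext = os.path.splitext(source)[1].lower()
        if ext not in SUPPORTED_EXTENSIONS:
            # Assumption: unsupported local files are rejected.
            raise ValueError(f"Unsupported file extension: {ext!r}")
        return ("local", SUPPORTED_EXTENSIONS[ext])
    # Not a local path: assume it is a dataset identifier on the Hub.
    return ("hub", source)
```

One consequence of this order of checks is that a mistyped local path is not reported as a missing file; it falls through and is interpreted as a Hub identifier.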
Attributes:
class_names | |
has_labels | Check if labels are available. |
- property class_names: List[str]
- property has_labels: bool
Check if labels are available.
- class cleanlab.datalab.internal.data.Label(*, data, label_name=None, map_to_int=True)
Bases: ABC
Class to represent labels in a dataset.
It stores the labels as a numpy array and maps them to integers if necessary. If a mapping is not necessary, e.g. for regression tasks, the mapping will be an empty dictionary.
- Parameters:
data (Dataset) – A Hugging Face Dataset object.
label_name (str) – Name of the label column in the dataset.
map_to_int (bool) – Whether to map the labels to integers, e.g. [0, 1, …, K-1] where K is the number of classes. If False, the labels are not mapped to integers, e.g. for regression tasks.
Attributes:
class_names | A list of class names that are present in the dataset. |
is_available | Check if labels are available. |
- property class_names: List[str]
A list of class names that are present in the dataset.
Without labels, this will return an empty list.
- property is_available: bool
Check if labels are available.
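How the mapping, class_names and is_available relate can be illustrated with a simplified stand-in. `SimpleLabel` is hypothetical and takes a plain sequence of values rather than a Dataset and column name; the `label_map` attribute name, the integer-to-class direction of the map, and the sorted ordering are assumptions made for the sketch.

```python
from typing import Any, Dict, List, Optional, Sequence


class SimpleLabel:
    """Hypothetical, simplified stand-in for the documented Label behavior."""

    def __init__(self, values: Optional[Sequence[Any]] = None, map_to_int: bool = True):
        self.values: List[Any] = [] if values is None else list(values)
        # Stays empty when no mapping is needed (e.g. regression) or no labels exist.
        self.label_map: Dict[int, Any] = {}
        if self.values and map_to_int:
            classes = sorted(set(self.values))
            to_int = {c: i for i, c in enumerate(classes)}
            self.label_map = dict(enumerate(classes))
            self.values = [to_int[v] for v in self.values]

    @property
    def is_available(self) -> bool:
        # Labels count as available when at least one value is stored.
        return len(self.values) > 0

    @property
    def class_names(self) -> List[str]:
        # Derived from the mapping, so it is empty whenever the mapping is empty.
        return [str(name) for name in self.label_map.values()]
```

With `SimpleLabel(["b", "a", "b"])` the stored values become integers and `class_names` lists the two classes; with `SimpleLabel()` both the mapping and `class_names` are empty and `is_available` is False.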
- class cleanlab.datalab.internal.data.MultiLabel(data, label_name, map_to_int)
Bases: Label
Attributes:
class_names | A list of class names that are present in the dataset. |
is_available | Check if labels are available. |
- property class_names: List[str]
A list of class names that are present in the dataset.
Without labels, this will return an empty list.
- property is_available: bool
Check if labels are available.
- class cleanlab.datalab.internal.data.MultiClass(data, label_name, map_to_int)
Bases: Label
Attributes:
class_names | A list of class names that are present in the dataset. |
is_available | Check if labels are available. |
- property class_names: List[str]
A list of class names that are present in the dataset.
Without labels, this will return an empty list.
- property is_available: bool
Check if labels are available.