datalab
Datalab offers a unified audit to detect all kinds of issues in data and labels.
Note
Using Datalab requires additional dependencies beyond the rest of the cleanlab
package. To install them, run:
$ pip install "cleanlab[datalab]"
For the development version of the package, install from source:
$ pip install "git+https://github.com/cleanlab/cleanlab.git#egg=cleanlab[datalab]"
Classes:
Datalab(data[, label_name, image_key, verbosity]): A single object to automatically detect all kinds of issues in datasets.
- class cleanlab.datalab.datalab.Datalab(data, label_name=None, image_key=None, verbosity=1)
Bases:
object
A single object to automatically detect all kinds of issues in datasets. This is how we recommend you interface with the cleanlab library if you want to audit the quality of your data and detect issues within it. If you have other specific goals (or are doing a less standard ML task not supported by Datalab), then consider using the other methods across the library. Datalab tracks intermediate state (e.g. data statistics) from certain cleanlab functions that can be re-used across other cleanlab functions for better efficiency.
- Parameters:
data (Union[Dataset, pd.DataFrame, dict, list, str]) – Dataset-like object that can be converted to a Hugging Face Dataset object. It should contain the labels for all examples, identified by a label_name column in the Dataset object.
Supported formats:
- datasets.Dataset
- pandas.DataFrame
- dict (keys are strings, values are arrays/lists of length N)
- list (list of dictionaries that each have the same keys)
- str: path to a local file (Text (.txt), CSV (.csv), JSON (.json)) or a dataset identifier on the Hugging Face Hub
label_name (str, optional) – The name of the label column in the dataset.
image_key (str, optional) – Optional key that can be specified for image datasets to point to the field containing the actual images themselves. If specified, additional image-specific issue types can be detected in the dataset. See the CleanVision package documentation for descriptions of these image-specific issue types.
verbosity (int, optional) – The higher the verbosity level, the more information Datalab prints when auditing a dataset. Valid values are 0 through 4. Default is 1.
Examples
>>> import datasets
>>> from cleanlab import Datalab
>>> data = datasets.load_dataset("glue", "sst2", split="train")
>>> datalab = Datalab(data, label_name="label")
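The dict and list formats in the supported formats for data describe the same table in two layouts. As a minimal sketch, the following converts a list of records into a dict of columns and checks that every column has the same length N. The helper names `records_to_columns` and `column_length` are hypothetical and are not part of cleanlab; the example rows are made up.

```python
from typing import Any, Dict, List


def records_to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of same-keyed dicts into a dict of equal-length lists (hypothetical helper)."""
    if not records:
        return {}
    keys = list(records[0].keys())
    for i, record in enumerate(records):
        if set(record.keys()) != set(keys):
            raise ValueError(f"record {i} does not have the same keys as record 0")
    return {key: [record[key] for record in records] for key in keys}


def column_length(columns: Dict[str, List[Any]]) -> int:
    """Return the shared column length N, or raise if the columns disagree."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError(f"columns have different lengths: {sorted(lengths)}")
    return lengths.pop() if lengths else 0


# Made-up rows in the "list of dictionaries" layout.
records = [
    {"text": "great movie", "label": 1},
    {"text": "dull plot", "label": 0},
]
columns = records_to_columns(records)  # the "dict" layout
num_examples = column_length(columns)
# Either layout could then be passed as `data`, e.g.:
# lab = Datalab(data=columns, label_name="label")
```

Validating the layout up front makes shape mistakes surface before the audit starts rather than during conversion.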
Attributes:
labels: Labels of the dataset, in a [0, 1, ..., K-1] format.
has_labels: Whether the dataset has labels.
class_names: Names of the classes in the dataset.
issues: Issues found in each example from the dataset.
issue_summary: Summary of issues found in the dataset and the overall severity of each type of issue.
info: Information and statistics about the dataset issues found.
Methods:
find_issues(*[, pred_probs, features, ...]): Checks the dataset for all sorts of common issues in real-world data (in both labels and feature values).
report(*[, num_examples, verbosity, ...]): Prints informative summary of all issues.
get_issues([issue_name]): Use this after finding issues to see which examples suffer from which types of issues.
get_issue_summary([issue_name]): Summarize the issues found in dataset of a particular type, including how severe this type of issue is overall across the dataset.
get_info([issue_name]): Get the info for the issue_name key.
list_possible_issue_types(): Returns a list of all registered issue types.
list_default_issue_types(): Returns a list of the issue types that are run by default when find_issues() is called without specifying issue_types.
save(path[, force]): Saves this Datalab object to file (all files are in folder at path/).
load(path[, data]): Loads Datalab object from a previously saved folder.
- property labels: ndarray
Labels of the dataset, in a [0, 1, …, K-1] format.
- Return type:
ndarray
- property has_labels: bool
Whether the dataset has labels.
- Return type:
bool
- property class_names: List[str]
Names of the classes in the dataset.
If the dataset has no labels, returns an empty list.
- Return type:
List[str]
- find_issues(*, pred_probs=None, features=None, knn_graph=None, issue_types=None)
Checks the dataset for all sorts of common issues in real-world data (in both labels and feature values).
You can use Datalab to find issues in your data, utilizing any model you have already trained. This method only interacts with your model via its predictions or embeddings (and other functions thereof). The more of these inputs you provide, the more types of issues Datalab can detect in your dataset/labels. If you provide a subset of these inputs, Datalab will output what insights it can based on the limited information from your model.
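One common way to obtain out-of-sample predicted class probabilities from a model is K-fold cross-validation, where each example is scored by a model that did not see it during training. The sketch below assumes scikit-learn is installed; the toy data and the logistic regression model are stand-ins for your own dataset and model, and Datalab does not require either.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy two-class data (made up): 20 points near (0, 0) and 20 points near (3, 3).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(20, 2)),
    rng.normal(loc=3.0, scale=1.0, size=(20, 2)),
])
y = np.array([0] * 20 + [1] * 20)

# Each row is predicted by a model fit on the other folds only,
# so the probabilities are out-of-sample for every example.
pred_probs = cross_val_predict(
    LogisticRegression(), X, y, cv=5, method="predict_proba"
)

# Shape is (num_examples, K) as find_issues() expects.
num_examples, num_classes = pred_probs.shape
# lab = Datalab(data={"X": X, "y": y}, label_name="y")
# lab.find_issues(pred_probs=pred_probs)
```

In-sample probabilities from a model evaluated on its own training data tend to be overconfident, which is why held-out predictions are preferred for detecting label issues.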
Note
This method acts as a wrapper around the IssueFinder.find_issues method, where the core logic for issue detection is implemented.
Note
The issues are saved in the self.issues attribute, but are not returned.
- Parameters:
pred_probs (Optional[ndarray]) – Out-of-sample predicted class probabilities made by the model for every example in the dataset. To best detect label issues, provide this input obtained from the most accurate model you can produce.
If provided, this must be a 2D array with shape (num_examples, K) where K is the number of classes in the dataset.
features (Optional[np.ndarray]) – Feature embeddings (vector representations) of every example in the dataset.
If provided, this must be a 2D array with shape (num_examples, num_features).
knn_graph (Optional[csr_matrix]) – Sparse matrix representing distances between examples in the dataset in a k nearest neighbor graph.
If provided, this must be a square CSR matrix with shape (num_examples, num_examples) and (k*num_examples) non-zero entries (k is the number of nearest neighbors considered for each example) evenly distributed across the rows. The non-zero entries must be the distances between the corresponding examples. Self-distances must be omitted (i.e. the diagonal must be all zeros and the k nearest neighbors of each example must not include itself).
For any duplicated examples i,j whose distance is 0, there should be an explicit zero stored in the matrix, i.e. knn_graph[i,j] = 0.
If both knn_graph and features are provided, the knn_graph will take precedence. If knn_graph is not provided, it is constructed based on the provided features. If neither knn_graph nor features is provided, certain issue types like (near) duplicates will not be considered.
issue_types (Optional[Dict[str, Any]]) – Collection specifying which types of issues to consider in audit and any non-default parameter settings to use. If unspecified, a default set of issue types and recommended parameter settings is considered.
This is a dictionary of dictionaries, where the keys are the issue types of interest and the values are dictionaries of parameter values that control how each type of issue is detected (only for advanced users). More specifically, the values are constructor keyword arguments passed to the corresponding IssueManager, which is responsible for detecting the particular issue type.
Examples
Here are some ways to provide inputs to find_issues():
- Passing pred_probs:
>>> from sklearn.linear_model import LogisticRegression
>>> import numpy as np
>>> from cleanlab import Datalab
>>> X = np.array([[0, 1], [1, 1], [2, 2], [2, 0]])
>>> y = np.array([0, 1, 1, 0])
>>> clf = LogisticRegression(random_state=0).fit(X, y)
>>> pred_probs = clf.predict_proba(X)  # in-sample here for brevity; prefer out-of-sample predictions in practice
>>> lab = Datalab(data={"X": X, "y": y}, label_name="y")
>>> lab.find_issues(pred_probs=pred_probs)
- Passing features:
>>> import numpy as np
>>> from cleanlab import Datalab
>>> X = np.array([[0, 1], [1, 1], [2, 2], [2, 0]])
>>> y = np.array([0, 1, 1, 0])
>>> lab = Datalab(data={"X": X, "y": y}, label_name="y")
>>> lab.find_issues(features=X)
Note
You can pass both pred_probs and features to find_issues() for a more comprehensive audit.
- Passing a knn_graph:
>>> from sklearn.neighbors import NearestNeighbors
>>> import numpy as np
>>> from cleanlab import Datalab
>>> X = np.array([[0, 1], [1, 1], [2, 2], [2, 0]])
>>> y = np.array([0, 1, 1, 0])
>>> nbrs = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(X)
>>> knn_graph = nbrs.kneighbors_graph(mode="distance")
>>> knn_graph  # Pass this to Datalab
<4x4 sparse matrix of type '<class 'numpy.float64'>'
        with 8 stored elements in Compressed Sparse Row format>
>>> knn_graph.toarray()  # DO NOT PASS knn_graph.toarray() to Datalab, only pass the sparse matrix itself
array([[0.        , 1.        , 2.23606798, 0.        ],
       [1.        , 0.        , 1.41421356, 0.        ],
       [0.        , 1.41421356, 0.        , 2.        ],
       [0.        , 1.41421356, 2.        , 0.        ]])
>>> lab = Datalab(data={"X": X, "y": y}, label_name="y")
>>> lab.find_issues(knn_graph=knn_graph)
- Configuring issue types:
Suppose you want to only consider label issues. Just pass a dictionary with the key “label” and an empty dictionary as the value (to use default label issue parameters).
>>> issue_types = {"label": {}}
>>> # lab.find_issues(pred_probs=pred_probs, issue_types=issue_types)
If you are an advanced user who wants greater control, you can pass keyword arguments to the issue manager that handles the label issues. For example, if you want to pass the keyword argument “clean_learning_kwargs” to the constructor of the LabelIssueManager, you would pass:
>>> issue_types = {
...     "label": {
...         "clean_learning_kwargs": {
...             "prune_method": "prune_by_noise_rate",
...         },
...     },
... }
>>> # lab.find_issues(pred_probs=pred_probs, issue_types=issue_types)
- Return type:
None
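The structural requirements on knn_graph stated in the find_issues() parameters (square CSR matrix, exactly k stored entries per row, no self-distances, explicit zeros for exact duplicates) can be checked before running an audit. The sketch below is an illustration only: `build_knn_graph` and `check_knn_graph` are hypothetical helpers, not cleanlab functions, and the brute-force distance computation assumes a small dataset and Euclidean distance.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse


def build_knn_graph(features: np.ndarray, k: int) -> csr_matrix:
    """Brute-force Euclidean kNN distance graph (hypothetical helper, small data only)."""
    n = features.shape[0]
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # never pick an example as its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k closest other examples per row
    data = np.take_along_axis(dists, neighbors, axis=1).ravel()
    indptr = np.arange(0, n * k + 1, k)  # exactly k stored entries per row
    # Building from (data, indices, indptr) keeps explicit zeros for exact duplicates.
    return csr_matrix((data, neighbors.ravel(), indptr), shape=(n, n))


def check_knn_graph(graph) -> int:
    """Return k if `graph` meets the requirements above, else raise ValueError."""
    if not issparse(graph) or graph.format != "csr":
        raise ValueError("knn_graph must be a scipy CSR sparse matrix, not a dense array")
    n_rows, n_cols = graph.shape
    if n_rows != n_cols:
        raise ValueError("knn_graph must be square")
    per_row = np.diff(graph.indptr)  # stored entries per row, explicit zeros included
    if per_row.size == 0 or per_row[0] == 0 or np.any(per_row != per_row[0]):
        raise ValueError("every row must store the same number k >= 1 of neighbors")
    row_ids = np.repeat(np.arange(n_rows), per_row)
    if np.any(row_ids == graph.indices):
        raise ValueError("self-distances must be omitted from knn_graph")
    return int(per_row[0])


X = np.array([[0, 1], [1, 1], [2, 2], [2, 0]], dtype=float)
knn_graph = build_knn_graph(X, k=2)
k = check_knn_graph(knn_graph)
# lab.find_issues(knn_graph=knn_graph)
```

Counting stored entries through `indptr` instead of counting non-zero values matters here, because a duplicate pair contributes a stored distance of exactly 0.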
- report(*, num_examples=5, verbosity=None, include_description=True, show_summary_score=False)
Prints informative summary of all issues.
- Parameters:
num_examples (int) – Number of examples to show for each type of issue. The report shows the top num_examples instances in the dataset that suffer the most from each type of issue.
verbosity (Optional[int]) – Higher verbosity levels add more information to the report.
include_description (bool) – Whether or not to include a description of each issue type in the report. Consider setting this to False once you’re familiar with how each issue type is defined.
See also
For advanced usage, see documentation for the Reporter class.
- Return type:
None
- property issues: DataFrame
Issues found in each example from the dataset.
- Return type:
DataFrame
- property issue_summary: DataFrame
Summary of issues found in the dataset and the overall severity of each type of issue.
This is a wrapper around the DataIssues.issue_summary attribute.
Examples
If checks for “label” and “outlier” issues were run, then the issue summary will look something like this:
>>> datalab.issue_summary
issue_type  score
outlier     0.123
label       0.456
- Return type:
DataFrame
- property info: Dict[str, Dict[str, Any]]
Information and statistics about the dataset issues found.
This is a wrapper around the DataIssues.info attribute.
Examples
If checks for “label” and “outlier” issues were run, then the info will look something like this:
>>> datalab.info
{
    "label": {
        "given_labels": [0, 1, 0, 1, 1, 1, 1, 1, 0, 1, ...],
        "predicted_label": [0, 0, 0, 1, 0, 1, 0, 1, 0, 1, ...],
        ...,
    },
    "outlier": {
        "nearest_neighbor": [3, 7, 1, 2, 8, 4, 5, 9, 6, 0, ...],
        "distance_to_nearest_neighbor": [0.123, 0.789, 0.456, ...],
        ...,
    },
}
- Return type:
Dict[str, Dict[str, Any]]
- get_issues(issue_name=None)
Use this after finding issues to see which examples suffer from which types of issues.
Note
This is a wrapper around the DataIssues.get_issues method.
- Parameters:
issue_name (str or None) – The type of issue to focus on. If None, returns the full DataFrame summarizing all of the types of issues detected in each example from the dataset.
- Raises:
ValueError – If issue_name is not a type of issue previously considered in the audit.
- Return type:
DataFrame
- Returns:
specific_issues – A DataFrame where each row corresponds to an example from the dataset and columns specify: whether this example exhibits a particular type of issue, and how severely (via a numeric quality score where lower values indicate more severe instances of the issue). The quality scores lie between 0 and 1 and are directly comparable between examples (for the same issue type), but not across different issue types.
Additional columns may be present in the DataFrame depending on the type of issue specified.
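As a sketch of how the returned DataFrame might be used, the snippet below ranks the flagged examples from most to least severe. The DataFrame is built by hand to stand in for the result of `lab.get_issues("label")`; the column names `is_label_issue` and `label_score` are assumptions about the layout, and the values are made up.

```python
import pandas as pd

# Hand-built stand-in for lab.get_issues("label"); one row per example.
# Column names are assumed for illustration.
issues = pd.DataFrame(
    {
        "is_label_issue": [False, True, False, True],
        "label_score": [0.93, 0.12, 0.78, 0.05],
    }
)

# Keep only flagged examples, then sort so the lowest (most severe) scores come first.
flagged = issues[issues["is_label_issue"]].sort_values("label_score")

# Row indices of the examples to review first.
worst_indices = flagged.index.tolist()
print(worst_indices)  # [3, 1]
```

Because scores are comparable only within one issue type, a ranking like this should be done per issue type, not across the combined DataFrame.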
- get_issue_summary(issue_name=None)
Summarize the issues found in dataset of a particular type, including how severe this type of issue is overall across the dataset.
Note
This is a wrapper around the DataIssues.get_issue_summary method.
- Parameters:
issue_name (Optional[str]) – Name of the issue type to summarize. If None, summarizes each of the different issue types previously considered in the audit.
- Return type:
DataFrame
- Returns:
issue_summary – DataFrame where each row corresponds to a type of issue, and columns quantify: the number of examples in the dataset estimated to exhibit this type of issue, and the overall severity of the issue across the dataset (via a numeric quality score where lower values indicate that the issue is overall more severe). The quality scores lie between 0 and 1 and are directly comparable between multiple datasets (for the same issue type), but not across different issue types.
- get_info(issue_name=None)
Get the info for the issue_name key.
This function is used to get the info for a specific issue_name. If the info is not computed yet, it will raise an error.
Note
This is a wrapper around the DataIssues.get_info method.
- Parameters:
issue_name (Optional[str]) – The issue name for which the info is required.
- Return type:
Dict[str, Any]
- Returns:
info – The info for the issue_name.
- static list_possible_issue_types()
Returns a list of all registered issue types.
Any issue type that is not in this list cannot be used in the find_issues() method.
Note
This method is a wrapper around IssueFinder.list_possible_issue_types.
See also
REGISTRY: All available issue types and their corresponding issue managers can be found here.
- Return type:
List[str]
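Since an issue type outside this list cannot be used in find_issues(), it can help to check an issue_types dictionary against the list before starting an audit. The sketch below uses a hypothetical helper, `split_issue_types`, that is not part of cleanlab; the `possible` list is an illustrative stand-in for the value returned by `Datalab.list_possible_issue_types()`.

```python
from typing import Any, Dict, Iterable, List, Tuple


def split_issue_types(
    requested: Dict[str, Dict[str, Any]], possible: Iterable[str]
) -> Tuple[Dict[str, Dict[str, Any]], List[str]]:
    """Separate requested issue types into usable ones and unknown names (hypothetical helper)."""
    allowed = set(possible)
    known = {name: kwargs for name, kwargs in requested.items() if name in allowed}
    unknown = sorted(name for name in requested if name not in allowed)
    return known, unknown


# Illustrative stand-in; in practice use: possible = Datalab.list_possible_issue_types()
possible = ["label", "outlier", "near_duplicate"]

# "lable" is a deliberate typo to show how unknown names are reported.
requested = {"label": {}, "lable": {}}
known, unknown = split_issue_types(requested, possible)
if unknown:
    print(f"Unknown issue types: {unknown}")
# lab.find_issues(pred_probs=pred_probs, issue_types=known)
```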
- static list_default_issue_types()
Returns a list of the issue types that are run by default when find_issues() is called without specifying issue_types.
Note
This method is a wrapper around IssueFinder.list_default_issue_types.
See also
REGISTRY: All available issue types and their corresponding issue managers can be found here.
- Return type:
List[str]
- save(path, force=False)
Saves this Datalab object to file (all files are in folder at path/). We do not guarantee that a saved Datalab can be loaded by future versions of cleanlab.
- Parameters:
path (str) – Folder in which all information about this Datalab should be saved.
force (bool) – If True, overwrites any existing files in the folder at path. Use this with caution!
Note
You have to save the Dataset yourself separately if you want it saved to file.
- Return type:
None
- static load(path, data=None)
Loads Datalab object from a previously saved folder.
- Parameters:
path – Path to the folder previously specified in Datalab.save().
data – The dataset used to originally construct the Datalab. Remember that the dataset is not saved as part of the Datalab; you must save/load the data separately.
- Return type:
Datalab
- Returns:
datalab – A Datalab object that is identical to the one originally saved.