imagelab
An internal wrapper around the Imagelab class from the CleanVision package that incorporates it into Datalab. This allows low-quality images to be detected alongside other issues in computer vision datasets. The methods and classes in this module are intended for internal use only.
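Functions:
create_imagelab(dataset, image_key) – Creates Imagelab instance for running CleanVision checks.
Classes:
ImagelabDataIssuesAdapter(data, strategy) – Class that collects and stores information and statistics on issues found in a dataset.
CorrelationVisualizer() – Class to visualize images corresponding to the extreme (minimum and maximum) individual scores for each of the detected correlated properties.
CorrelationReporter(data_issues, imagelab) – Class to report spurious correlations between image features and class labels detected in the data.
ImagelabReporterAdapter(data_issues, imagelab, task[, ...])
ImagelabIssueFinderAdapter(datalab, task, verbosity)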
- cleanlab.datalab.internal.adapter.imagelab.create_imagelab(dataset, image_key)
Creates an Imagelab instance for running CleanVision checks. CleanVision checks are currently supported only for Hugging Face datasets.
- Parameters:
dataset (datasets.Dataset) – Hugging Face dataset used by Imagelab.
image_key (str) – Key for the image feature in the Hugging Face dataset.
- Return type:
Optional[Imagelab]
- Returns:
Imagelab
- class cleanlab.datalab.internal.adapter.imagelab.ImagelabDataIssuesAdapter(data, strategy)
Bases: DataIssues
Class that collects and stores information and statistics on issues found in a dataset.
- Parameters:
data (Data) – The data object for which the issues are being collected.
strategy (Type[_InfoStrategy]) – Strategy used for processing info dictionaries.
issues (pd.DataFrame) – Stores information about each individual issue found in the data, on a per-example basis.
issue_summary (pd.DataFrame) – Summarizes the overall statistics for each issue type.
info (dict) – A dictionary that contains information and statistics about the data and each issue type.
Methods:
collect_issues_from_imagelab(imagelab, ...) – Collect results from Imagelab and update datalab.issues and datalab.issue_summary.
get_info([issue_name]) – Return type: Dict[str, Any]
collect_issues_from_issue_manager(issue_manager) – Collects results from an IssueManager and updates the corresponding attributes of the Datalab object.
collect_statistics(issue_manager) – Update the statistics in the info dictionary.
get_issue_summary([issue_name]) – Summarize the issues found in the dataset of a particular type, including how severe this type of issue is overall across the dataset.
get_issues([issue_name]) – Use this after finding issues to see which examples suffer from which types of issues.
set_health_score() – Set the health score for the dataset based on the issue summary.
Attributes:
statistics – Returns the statistics dictionary.
- collect_issues_from_imagelab(imagelab, issue_types)
Collects results from Imagelab and updates datalab.issues and datalab.issue_summary.
- Parameters:
imagelab (Imagelab) – Imagelab instance that runs all the checks for image issue types.
- Return type:
None
- collect_issues_from_issue_manager(issue_manager)
Collects results from an IssueManager and updates the corresponding attributes of the Datalab object.
This includes self.issues, self.issue_summary, and self.info.
- Parameters:
issue_manager (IssueManager) – IssueManager object to collect results from.
- Return type:
None
- collect_statistics(issue_manager)
Update the statistics in the info dictionary.
- Parameters:
statistics – A dictionary of statistics to add/update in the info dictionary.
- Return type:
None
Examples
A common use case is to reuse the KNN-graph across multiple issue managers. To avoid recomputing the KNN-graph for each issue manager, we can pass it as a statistic to the issue managers.
>>> from scipy.sparse import csr_matrix
>>> weighted_knn_graph = csr_matrix(...)
>>> issue_manager_that_computes_knn_graph = ...
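The caching idea behind this use case can be sketched without any cleanlab objects. The snippet below is a minimal illustration, not the library's implementation: the module-level `info` dictionary, the helper `get_or_compute_knn_graph`, and the placeholder graph are all hypothetical stand-ins. It only shows how a value stored under `info["statistics"]` can be reused instead of recomputed.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for the info dictionary held by the data-issues object.
info: Dict[str, Dict[str, Any]] = {"statistics": {}}


def collect_statistics(new_statistics: Dict[str, Any]) -> None:
    """Add or overwrite entries in the shared statistics dictionary."""
    info["statistics"].update(new_statistics)


def get_or_compute_knn_graph(compute: Callable[[], Any]) -> Any:
    """Reuse a cached graph if one exists; otherwise compute and cache it."""
    stats = info["statistics"]
    if "weighted_knn_graph" not in stats:
        collect_statistics({"weighted_knn_graph": compute()})
    return stats["weighted_knn_graph"]


calls = []  # records how many times the expensive computation runs


def expensive_knn() -> list:
    calls.append(1)
    # Placeholder for a sparse distance matrix (assumption for illustration).
    return [[0.0, 0.1], [0.1, 0.0]]


first = get_or_compute_knn_graph(expensive_knn)   # computes and caches
second = get_or_compute_knn_graph(expensive_knn)  # served from the cache
```

Here the second call returns the cached object, so the expensive function runs only once.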
- get_issue_summary(issue_name=None)
Summarize the issues found in the dataset of a particular type, including how severe this type of issue is overall across the dataset.
- Parameters:
issue_name (Optional[str]) – Name of the issue type to summarize. If None, summarizes each of the different issue types previously considered in the audit.
- Return type:
DataFrame
- Returns:
issue_summary – DataFrame where each row corresponds to a type of issue, and columns quantify: the number of examples in the dataset estimated to exhibit this type of issue, and the overall severity of the issue across the dataset (via a numeric quality score where lower values indicate that the issue is overall more severe).
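To make the shape of such a summary row concrete, here is a rough sketch that builds one from per-example records. Plain dictionaries stand in for the DataFrame, and the aggregation used (count of flagged examples, mean of per-example scores) is an assumption for illustration rather than a statement of how the library computes its summary. The `summarize` helper and the sample rows are hypothetical.

```python
from statistics import mean

# Hypothetical per-example results for a single issue type ("dark").
rows = [
    {"is_dark_issue": True, "dark_score": 0.25},
    {"is_dark_issue": False, "dark_score": 0.75},
    {"is_dark_issue": False, "dark_score": 0.5},
]


def summarize(records, issue_name):
    """Build one summary row: how many examples are flagged and an overall score."""
    flagged = sum(1 for row in records if row[f"is_{issue_name}_issue"])
    # Assumption: overall severity is taken as the mean per-example score.
    overall = mean(row[f"{issue_name}_score"] for row in records)
    return {"issue_type": issue_name, "score": overall, "num_issues": flagged}


summary = summarize(rows, "dark")
```

A lower `score` in the resulting row would indicate that the issue type is more severe across the dataset.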
- get_issues(issue_name=None)
Use this after finding issues to see which examples suffer from which types of issues.
- Parameters:
issue_name (str or None) – The type of issue to focus on. If None, returns full DataFrame summarizing all of the types of issues detected in each example from the dataset.
- Raises:
ValueError – If issue_name is not a type of issue previously considered in the audit.
- Return type:
DataFrame
- Returns:
specific_issues – A DataFrame where each row corresponds to an example from the dataset and columns specify: whether this example exhibits a particular type of issue and how severely (via a numeric quality score where lower values indicate more severe instances of the issue). Additional columns may be present in the DataFrame depending on the type of issue specified.
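The lookup behavior described above can be illustrated with a small stand-alone sketch. It uses a list of dictionaries in place of the DataFrame, and the `is_<name>_issue` / `<name>_score` column naming is assumed here for illustration. The function below is a hypothetical stand-in, not the library method.

```python
# Hypothetical per-example records; the real method returns a pandas DataFrame.
issues = [
    {"is_dark_issue": True, "dark_score": 0.125,
     "is_blurry_issue": False, "blurry_score": 0.875},
    {"is_dark_issue": False, "dark_score": 0.75,
     "is_blurry_issue": True, "blurry_score": 0.25},
]
known_issue_types = {"dark", "blurry"}


def get_issues(issue_name=None):
    """Return all records, or only the columns for one issue type."""
    if issue_name is None:
        return list(issues)
    if issue_name not in known_issue_types:
        raise ValueError(f"No issues available for issue type: {issue_name!r}")
    columns = (f"is_{issue_name}_issue", f"{issue_name}_score")
    return [{col: row[col] for col in columns} for row in issues]


dark_only = get_issues("dark")
```

Requesting an issue type that was never part of the audit raises `ValueError`, mirroring the documented behavior.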
- set_health_score()
Set the health score for the dataset based on the issue summary.
Currently, the health score is the mean of the scores for each issue type.
- Return type:
None
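As a quick worked example of the rule stated above (the health score is the mean of the per-issue-type scores), consider the sketch below. The summary rows and the `health_score` helper are hypothetical. The real method stores its result on the object rather than returning it.

```python
from statistics import mean

# Hypothetical issue summary: one overall score per issue type.
issue_summary = [
    {"issue_type": "dark", "score": 0.5},
    {"issue_type": "blurry", "score": 0.75},
    {"issue_type": "label", "score": 1.0},
]


def health_score(summary):
    """Return the mean of the per-issue-type scores."""
    return mean(row["score"] for row in summary)


score = health_score(issue_summary)  # (0.5 + 0.75 + 1.0) / 3
```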
- property statistics: Dict[str, Any]
Returns the statistics dictionary.
Shorthand for self.info["statistics"].
- class cleanlab.datalab.internal.adapter.imagelab.CorrelationVisualizer
Bases: object
Class to visualize images corresponding to the extreme (minimum and maximum) individual scores for each of the detected correlated properties.
Methods:
visualize(images, title_info[, ncols, cell_size]) – Return type: None
- class cleanlab.datalab.internal.adapter.imagelab.CorrelationReporter(data_issues, imagelab)
Bases: object
Class to report spurious correlations between image features and class labels detected in the data.
If no spurious correlations are found, the class will not report anything.
Methods:
report() – Reports spurious correlations between image features and class labels detected in the data, if any are found.
- class cleanlab.datalab.internal.adapter.imagelab.ImagelabReporterAdapter(data_issues, imagelab, task, verbosity=1, include_description=True, show_summary_score=False, show_all_issues=False)
Bases: Reporter
Methods:
report(num_examples) – Prints a report about identified issues in the data.
get_report(num_examples) – Constructs a report about identified issues in the data.
- report(num_examples)
Prints a report about identified issues in the data.
- Parameters:
num_examples (int) – The number of examples to include in the report for each issue type.
- Return type:
None
- get_report(num_examples)
Constructs a report about identified issues in the data.
- Parameters:
num_examples (int) – The number of examples to include in the report for each issue type.
- Return type:
str
- Returns:
report_str – A string containing the report.
Examples
>>> from cleanlab.datalab.internal.report import Reporter
>>> reporter = Reporter(data_issues=data_issues, include_description=False)
>>> report_str = reporter.get_report(num_examples=5)
>>> print(report_str)
- class cleanlab.datalab.internal.adapter.imagelab.ImagelabIssueFinderAdapter(datalab, task, verbosity)
Bases: IssueFinder
Methods:
find_issues(*[, pred_probs, features, ...]) – Checks the dataset for all sorts of common issues in real-world data (in both labels and feature values).
get_available_issue_types(**kwargs) – Returns a dictionary of issue types that can be used in the Datalab.find_issues method.
- find_issues(*, pred_probs=None, features=None, knn_graph=None, issue_types=None)
Checks the dataset for all sorts of common issues in real-world data (in both labels and feature values).
You can use Datalab to find issues in your data, utilizing any model you have already trained. This method only interacts with your model via its predictions or embeddings (and other functions thereof). The more of these inputs you provide, the more types of issues Datalab can detect in your dataset/labels. If you provide a subset of these inputs, Datalab will output what insights it can based on the limited information from your model.
Note
This method is not intended to be used directly. Instead, use the Datalab.find_issues method.
Note
The issues are saved in the self.datalab.data_issues.issues attribute, but are not returned.
- Parameters:
pred_probs (Optional[ndarray]) – Out-of-sample predicted class probabilities made by the model for every example in the dataset. To best detect label issues, provide this input obtained from the most accurate model you can produce.
If provided for classification, this must be a 2D array with shape (num_examples, K) where K is the number of classes in the dataset. If provided for regression, this must be a 1D array with shape (num_examples,).
features (Optional[np.ndarray]) – Feature embeddings (vector representations) of every example in the dataset.
If provided, this must be a 2D array with shape (num_examples, num_features).
knn_graph (Optional[csr_matrix]) – Sparse matrix representing distances between examples in the dataset in a k nearest neighbor graph.
For details, refer to the documentation of the same argument in Datalab.find_issues.
issue_types (Optional[Dict[str, Any]]) – Collection specifying which types of issues to consider in audit and any non-default parameter settings to use. If unspecified, a default set of issue types and recommended parameter settings is considered.
This is a dictionary of dictionaries, where the keys are the issue types of interest and the values are dictionaries of parameter values that control how each type of issue is detected (only for advanced users). More specifically, the values are constructor keyword arguments passed to the corresponding IssueManager, which is responsible for detecting the particular issue type.
- Return type:
None
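The "dictionary of dictionaries" shape of issue_types may be easier to see in a small sketch. The issue type names (`dark`, `blurry`), the `threshold` parameter, the `DEFAULT_ISSUE_TYPES` mapping, and the `resolve_issue_types` helper are all hypothetical and chosen only for illustration. Which issue types and parameters are actually accepted depends on the installed cleanlab and CleanVision versions.

```python
from typing import Any, Dict, Optional

# Hypothetical defaults: each issue type maps to its (possibly empty) parameter dict.
DEFAULT_ISSUE_TYPES: Dict[str, Dict[str, Any]] = {"dark": {}, "blurry": {}}


def resolve_issue_types(
    issue_types: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Fall back to the default set when nothing is specified; validate the shape otherwise."""
    if issue_types is None:
        return {name: dict(params) for name, params in DEFAULT_ISSUE_TYPES.items()}
    for name, params in issue_types.items():
        if not isinstance(params, dict):
            raise TypeError(f"Parameters for {name!r} must be given as a dict")
    return {name: dict(params) for name, params in issue_types.items()}


# Only audit one issue type, with a non-default (hypothetical) parameter value.
requested = resolve_issue_types({"blurry": {"threshold": 0.2}})
defaults = resolve_issue_types()
```

In this sketch, specifying issue_types restricts the audit to the listed keys, while omitting it selects the default set.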
- get_available_issue_types(**kwargs)
Returns a dictionary of issue types that can be used in the Datalab.find_issues method.