imagelab

An internal wrapper around the Imagelab class from the CleanVision package that integrates it into Datalab. This allows low-quality images to be detected alongside other issues in computer vision datasets. The methods and classes in this module are intended for internal use only.

Functions:

create_imagelab(dataset, image_key)

Creates an Imagelab instance for running CleanVision checks.

handle_spurious_correlations(*, ...)

Return type:

Dict[str, Any]

Classes:

ImagelabDataIssuesAdapter(data, strategy)

Class that collects and stores information and statistics on issues found in a dataset.

CorrelationVisualizer()

Class to visualize images corresponding to the extreme (minimum and maximum) individual scores for each of the detected correlated properties.

CorrelationReporter(data_issues, imagelab)

Class to report spurious correlations between image features and class labels detected in the data.

ImagelabReporterAdapter(data_issues, ...[, ...])

ImagelabIssueFinderAdapter(datalab, task, ...)

cleanlab.datalab.internal.adapter.imagelab.create_imagelab(dataset, image_key)

Creates an Imagelab instance for running CleanVision checks. CleanVision checks are currently supported only for Hugging Face datasets.

Parameters:
  • dataset (datasets.Dataset) – Hugging Face dataset used by Imagelab.

  • image_key (str) – Key of the image feature in the Hugging Face dataset.

Return type:

Optional[Imagelab]

Returns:

Imagelab

class cleanlab.datalab.internal.adapter.imagelab.ImagelabDataIssuesAdapter(data, strategy)

Bases: DataIssues

Class that collects and stores information and statistics on issues found in a dataset.

Parameters:
  • data (Data) – The data object for which the issues are being collected.

  • strategy (Type[_InfoStrategy]) – Strategy used for processing info dictionaries.

  • issues (pd.DataFrame) – Stores information about each individual issue found in the data, on a per-example basis.

  • issue_summary (pd.DataFrame) – Summarizes the overall statistics for each issue type.

  • info (dict) – A dictionary that contains information and statistics about the data and each issue type.

Methods:

filter_based_on_max_prevalence(...)

collect_issues_from_imagelab(imagelab, ...)

Collects results from Imagelab and updates datalab.issues and datalab.issue_summary.

get_info([issue_name])

Return type:

Dict[str, Any]

collect_issues_from_issue_manager(issue_manager)

Collects results from an IssueManager and updates the corresponding attributes of the Datalab object.

collect_statistics(issue_manager)

Update the statistics in the info dictionary.

get_issue_summary([issue_name])

Summarize the issues of a particular type found in the dataset, including how severe this type of issue is overall across the dataset.

get_issues([issue_name])

Use this after finding issues to see which examples suffer from which types of issues.

set_health_score()

Set the health score for the dataset based on the issue summary.

Attributes:

statistics

Returns the statistics dictionary.

filter_based_on_max_prevalence(issue_summary, max_num)

collect_issues_from_imagelab(imagelab, issue_types)

Collects results from Imagelab and updates datalab.issues and datalab.issue_summary.

Parameters:

imagelab (Imagelab) – Imagelab instance that runs all the checks for image issue types.

Return type:

None

get_info(issue_name=None)

Return type:

Dict[str, Any]

collect_issues_from_issue_manager(issue_manager)

Collects results from an IssueManager and updates the corresponding attributes of the Datalab object.

This includes:
  • self.issues

  • self.issue_summary

  • self.info

Parameters:

issue_manager (IssueManager) – IssueManager object to collect results from.

Return type:

None

collect_statistics(issue_manager)

Update the statistics in the info dictionary.

Parameters:

statistics – A dictionary of statistics to add/update in the info dictionary.

Return type:

None

Examples

A common use case is to reuse the KNN-graph across multiple issue managers. To avoid recomputing the KNN-graph for each issue manager, we can pass it as a statistic to the issue managers.

>>> from scipy.sparse import csr_matrix
>>> weighted_knn_graph = csr_matrix(...)
>>> issue_manager_that_computes_knn_graph = ...
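
As a rough illustration of the reuse described above, the sketch below stores a sparse KNN graph under a statistics entry of a plain dictionary so that a later consumer can look it up instead of recomputing it. The info dictionary, the weighted_knn_graph key, and the toy distances are assumptions made for this example; it does not reproduce the adapter's internals.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy 1-nearest-neighbor graph over 3 examples: entry (i, j) holds the
# distance from example i to its neighbor j (values are made up).
rows = np.array([0, 1, 2])
cols = np.array([1, 0, 1])
dists = np.array([0.5, 0.5, 1.2])
weighted_knn_graph = csr_matrix((dists, (rows, cols)), shape=(3, 3))

# Hypothetical info dictionary that mirrors the info["statistics"] layout.
info = {"statistics": {}}
info["statistics"].update({"weighted_knn_graph": weighted_knn_graph})

# A later consumer fetches the stored graph rather than rebuilding it.
reused = info["statistics"].get("weighted_knn_graph")
num_neighbors = reused.getnnz(axis=1)  # stored neighbors per example
```

Passing the same sparse object around avoids a second nearest-neighbor search, which is typically the expensive step.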

get_issue_summary(issue_name=None)

Summarize the issues of a particular type found in the dataset, including how severe this type of issue is overall across the dataset.

Parameters:

issue_name (Optional[str]) – Name of the issue type to summarize. If None, summarizes each of the different issue types previously considered in the audit.

Return type:

DataFrame

Returns:

issue_summary – DataFrame where each row corresponds to a type of issue, and columns quantify: the number of examples in the dataset estimated to exhibit this type of issue, and the overall severity of the issue across the dataset (via a numeric quality score where lower values indicate that the issue is overall more severe).
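
To make the shape of such a summary concrete, here is a minimal pandas sketch. The column names (issue_type, score, num_issues), the values, and the summarize helper are assumptions for illustration; they are not taken from this page.

```python
import pandas as pd

# Hypothetical summary: one row per issue type (column names are assumed).
issue_summary = pd.DataFrame(
    {
        "issue_type": ["label", "dark", "blurry"],
        "score": [0.80, 0.95, 0.90],
        "num_issues": [12, 3, 5],
    }
)

def summarize(summary, issue_name=None):
    """Return every row, or only the row for issue_name when one is given."""
    if issue_name is None:
        return summary
    return summary[summary["issue_type"] == issue_name].reset_index(drop=True)

dark_summary = summarize(issue_summary, "dark")

# Lower scores mean a more severe issue overall, so sort ascending.
most_severe = issue_summary.sort_values("score").iloc[0]["issue_type"]
```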

get_issues(issue_name=None)

Use this after finding issues to see which examples suffer from which types of issues.

Parameters:

issue_name (str or None) – The type of issue to focus on. If None, returns full DataFrame summarizing all of the types of issues detected in each example from the dataset.

Raises:

ValueError – If issue_name is not a type of issue previously considered in the audit.

Return type:

DataFrame

Returns:

specific_issues – A DataFrame where each row corresponds to an example from the dataset and columns specify: whether this example exhibits a particular type of issue and how severely (via a numeric quality score where lower values indicate more severe instances of the issue).

Additional columns may be present in the DataFrame depending on the type of issue specified.
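
The per-example layout and the ValueError behavior described above can be sketched with pandas as follows. The is_<issue>_issue and <issue>_score column naming, the toy values, and the select_issue helper are assumptions for illustration rather than the method's actual implementation.

```python
import pandas as pd

# Hypothetical per-example results: a boolean flag and a quality score
# per issue type (lower score = more severe instance).
issues = pd.DataFrame(
    {
        "is_label_issue": [False, True, False],
        "label_score": [0.91, 0.08, 0.77],
        "is_dark_issue": [False, False, True],
        "dark_score": [0.88, 0.93, 0.12],
    }
)

def select_issue(issues, issue_name=None):
    """Return all columns, or only those belonging to one issue type."""
    if issue_name is None:
        return issues
    columns = [f"is_{issue_name}_issue", f"{issue_name}_score"]
    if not set(columns).issubset(issues.columns):
        raise ValueError(f"No results found for issue type '{issue_name}'.")
    return issues[columns]

label_issues = select_issue(issues, "label")

# Keep only flagged examples, most severe first.
flagged = label_issues[label_issues["is_label_issue"]].sort_values("label_score")
```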

set_health_score()

Set the health score for the dataset based on the issue summary.

Currently, the health score is the mean of the scores for each issue type.

Return type:

None
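
The averaging described above can be sketched in a few lines of pandas. The issue_summary frame and its score column are assumed for illustration.

```python
import pandas as pd

# Hypothetical issue summary with one overall score per issue type.
issue_summary = pd.DataFrame(
    {"issue_type": ["label", "dark", "blurry"], "score": [0.5, 1.0, 0.75]}
)

# Health score as the unweighted mean of the per-issue-type scores.
health_score = float(issue_summary["score"].mean())
```

Because the mean is unweighted, each issue type contributes equally regardless of how many examples it flags.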

property statistics: Dict[str, Any]

Returns the statistics dictionary.

Shorthand for self.info["statistics"].

class cleanlab.datalab.internal.adapter.imagelab.CorrelationVisualizer

Bases: object

Class to visualize images corresponding to the extreme (minimum and maximum) individual scores for each of the detected correlated properties.

Methods:

visualize(images, title_info[, ncols, cell_size])

Return type:

None

visualize(images, title_info, ncols=2, cell_size=(2, 2))

Return type:

None

class cleanlab.datalab.internal.adapter.imagelab.CorrelationReporter(data_issues, imagelab)

Bases: object

Class to report spurious correlations between image features and class labels detected in the data.

If no spurious correlations are found, the class will not report anything.

Methods:

report()

Reports spurious correlations between image features and class labels detected in the data, if any are found.

report()

Reports spurious correlations between image features and class labels detected in the data, if any are found.

Return type:

None

class cleanlab.datalab.internal.adapter.imagelab.ImagelabReporterAdapter(data_issues, imagelab, task, verbosity=1, include_description=True, show_summary_score=False, show_all_issues=False)

Bases: Reporter

Methods:

report(num_examples)

Prints a report about identified issues in the data.

get_report(num_examples)

Constructs a report about identified issues in the data.

report(num_examples)

Prints a report about identified issues in the data.

Parameters:

num_examples (int) – The number of examples to include in the report for each issue type.

Return type:

None

get_report(num_examples)

Constructs a report about identified issues in the data.

Parameters:

num_examples (int) – The number of examples to include in the report for each issue type.

Return type:

str

Returns:

report_str – A string containing the report.

Examples

>>> from cleanlab.datalab.internal.report import Reporter
>>> reporter = Reporter(data_issues=data_issues, include_description=False)
>>> report_str = reporter.get_report(num_examples=5)
>>> print(report_str)

class cleanlab.datalab.internal.adapter.imagelab.ImagelabIssueFinderAdapter(datalab, task, verbosity)

Bases: IssueFinder

Methods:

find_issues(*[, pred_probs, features, ...])

Checks the dataset for all sorts of common issues in real-world data (in both labels and feature values).

get_available_issue_types(**kwargs)

Returns a dictionary of issue types that can be used in the Datalab.find_issues method.

find_issues(*, pred_probs=None, features=None, knn_graph=None, issue_types=None)

Checks the dataset for all sorts of common issues in real-world data (in both labels and feature values).

You can use Datalab to find issues in your data, utilizing any model you have already trained. This method only interacts with your model via its predictions or embeddings (and other functions thereof). The more of these inputs you provide, the more types of issues Datalab can detect in your dataset/labels. If you provide a subset of these inputs, Datalab will output what insights it can based on the limited information from your model.

Note

This method is not intended to be used directly. Instead, use the Datalab.find_issues method.

Note

The issues are saved in the self.datalab.data_issues.issues attribute, but are not returned.

Parameters:
  • pred_probs (Optional[ndarray]) –

    Out-of-sample predicted class probabilities made by the model for every example in the dataset. To best detect label issues, provide this input obtained from the most accurate model you can produce.

    If provided for classification, this must be a 2D array with shape (num_examples, K) where K is the number of classes in the dataset. If provided for regression, this must be a 1D array with shape (num_examples,).

  • features (Optional[np.ndarray]) –

    Feature embeddings (vector representations) of every example in the dataset.

    If provided, this must be a 2D array with shape (num_examples, num_features).

  • knn_graph (Optional[csr_matrix]) –

    Sparse matrix representing distances between examples in the dataset in a k-nearest-neighbor graph.

    For details, refer to the documentation of the same argument in Datalab.find_issues.

  • issue_types (Optional[Dict[str, Any]]) –

    Collection specifying which types of issues to consider in the audit and any non-default parameter settings to use. If unspecified, a default set of issue types and recommended parameter settings is considered.

    This is a dictionary of dictionaries, where the keys are the issue types of interest and the values are dictionaries of parameter values that control how each type of issue is detected (only for advanced users). More specifically, the values are constructor keyword arguments passed to the corresponding IssueManager, which is responsible for detecting the particular issue type.

    See also

    IssueManager

Return type:

None
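
The expected shapes of these inputs can be illustrated with a small NumPy/SciPy sketch. The data is random, the graph uses a single neighbor per example for brevity, and the issue type names ("label", "dark") are only sample keys with empty parameter dictionaries; none of this is prescribed by the adapter.

```python
import numpy as np
from scipy.sparse import csr_matrix

num_examples, num_classes, num_features = 4, 3, 5
rng = np.random.default_rng(0)

# pred_probs (classification): 2D array of shape (num_examples, K),
# where each row is a probability distribution over the K classes.
logits = rng.normal(size=(num_examples, num_classes))
pred_probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# features: 2D array of shape (num_examples, num_features).
features = rng.normal(size=(num_examples, num_features))

# knn_graph: sparse matrix of distances to each example's nearest
# neighbor (k = 1 here), built from brute-force pairwise distances.
pairwise = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=2)
np.fill_diagonal(pairwise, np.inf)  # exclude self-matches
idx = np.arange(num_examples)
nearest = pairwise.argmin(axis=1)
knn_graph = csr_matrix(
    (pairwise[idx, nearest], (idx, nearest)),
    shape=(num_examples, num_examples),
)

# issue_types: dictionary of dictionaries; an empty inner dictionary
# requests the default settings for that issue type.
issue_types = {"label": {}, "dark": {}}
```

In normal use these objects would be passed as keyword arguments to Datalab.find_issues rather than to this adapter directly.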

get_available_issue_types(**kwargs)

Returns a dictionary of issue types that can be used in the Datalab.find_issues method.

cleanlab.datalab.internal.adapter.imagelab.handle_spurious_correlations(*, imagelab_issues, labels, threshold, **_)

Return type:

Dict[str, Any]