data_issues#

Module for the DataIssues class, which serves as a central repository for storing information and statistics about issues found in a dataset.

It collects information from various IssueManager instances and keeps track of each issue, a summary for each type of issue, related information and statistics about the issues.

The collected information can be accessed using the cleanlab.datalab.internal.data_issues.DataIssues.get_info method. We recommend using that method instead of this module, which is just intended for internal use.

Classes:

DataIssues(data, strategy)

Class that collects and stores information and statistics on issues found in a dataset.

Functions:

get_data_statistics(data)

Get statistics about a dataset.

class cleanlab.datalab.internal.data_issues.DataIssues(data, strategy)[source]#

Bases: object

Class that collects and stores information and statistics on issues found in a dataset.

Parameters:

data (Data) – The data object for which the issues are being collected.
strategy (Type[_InfoStrategy]) – Strategy used for processing info dictionaries.

issues#

Stores information about each individual issue found in the data, on a per-example basis.

Type: pd.DataFrame

issue_summary#

Summarizes the overall statistics for each issue type.

Type: pd.DataFrame

info#

A dictionary that contains information and statistics about the data and each issue type.

Type: dict

Methods:

get_info([issue_name])	Return type: Dict[str, Any]
get_issues([issue_name])	Use this after finding issues to see which examples suffer from which types of issues.
get_issue_summary([issue_name])	Summarize the issues of a particular type found in the dataset, including how severe this type of issue is overall across the dataset.
collect_statistics(issue_manager)	Update the statistics in the info dictionary.
collect_issues_from_issue_manager(issue_manager)	Collects results from an IssueManager and updates the corresponding attributes of the Datalab object.
collect_issues_from_imagelab(imagelab, ...)	Return type: None
set_health_score()	Set the health score for the dataset based on the issue summary.

Attributes:

statistics

Returns the statistics dictionary.

get_info(issue_name=None)[source]#

Return type: Dict[str, Any]

property statistics: Dict[str, Any]#

Returns the statistics dictionary.

Shorthand for self.info["statistics"].

get_issues(issue_name=None)[source]#

Use this after finding issues to see which examples suffer from which types of issues.

Parameters:

issue_name (str or None) – The type of issue to focus on. If None, returns the full DataFrame summarizing all of the types of issues detected in each example from the dataset.

Raises:

ValueError – If issue_name is not a type of issue previously considered in the audit.

Return type:

DataFrame

Returns:

specific_issues – A DataFrame where each row corresponds to an example from the dataset and columns specify: whether this example exhibits a particular type of issue and how severely (via a numeric quality score where lower values indicate more severe instances of the issue).

Additional columns may be present in the DataFrame depending on the type of issue specified.
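
As a rough illustration of the shape described above, the sketch below builds a small stand-in DataFrame by hand and filters it with plain pandas. The column names is_label_issue and label_score and all values are assumptions made for this example; they are not output from a real audit.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by get_issues("label").
# Column names and values are illustrative assumptions.
specific_issues = pd.DataFrame(
    {
        "is_label_issue": [False, True, False, True],
        "label_score": [0.95, 0.10, 0.80, 0.30],
    }
)

# Keep only the flagged examples, most severe (lowest score) first.
flagged = specific_issues[specific_issues["is_label_issue"]].sort_values("label_score")
flagged_indices = flagged.index.tolist()
print(flagged_indices)
```

Sorting in ascending order puts the most severe instances first, since lower quality scores indicate more severe issues.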

get_issue_summary(issue_name=None)[source]#

Summarize the issues of a particular type found in the dataset, including how severe this type of issue is overall across the dataset.

Parameters: issue_name (Optional[str]) – Name of the issue type to summarize. If None, summarizes each of the different issue types previously considered in the audit.
Return type: DataFrame
Returns: issue_summary – DataFrame where each row corresponds to a type of issue, and columns quantify: the number of examples in the dataset estimated to exhibit this type of issue, and the overall severity of the issue across the dataset (via a numeric quality score where lower values indicate that the issue is overall more severe).
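
To make the one-row-per-issue-type layout concrete, here is a minimal sketch using a hand-built DataFrame. The column names issue_type, score and num_issues, the values, and the helper summarize are all assumptions for illustration; this is not the library's implementation.

```python
from typing import Optional

import pandas as pd

# Hypothetical stand-in for an issue summary; names and values are assumed.
issue_summary = pd.DataFrame(
    {
        "issue_type": ["label", "outlier"],
        "score": [0.75, 0.5],
        "num_issues": [12, 3],
    }
)


def summarize(summary: pd.DataFrame, issue_name: Optional[str] = None) -> pd.DataFrame:
    """Return every row, or only the row for issue_name (hypothetical helper)."""
    if issue_name is None:
        return summary
    row = summary[summary["issue_type"] == issue_name]
    if row.empty:
        raise ValueError(f"Issue type {issue_name!r} not found in the summary.")
    return row.reset_index(drop=True)


label_summary = summarize(issue_summary, "label")
all_summaries = summarize(issue_summary)
```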

collect_statistics(issue_manager)[source]#

Update the statistics in the info dictionary.

Parameters: issue_manager – Issue manager whose statistics are added to, or updated in, the info dictionary.
Return type: None

Examples

A common use case is to reuse the KNN-graph across multiple issue managers. To avoid recomputing the KNN-graph for each issue manager, we can pass it as a statistic to the issue managers.

>>> from scipy.sparse import csr_matrix
>>> weighted_knn_graph = csr_matrix(...)
>>> issue_manager_that_computes_knn_graph = ...
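
The update step can be pictured as a plain dictionary merge. The sketch below is not the library's implementation: merge_statistics and the keys num_examples and weighted_knn_graph are names assumed here for illustration, and a nested list stands in for the sparse matrix.

```python
from typing import Any, Dict


def merge_statistics(info: Dict[str, Any], new_statistics: Dict[str, Any]) -> None:
    """Add or overwrite entries under info["statistics"] in place (hypothetical helper)."""
    info.setdefault("statistics", {}).update(new_statistics)


# Assumed starting state of the info dictionary.
info: Dict[str, Any] = {"statistics": {"num_examples": 3}}

# Placeholder for a KNN graph computed once by some issue manager.
knn_graph_placeholder = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.2], [0.0, 0.2, 0.0]]
merge_statistics(info, {"weighted_knn_graph": knn_graph_placeholder})

# A later consumer can reuse the stored graph instead of recomputing it.
reused = info["statistics"]["weighted_knn_graph"]
```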

collect_issues_from_issue_manager(issue_manager)[source]#

Collects results from an IssueManager and updates the corresponding attributes of the Datalab object.

This includes:

- self.issues
- self.issue_summary
- self.info

Parameters: issue_manager (IssueManager) – IssueManager object to collect results from.
Return type: None
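
One way to picture the three updates is the pandas sketch below. It is only an illustration under stated assumptions: the variables manager_issues, manager_summary and manager_info stand in for whatever an issue manager reports, and the column names, values and the threshold key are invented for the example.

```python
import pandas as pd

# Collector state before the update (assumed contents).
issues = pd.DataFrame(
    {"is_label_issue": [False, True, False], "label_score": [0.9, 0.1, 0.8]}
)
issue_summary = pd.DataFrame(
    {"issue_type": ["label"], "score": [0.6], "num_issues": [1]}
)
info = {"statistics": {}}

# Hypothetical results reported by one issue manager.
manager_name = "outlier"
manager_issues = pd.DataFrame(
    {"is_outlier_issue": [False, False, True], "outlier_score": [0.7, 0.8, 0.2]}
)
manager_summary = pd.DataFrame(
    {"issue_type": ["outlier"], "score": [0.5], "num_issues": [1]}
)
manager_info = {"threshold": 0.5}

# 1. Per-example columns are added next to the existing ones (aligned on index).
issues = issues.join(manager_issues)

# 2. The summary row for this issue type replaces any earlier row of the same type.
kept_rows = issue_summary[issue_summary["issue_type"] != manager_name]
issue_summary = pd.concat([kept_rows, manager_summary], ignore_index=True)

# 3. Issue-specific info is stored under the issue name.
info[manager_name] = manager_info
```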

collect_issues_from_imagelab(imagelab, issue_types)[source]#

Return type: None

set_health_score()[source]#

Set the health score for the dataset based on the issue summary.

Currently, the health score is the mean of the scores for each issue type.

Return type: None
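
The averaging described above can be sketched in a few lines of pandas. The issue types, the score column name and the values are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical issue summary; the "score" column name and values are assumed.
issue_summary = pd.DataFrame(
    {
        "issue_type": ["label", "outlier", "near_duplicate"],
        "score": [0.5, 0.75, 1.0],
    }
)

# Unweighted mean over issue types.
health_score = float(issue_summary["score"].mean())
print(f"health score: {health_score:.2f}")
```

Because this is a plain mean, each issue type counts equally: a type that affects few examples weighs as much as one that affects many.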

cleanlab.datalab.internal.data_issues.get_data_statistics(data)[source]#

Get statistics about a dataset.

This function is called to initialize the “statistics” info in all Datalab objects.

Parameters: data (Data) – Data object containing the dataset.
Return type: Dict[str, Any]