data_issues

Module for the DataIssues class, which serves as a central repository for storing information and statistics about issues found in a dataset.

It collects information from various IssueManager instances and keeps track of each issue, a summary for each type of issue, related information and statistics about the issues.

The collected information can be accessed using the Datalab.get_info method. We recommend using that method instead of this module, which is intended for internal use only.

Classes:

DataIssues(data)

Class that collects and stores information and statistics on issues found in a dataset.

Functions:

get_data_statistics(data)

Get statistics about a dataset.

class cleanlab.datalab.internal.data_issues.DataIssues(data)

Bases: object

Class that collects and stores information and statistics on issues found in a dataset.

Parameters:
  • data (Data) – The data object for which the issues are being collected.

Instance attributes:
  • issues (pd.DataFrame) – Stores information about each individual issue found in the data, on a per-example basis.

  • issue_summary (pd.DataFrame) – Summarizes the overall statistics for each issue type.

  • info (dict) – A dictionary that contains information and statistics about the data and each issue type.

Attributes:

statistics

Returns the statistics dictionary.

Methods:

get_issues([issue_name])

Use this after finding issues to see which examples suffer from which types of issues.

get_issue_summary([issue_name])

Summarize the issues of a particular type found in the dataset, including how severe this type of issue is overall across the dataset.

get_info([issue_name])

Get the info for the issue_name key.

collect_statistics(issue_manager)

Update the statistics in the info dictionary.

collect_issues_from_issue_manager(issue_manager)

Collects results from an IssueManager and updates the corresponding attributes of the Datalab object.

set_health_score()

Set the health score for the dataset based on the issue summary.

property statistics: Dict[str, Any]

Returns the statistics dictionary.

Shorthand for self.info["statistics"].

Return type:

Dict[str, Any]

get_issues(issue_name=None)

Use this after finding issues to see which examples suffer from which types of issues.

Parameters:

issue_name (str or None) – The type of issue to focus on. If None, returns the full DataFrame summarizing all of the types of issues detected in each example from the dataset.

Raises:

ValueError – If issue_name is not a type of issue previously considered in the audit.

Return type:

DataFrame

Returns:

specific_issues – A DataFrame where each row corresponds to an example from the dataset and columns specify: whether this example exhibits a particular type of issue and how severely (via a numeric quality score where lower values indicate more severe instances of the issue).

Additional columns may be present in the DataFrame depending on the type of issue specified.
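
As a rough illustration of how the returned DataFrame might be consumed, the sketch below builds a small stand-in DataFrame by hand rather than running an audit. The column names is_label_issue and label_score, and all values, are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by get_issues("label").
# Column names and values are illustrative assumptions, not real audit output.
specific_issues = pd.DataFrame(
    {
        "is_label_issue": [False, True, False, True],
        "label_score": [0.92, 0.08, 0.75, 0.21],
    }
)

# Keep only the flagged examples, then sort so the most severe instances
# (lowest quality score) come first.
flagged = specific_issues[specific_issues["is_label_issue"]].sort_values("label_score")
worst_indices = flagged.index.tolist()
```

Sorting in ascending order follows from the convention stated above that lower quality scores indicate more severe instances of the issue.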

get_issue_summary(issue_name=None)

Summarize the issues of a particular type found in the dataset, including how severe this type of issue is overall across the dataset.

Parameters:

issue_name (Optional[str]) – Name of the issue type to summarize. If None, summarizes each of the different issue types previously considered in the audit.

Return type:

DataFrame

Returns:

issue_summary – DataFrame where each row corresponds to a type of issue, and columns quantify: the number of examples in the dataset estimated to exhibit this type of issue, and the overall severity of the issue across the dataset (via a numeric quality score where lower values indicate that the issue is overall more severe).

get_info(issue_name=None)

Get the info for the issue_name key.

This function is used to get the info for a specific issue_name. If the info is not computed yet, it will raise an error.

Parameters:

issue_name (Optional[str]) – The issue name for which the info is required.

Return type:

Dict[str, Any]

Returns:

info – The info for the issue_name.
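
The lookup behavior described above can be sketched with a plain dictionary. The helper lookup_info is hypothetical, and two details are assumptions not confirmed by this page: that the error raised is a ValueError, and that passing None returns the whole info dictionary.

```python
from typing import Any, Dict, Optional


def lookup_info(info: Dict[str, Any], issue_name: Optional[str] = None) -> Dict[str, Any]:
    """Hypothetical helper mimicking a keyed lookup into an info dictionary."""
    if issue_name is None:
        # Assumption: no issue name means "return everything collected so far".
        return info
    if issue_name not in info:
        # Assumption: ValueError is used here; the real exception type may differ.
        raise ValueError(f"No info for issue_name={issue_name!r}; it has not been computed yet.")
    return info[issue_name]


# Illustrative data only.
info = {"statistics": {"num_examples": 100}, "label": {"num_label_issues": 7}}
label_info = lookup_info(info, "label")
```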

collect_statistics(issue_manager)

Update the statistics in the info dictionary.

Parameters:

issue_manager (IssueManager) – The issue manager whose statistics are added to or updated in the info dictionary.

Examples

A common use case is to reuse the KNN-graph across multiple issue managers. To avoid recomputing the KNN-graph for each issue manager, we can pass it as a statistic to the issue managers.

>>> from scipy.sparse import csr_matrix
>>> weighted_knn_graph = csr_matrix(...)
>>> issue_manager_that_computes_knn_graph = ...

Return type:

None
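
A minimal sketch of the update step, using plain dictionaries in place of real objects. The helper merge_statistics and the shape of manager_info are assumptions for illustration; the string value stands in for a sparse KNN graph.

```python
from typing import Any, Dict


def merge_statistics(info: Dict[str, Any], manager_info: Dict[str, Any]) -> None:
    """Hypothetical helper: merge a manager's statistics into info["statistics"] in place."""
    new_stats = manager_info.get("statistics", {})
    if new_stats:
        # Existing keys are overwritten; keys absent from new_stats are kept.
        info.setdefault("statistics", {}).update(new_stats)


# Illustrative data only.
info = {"statistics": {"num_examples": 100}}
manager_info = {"statistics": {"weighted_knn_graph": "placeholder for a sparse matrix"}}
merge_statistics(info, manager_info)
```

After the merge, a later issue manager could read the shared graph from the statistics entry instead of recomputing it.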

collect_issues_from_issue_manager(issue_manager)

Collects results from an IssueManager and updates the corresponding attributes of the Datalab object.

This includes:
  • self.issues

  • self.issue_summary

  • self.info

Parameters:

issue_manager (IssueManager) – IssueManager object to collect results from.

Return type:

None

set_health_score()

Set the health score for the dataset based on the issue summary.

Currently, the health score is the mean of the scores for each issue type.

Return type:

None
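
The averaging described above can be sketched without any library objects. The rows below are a hypothetical stand-in for the issue summary; the keys issue_type, score, and num_issues and all values are assumptions for illustration.

```python
from statistics import fmean

# Hypothetical issue summary, one row per issue type (illustrative values).
issue_summary = [
    {"issue_type": "label", "score": 0.8, "num_issues": 12},
    {"issue_type": "outlier", "score": 0.6, "num_issues": 30},
    {"issue_type": "near_duplicate", "score": 1.0, "num_issues": 0},
]

# Health score as the unweighted mean of the per-issue-type scores.
health_score = fmean(row["score"] for row in issue_summary)
```

Because the mean is unweighted in this sketch, an issue type affecting many examples counts the same as one affecting few.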

cleanlab.datalab.internal.data_issues.get_data_statistics(data)

Get statistics about a dataset.

This function is called to initialize the “statistics” info in all Datalab objects.

Parameters:

data (Data) – Data object containing the dataset.

Return type:

Dict[str, Any]