data_issues
Module for the DataIssues class, which serves as a central repository for storing information and statistics about issues found in a dataset.
It collects information from various IssueManager instances and keeps track of each issue, a summary for each type of issue, and related information and statistics about the issues.
The collected information can be accessed using the get_info method.
Classes:
DataIssues(data) | Class that collects and stores information and statistics on issues found in a dataset.
- class cleanlab.datalab.data_issues.DataIssues(data)
Bases: object
Class that collects and stores information and statistics on issues found in a dataset.
- Parameters:
data (Data) – The data object for which the issues are being collected.
issues (pd.DataFrame) – Stores information about each individual issue found in the data, on a per-example basis.
issue_summary (pd.DataFrame) – Summarizes the overall statistics for each issue type.
info (dict) – A dictionary that contains information and statistics about the data and each issue type.
Attributes:
statistics | Returns the statistics dictionary.
Methods:
get_issues([issue_name]) | Use this after finding issues to see which examples suffer from which types of issues.
get_issue_summary([issue_name]) | Summarize the issues of a particular type found in the dataset, including how severe this type of issue is overall across the dataset.
get_info([issue_name]) | Get the info for the issue_name key.
collect_statistics_from_issue_manager(issue_manager) | Update the statistics in the info dictionary.
collect_results_from_issue_manager(issue_manager) | Collects results from an IssueManager and updates the corresponding attributes of the Datalab object.
Set the health score for the dataset based on the issue summary.
- property statistics: Dict[str, Any]
Returns the statistics dictionary.
Shorthand for self.info["statistics"].
- Return type:
Dict[str, Any]
- get_issues(issue_name=None)
Use this after finding issues to see which examples suffer from which types of issues.
- Parameters:
issue_name (str or None) – The type of issue to focus on. If None, returns the full DataFrame summarizing all of the types of issues detected in each example from the dataset.
- Raises:
ValueError – If issue_name is not a type of issue previously considered in the audit.
- Return type:
DataFrame
- Returns:
specific_issues – A DataFrame where each row corresponds to an example from the dataset and columns specify: whether this example exhibits a particular type of issue and how severely (via a numeric quality score where lower values indicate more severe instances of the issue).
Additional columns may be present in the DataFrame depending on the type of issue specified.
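The selection behaviour described for get_issues can be illustrated with a small, dependency-free sketch. It is not cleanlab's implementation: select_issue_columns is a hypothetical helper, a dict of column lists stands in for the pd.DataFrame, and the is_<name>_issue / <name>_score column names are assumed here for illustration only.

```python
from typing import Dict, List, Optional

# Stand-in for the per-example issues table: column name -> one value per example.
# The column naming scheme is an assumption made for this sketch.
Table = Dict[str, List]


def select_issue_columns(issues: Table, issue_name: Optional[str] = None) -> Table:
    """Return all columns, or only the columns belonging to one issue type."""
    if issue_name is None:
        return dict(issues)
    wanted = [f"is_{issue_name}_issue", f"{issue_name}_score"]
    missing = [col for col in wanted if col not in issues]
    if missing:
        # Mirrors the documented ValueError for issue types that were never audited.
        raise ValueError(f"No columns found for issue type {issue_name!r}")
    return {col: issues[col] for col in wanted}


# Made-up values for three examples; a lower score means a more severe issue.
issues = {
    "is_label_issue": [False, True, False],
    "label_score": [0.92, 0.08, 0.75],
    "is_outlier_issue": [False, False, True],
    "outlier_score": [0.88, 0.64, 0.03],
}

label_only = select_issue_columns(issues, "label")
```

Passing no issue name returns every column, while an issue type that was never audited raises ValueError, matching the behaviour documented above.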
- get_issue_summary(issue_name=None)
Summarize the issues of a particular type found in the dataset, including how severe this type of issue is overall across the dataset.
- Parameters:
issue_name (Optional[str]) – Name of the issue type to summarize. If None, summarizes each of the different issue types previously considered in the audit.
- Return type:
DataFrame
- Returns:
issue_summary – DataFrame where each row corresponds to a type of issue, and columns quantify: the number of examples in the dataset estimated to exhibit this type of issue, and the overall severity of the issue across the dataset (via a numeric quality score where lower values indicate that the issue is overall more severe).
- get_info(issue_name=None)
Get the info for the issue_name key.
This function is used to get the info for a specific issue_name. If the info is not computed yet, it will raise an error.
- Parameters:
issue_name (Optional[str]) – The issue name for which the info is required.
- Return type:
Dict[str, Any]
- Returns:
info – The info for the issue_name.
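The lookup-or-raise behaviour described for get_info might look roughly like the sketch below. lookup_info is a hypothetical helper, not cleanlab code. Two details are assumptions: the documentation only says "an error", so ValueError is a guess, and returning the whole dictionary when issue_name is None is not stated above. The keys and values in the sample dictionary are made up.

```python
from typing import Any, Dict, Optional


def lookup_info(
    info: Dict[str, Dict[str, Any]], issue_name: Optional[str] = None
) -> Dict[str, Any]:
    """Return the entry for one issue type, or the whole dictionary if no name is given."""
    if issue_name is None:
        # Assumption: with no name, hand back everything that has been collected.
        return info
    if issue_name not in info:
        # Assumption: the error type is not specified in the docs; ValueError is used here.
        raise ValueError(f"No info has been computed for {issue_name!r}")
    return info[issue_name]


# Illustrative contents only.
info = {
    "statistics": {"num_examples": 3},
    "label": {"num_label_issues": 1},
}

label_info = lookup_info(info, "label")
```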
- collect_statistics_from_issue_manager(issue_manager)
Update the statistics in the info dictionary.
- Parameters:
issue_manager (IssueManager) – The issue manager whose statistics are added to or updated in the info dictionary.
Examples
A common use case is to reuse the KNN-graph across multiple issue managers. To avoid recomputing the KNN-graph for each issue manager, we can pass it as a statistic to the issue managers.
>>> from scipy.sparse import csr_matrix
>>> weighted_knn_graph = csr_matrix(...)
>>> issue_manager_that_computes_knn_graph = ...
- Return type:
None
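The statistics-sharing idea above can be sketched without any cleanlab dependency. Everything named here is hypothetical: FakeIssueManager stands in for a real IssueManager, the sketch assumes the manager exposes its statistics under info["statistics"], and a nested list stands in for the sparse KNN graph.

```python
from typing import Any, Dict


class FakeIssueManager:
    """Hypothetical stand-in for an IssueManager that exposes an info dict."""

    def __init__(self, info: Dict[str, Any]) -> None:
        self.info = info


def merge_statistics(info: Dict[str, Any], issue_manager: FakeIssueManager) -> None:
    """Copy the manager's statistics into info["statistics"], updating existing keys."""
    statistics = issue_manager.info.get("statistics", {})
    info.setdefault("statistics", {}).update(statistics)


# Shared info dictionary that already holds one statistic.
shared_info: Dict[str, Any] = {"statistics": {"num_examples": 3}}

# A manager that computed a (toy) KNN graph and reports it as a statistic.
manager = FakeIssueManager(
    {"statistics": {"weighted_knn_graph": [[0.0, 0.4], [0.4, 0.0]]}}
)

merge_statistics(shared_info, manager)
```

After the merge, a later issue manager could read weighted_knn_graph from the shared statistics instead of recomputing it.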
- collect_results_from_issue_manager(issue_manager)
Collects results from an IssueManager and updates the corresponding attributes of the Datalab object.
This includes:
- self.issues
- self.issue_summary
- self.info
- Parameters:
issue_manager (IssueManager) – IssueManager object to collect results from.
- Return type:
None