data_issues
Module for the DataIssues class, which serves as a central repository for storing information and statistics about issues found in a dataset.
It collects information from various IssueManager instances and keeps track of each issue, a summary for each type of issue, and related information and statistics about the issues.
The collected information can be accessed using the cleanlab.datalab.internal.data_issues.DataIssues.get_info method. We recommend using that method instead of this module, which is intended for internal use only.
Classes:
DataIssues(data, strategy) – Class that collects and stores information and statistics on issues found in a dataset.
Functions:
|
Get statistics about a dataset. |
- class cleanlab.datalab.internal.data_issues.DataIssues(data, strategy)
Bases: object
Class that collects and stores information and statistics on issues found in a dataset.
- Parameters:
data (Data) – The data object for which the issues are being collected.
strategy (Type[_InfoStrategy]) – Strategy used for processing info dictionaries.
- issues
Stores information about each individual issue found in the data, on a per-example basis.
- Type: pd.DataFrame
- issue_summary
Summarizes the overall statistics for each issue type.
- Type: pd.DataFrame
- info
A dictionary that contains information and statistics about the data and each issue type.
- Type: dict
Methods:
get_info([issue_name]) – Return type: Dict[str, Any]
get_issues([issue_name]) – Use this after finding issues to see which examples suffer from which types of issues.
get_issue_summary([issue_name]) – Summarize the issues of a particular type found in the dataset, including how severe this type of issue is overall across the dataset.
collect_statistics(issue_manager) – Update the statistics in the info dictionary.
collect_issues_from_issue_manager(issue_manager) – Collects results from an IssueManager and updates the corresponding attributes of the Datalab object.
collect_issues_from_imagelab(imagelab, ...) – Return type: None
Set the health score for the dataset based on the issue summary.
Attributes:
statistics – Returns the statistics dictionary.
- property statistics: Dict[str, Any]
Returns the statistics dictionary.
Shorthand for self.info["statistics"].
- get_issues(issue_name=None)
Use this after finding issues to see which examples suffer from which types of issues.
- Parameters:
issue_name (str or None) – The type of issue to focus on. If None, returns the full DataFrame summarizing all of the types of issues detected in each example from the dataset.
- Raises:
ValueError – If issue_name is not a type of issue previously considered in the audit.
- Return type: DataFrame
- Returns:
specific_issues – A DataFrame where each row corresponds to an example from the dataset and columns specify: whether this example exhibits a particular type of issue and how severely (via a numeric quality score where lower values indicate more severe instances of the issue).
Additional columns may be present in the DataFrame depending on the type of issue specified.
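The contract described above can be illustrated on a toy table. This is a minimal sketch, not the library's implementation: `select_issue_columns` is a hypothetical helper, the data is made up, and the `is_<name>_issue` / `<name>_score` column naming is an assumption used for illustration.

```python
from typing import Optional

import pandas as pd


def select_issue_columns(issues: pd.DataFrame, issue_name: Optional[str] = None) -> pd.DataFrame:
    """Hypothetical helper mirroring the documented behaviour of get_issues."""
    if issue_name is None:
        # No issue type given: return the full per-example table.
        return issues
    # Assumed column naming convention, for illustration only.
    columns = [f"is_{issue_name}_issue", f"{issue_name}_score"]
    if any(col not in issues.columns for col in columns):
        raise ValueError(f"Issue type {issue_name!r} was not considered in the audit.")
    return issues[columns]


# Toy per-example table: one row per example, lower score = more severe.
issues = pd.DataFrame(
    {
        "is_label_issue": [False, True, False],
        "label_score": [0.92, 0.11, 0.78],
        "is_outlier_issue": [False, False, True],
        "outlier_score": [0.85, 0.66, 0.04],
    }
)

label_issues = select_issue_columns(issues, "label")
# Keep only flagged examples, most severe (lowest score) first.
flagged = label_issues[label_issues["is_label_issue"]].sort_values("label_score")
```

Sorting by the score column in ascending order puts the most severe instances first, since lower scores indicate more severe issues.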
- get_issue_summary(issue_name=None)
Summarize the issues of a particular type found in the dataset, including how severe this type of issue is overall across the dataset.
- Parameters:
issue_name (Optional[str]) – Name of the issue type to summarize. If None, summarizes each of the different issue types previously considered in the audit.
- Return type: DataFrame
- Returns:
issue_summary – DataFrame where each row corresponds to a type of issue, and columns quantify: the number of examples in the dataset estimated to exhibit this type of issue, and the overall severity of the issue across the dataset (via a numeric quality score where lower values indicate that the issue is overall more severe).
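To make the shape of such a summary concrete, the sketch below derives one from a toy per-example table. The helper `summarize_issues`, the column names (`issue_type`, `score`, `num_issues`), and the use of the mean score as the overall severity are all assumptions for illustration; the actual aggregation may differ by issue type.

```python
from typing import List

import pandas as pd


def summarize_issues(issues: pd.DataFrame, issue_names: List[str]) -> pd.DataFrame:
    """Hypothetical helper: build one summary row per issue type."""
    rows = []
    for name in issue_names:
        rows.append(
            {
                "issue_type": name,
                # Illustrative choice: overall severity as the mean per-example score.
                "score": float(issues[f"{name}_score"].mean()),
                # Number of examples flagged with this issue type.
                "num_issues": int(issues[f"is_{name}_issue"].sum()),
            }
        )
    return pd.DataFrame(rows)


# Toy per-example table (made-up values).
issues = pd.DataFrame(
    {
        "is_label_issue": [False, True, False],
        "label_score": [0.92, 0.11, 0.78],
        "is_outlier_issue": [False, False, True],
        "outlier_score": [0.85, 0.66, 0.04],
    }
)

summary = summarize_issues(issues, ["label", "outlier"])
# Restrict the summary to a single issue type.
outlier_summary = summary[summary["issue_type"] == "outlier"]
```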
- collect_statistics(issue_manager)
Update the statistics in the info dictionary.
- Parameters:
issue_manager – The issue manager whose statistics are added to or updated in the info dictionary.
- Return type: None
Examples
A common use case is to reuse the KNN graph across multiple issue managers. To avoid recomputing the KNN graph for each issue manager, we can pass it as a statistic to the issue managers.
>>> from scipy.sparse import csr_matrix
>>> weighted_knn_graph = csr_matrix(...)
>>> issue_manager_that_computes_knn_graph = ...
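The caching idea behind this method can be sketched with plain dictionaries. This is an illustration under stated assumptions rather than the library's code: `merge_statistics` is a hypothetical helper, the `"statistics"` key follows the `self.info["statistics"]` shorthand mentioned above, and a nested list stands in for a real sparse KNN graph.

```python
from typing import Any, Dict


def merge_statistics(info: Dict[str, Any], manager_info: Dict[str, Any]) -> None:
    """Hypothetical helper: copy a manager's "statistics" entry into a shared info dict."""
    statistics = manager_info.get("statistics", {})
    if statistics:
        info.setdefault("statistics", {}).update(statistics)


# Shared info dictionary with a toy, pre-existing statistic.
shared_info: Dict[str, Any] = {"statistics": {"num_examples": 3}}

# Stand-in for an expensive artifact such as a KNN graph (a nested list here).
manager_info = {"statistics": {"weighted_knn_graph": [[0.0, 0.2], [0.2, 0.0]]}}
merge_statistics(shared_info, manager_info)

# A later issue manager can look up the cached graph instead of recomputing it.
cached_graph = shared_info["statistics"].get("weighted_knn_graph")
```

Existing statistics are kept and only the keys supplied by the manager are added or overwritten, which is what makes the shared dictionary usable as a cache.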
- collect_issues_from_issue_manager(issue_manager)
Collects results from an IssueManager and updates the corresponding attributes of the Datalab object.
This includes:
- self.issues
- self.issue_summary
- self.info
- Parameters:
issue_manager (IssueManager) – IssueManager object to collect results from.
- Return type: None
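As a rough sketch of what updating the three attributes could look like, the example below uses two hypothetical stand-in classes, `ToyIssueManager` and `ToyStore`. Neither is part of the library, and the merge rules shown (overwrite per-example columns, replace the summary row for the same issue type, store info under the issue name) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

import pandas as pd


@dataclass
class ToyIssueManager:
    """Hypothetical stand-in for an IssueManager holding its results."""

    issue_name: str
    issues: pd.DataFrame
    summary: pd.DataFrame
    info: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ToyStore:
    """Hypothetical stand-in for the object that collects results."""

    issues: pd.DataFrame
    issue_summary: pd.DataFrame
    info: Dict[str, Any] = field(default_factory=dict)

    def collect(self, manager: ToyIssueManager) -> None:
        # Add or overwrite the manager's per-example columns.
        for column in manager.issues.columns:
            self.issues[column] = manager.issues[column]
        # Drop any earlier summary row for this issue type, then append the new one.
        kept = self.issue_summary[self.issue_summary["issue_type"] != manager.issue_name]
        self.issue_summary = pd.concat([kept, manager.summary], ignore_index=True)
        # Store the manager's info under its issue name.
        self.info[manager.issue_name] = manager.info


store = ToyStore(
    issues=pd.DataFrame({"is_label_issue": [False, True, False], "label_score": [0.92, 0.11, 0.78]}),
    issue_summary=pd.DataFrame({"issue_type": ["label"], "score": [0.6], "num_issues": [1]}),
)
manager = ToyIssueManager(
    issue_name="outlier",
    issues=pd.DataFrame({"is_outlier_issue": [False, False, True], "outlier_score": [0.85, 0.66, 0.04]}),
    summary=pd.DataFrame({"issue_type": ["outlier"], "score": [0.52], "num_issues": [1]}),
    info={"threshold": 0.1},
)
store.collect(manager)
store.collect(manager)  # Collecting again replaces, rather than duplicates, the summary row.
```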