issue_manager
Classes:
- IssueManager(datalab, **_): Base class for managing data issues of a particular type in a Datalab.
- class cleanlab.datalab.issue_manager.issue_manager.IssueManager(datalab, **_)
Bases: ABC
Base class for managing data issues of a particular type in a Datalab.
For each example in a dataset, the IssueManager for a particular type of issue should compute:
- A numeric severity score between 0 and 1, with values near 0 indicating severe instances of the issue.
- A boolean is_issue value, which is True if we believe this example suffers from the issue in question. is_issue may be determined by thresholding the severity score (with a reasonable threshold value determined a priori), or via some other means (e.g. Confident Learning for flagging label issues).
The IssueManager should also report:
- A global value between 0 and 1 summarizing how severe this issue is in the dataset overall (e.g. the average severity across all examples in the dataset, or the count of examples where is_issue=True).
- Other interesting info about the issue and examples in the dataset, and statistics estimated from the current dataset that may be reused to score this issue in future data. For example, info for label issues could contain the confident_thresholds, confident_joint, predicted label for each example, etc. Another example is (near-)duplicate detection, where info could contain which sets of examples in the dataset are all (nearly) identical.
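The per-example and dataset-level quantities described above can be illustrated with a minimal sketch. The scores and the threshold below are made-up values chosen for illustration; they are assumptions, not cleanlab defaults.

```python
from statistics import mean

# Hypothetical per-example severity scores in [0, 1]; lower means more severe.
scores = [0.05, 0.92, 0.40, 0.10, 0.88]

# Assumed threshold, fixed a priori for illustration only.
THRESHOLD = 0.25

# Flag an example when its score falls below the threshold.
is_issue = [s < THRESHOLD for s in scores]

# Two possible dataset-level summaries of the kind mentioned above.
average_score = mean(scores)
num_flagged = sum(is_issue)
```

Thresholding is only one way to derive is_issue; a manager may use a different rule entirely, as noted for label issues.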
Implementing a new IssueManager:
- Define the issue_name class attribute, e.g. “label”, “duplicate”, “outlier”, etc.
- Implement the abstract methods find_issues and collect_info.
  - find_issues is responsible for computing the issues and summary dataframes.
  - collect_info is responsible for computing the info dict. It is called by find_issues, once the manager has set the issues and summary dataframes as instance attributes.
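The steps above might look roughly like the following sketch. To keep it self-contained, it uses a hypothetical stand-in base class (SketchIssueManager) rather than importing cleanlab, and it omits the datalab constructor argument. The "short_text" issue, its scoring rule, and the column names (patterned on the is_X_issue / X_score examples in the report documentation) are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, ClassVar, Dict

import pandas as pd


class SketchIssueManager(ABC):
    """Hypothetical stand-in for the IssueManager base class (not cleanlab's code)."""

    issue_name: ClassVar[str]

    def __init__(self) -> None:
        self.issues = pd.DataFrame()
        self.summary = pd.DataFrame()
        self.info: Dict[str, Any] = {}

    @abstractmethod
    def find_issues(self, *args, **kwargs) -> None: ...

    @abstractmethod
    def collect_info(self, *args, **kwargs) -> dict: ...


class ShortTextIssueManager(SketchIssueManager):
    """Toy manager that flags very short text examples (illustrative only)."""

    issue_name = "short_text"

    def find_issues(self, texts, max_len=20, threshold=0.25) -> None:
        # Score in [0, 1]: shorter text gets a lower (more severe) score.
        scores = [min(len(t), max_len) / max_len for t in texts]
        self.issues = pd.DataFrame(
            {
                f"is_{self.issue_name}_issue": [s < threshold for s in scores],
                f"{self.issue_name}_score": scores,
            }
        )
        self.summary = pd.DataFrame(
            {"issue_type": [self.issue_name], "score": [sum(scores) / len(scores)]}
        )
        # collect_info runs only after issues and summary have been set.
        self.info = self.collect_info(threshold=threshold)

    def collect_info(self, threshold) -> dict:
        flagged = self.issues[f"is_{self.issue_name}_issue"]
        return {"threshold": threshold, "num_flagged": int(flagged.sum())}


manager = ShortTextIssueManager()
manager.find_issues(["ok", "a much longer piece of text", "hmm"])
```

The ordering inside find_issues is the point of the sketch: issues and summary are assigned first, so collect_info can read them as instance attributes.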
Attributes:
- description: Short text that summarizes the type of issues handled by this IssueManager.
- issue_name: Returns a key that is used to store issue summary results about the assigned Lab.
- issue_score_key: Returns a key that is used to store issue score results about the assigned Lab.
- verbosity_levels: A dictionary of verbosity levels and their corresponding lists of report items to print.
Methods:
- find_issues(*args, **kwargs): Finds occurrences of this particular issue in the dataset.
- collect_info(*args, **kwargs): Collects data for the info attribute of the Datalab.
- make_summary(score): Construct a summary dataframe.
- report(issues, summary, info[, ...]): Compose a report of the issues found by this IssueManager.
- description: ClassVar[str]
Short text that summarizes the type of issues handled by this IssueManager.
- issue_name: ClassVar[str]
Returns a key that is used to store issue summary results about the assigned Lab.
- issue_score_key: ClassVar[str]
Returns a key that is used to store issue score results about the assigned Lab.
- verbosity_levels: ClassVar[Dict[int, List[str]]]
A dictionary of verbosity levels and their corresponding lists of report items to print.
Example
>>> verbosity_levels = {
...     0: [],
...     1: ["some_info_key"],
...     2: ["additional_info_key"],
... }
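One plausible way to consume such a mapping is sketched below. It assumes the levels are cumulative, so a report at verbosity 2 also prints the items listed for levels 0 and 1. That behaviour is an assumption, and keys_to_report is a hypothetical helper, not part of the IssueManager API.

```python
from typing import Dict, List

# Mapping in the same shape as the example above.
verbosity_levels: Dict[int, List[str]] = {
    0: [],
    1: ["some_info_key"],
    2: ["additional_info_key"],
}


def keys_to_report(levels: Dict[int, List[str]], verbosity: int) -> List[str]:
    """Collect info keys for every level up to and including `verbosity`.

    Assumption: levels are cumulative. This helper is hypothetical.
    """
    selected: List[str] = []
    for level in sorted(levels):
        if level <= verbosity:
            selected.extend(levels[level])
    return selected
```

Under that assumption, keys_to_report(verbosity_levels, 2) yields both keys, while verbosity 0 yields none.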
- info: Dict[str, Any]
- issues: DataFrame
- summary: DataFrame
- abstract find_issues(*args, **kwargs)
Finds occurrences of this particular issue in the dataset.
Computes the issues and summary dataframes. Calls collect_info to compute the info dict.
Return type: None
- collect_info(*args, **kwargs)
Collects data for the info attribute of the Datalab.
Note
This method is called by find_issues() after find_issues() has set the issues and summary dataframes as instance attributes.
Return type: dict
- classmethod make_summary(score)
Construct a summary dataframe.
Parameters:
- score (float) – The overall score for this issue.
Return type: DataFrame
Returns: summary – A summary dataframe.
- classmethod report(issues, summary, info, num_examples=5, verbosity=0, include_description=False, info_to_omit=None)
Compose a report of the issues found by this IssueManager.
Parameters:
- issues (DataFrame) – An issues dataframe.
  Example
  >>> import pandas as pd
  >>> issues = pd.DataFrame(
  ...     {
  ...         "is_X_issue": [True, False, True],
  ...         "X_score": [0.2, 0.9, 0.4],
  ...     },
  ... )
- summary (DataFrame) – The summary dataframe.
  Example
  >>> summary = pd.DataFrame(
  ...     {
  ...         "issue_type": ["X"],
  ...         "score": [0.5],
  ...     },
  ... )
- info (Dict[str, Any]) – The info dict.
  Example
  >>> info = {
  ...     "A": "val_A",
  ...     "B": ["val_B1", "val_B2"],
  ... }
- num_examples (int) – The number of examples to print.
- verbosity (int) – The verbosity level of the report.
- include_description (bool) – Whether to include a description of the issue in the report.
Return type: str
Returns: report_str – A string containing the report.
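To make the shape of the three inputs concrete, here is a minimal sketch of how a report string could be assembled from them. sketch_report is a hypothetical function written for illustration; the actual layout and wording produced by report are not shown in this page and are not reproduced here.

```python
from typing import Any, Dict

import pandas as pd


def sketch_report(
    issue_name: str,
    issues: pd.DataFrame,
    summary: pd.DataFrame,
    info: Dict[str, Any],
    num_examples: int = 5,
) -> str:
    """Hypothetical report builder; not cleanlab's implementation."""
    score_col = f"{issue_name}_score"
    flag_col = f"is_{issue_name}_issue"
    overall = summary.loc[summary["issue_type"] == issue_name, "score"].iloc[0]
    # Lower scores are more severe, so sort ascending and keep the top rows.
    worst = issues.sort_values(score_col).head(num_examples)
    lines = [
        f"Issue type: {issue_name}",
        f"Overall score: {overall:.2f}",
        f"Flagged examples: {int(issues[flag_col].sum())}",
        f"Most severe example indices: {worst.index.tolist()}",
        f"Info keys: {sorted(info)}",
    ]
    return "\n".join(lines)


# Inputs taken from the parameter examples above.
issues = pd.DataFrame(
    {"is_X_issue": [True, False, True], "X_score": [0.2, 0.9, 0.4]}
)
summary = pd.DataFrame({"issue_type": ["X"], "score": [0.5]})
info = {"A": "val_A", "B": ["val_B1", "val_B2"]}
report_str = sketch_report("X", issues, summary, info, num_examples=2)
```

Sorting ascending follows from the convention stated earlier that scores near 0 mark the most severe examples.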