Creating Your Own Issues Manager#
This guide walks through the process of creating your own IssueManager to detect a custom-defined type of issue alongside the pre-defined issue types in Datalab.
See also

register: You can either use this function at runtime to register a new issue manager:

from cleanlab.datalab.internal.issue_manager_factory import register

register(MyIssueManager)  # Defaults to task="classification"
# register(MyIssueManagerForRegression, task="regression")  # Alternative for regression tasks
or add as a decorator to the class definition (currently only works for classification tasks):
@register
class MyIssueManager(IssueManager):
    ...
Prerequisites#
As a starting point for this guide, we'll import the required modules and create a dummy dataset.
Note
Using Datalab requires additional dependencies beyond the rest of the cleanlab
package. To install them, run:
$ pip install "cleanlab[datalab]"
For the development version of the package, install from source:
$ pip install "git+https://github.com/cleanlab/cleanlab.git#egg=cleanlab[datalab]"
import numpy as np
import pandas as pd
from cleanlab import IssueManager
# Create a dummy dataset
N = 20
data = pd.DataFrame(
    {
        "text": [f"example {i}" for i in range(N)],
        "label": np.random.randint(0, 2, N),
    },
)
Implementing IssueManagers#
Basic Issue Check#
To create a basic issue manager, inherit from the IssueManager class, assign a name to the class via the issue_name class-variable, and implement the find_issues method.

The find_issues method should mark each example in the dataset as an issue or not with a boolean array. It should also provide a score for each example that quantifies the quality of the example with regard to the issue.
class Basic(IssueManager):
    # Assign a name to the issue
    issue_name = "basic"

    def find_issues(self, **kwargs) -> None:
        # Compute scores for each example
        scores = np.random.rand(len(self.datalab.data))
        # Construct a dataframe where examples are marked for issues
        # and the score for each example is included.
        self.issues = pd.DataFrame(
            {
                f"is_{self.issue_name}_issue": scores < 0.1,
                self.issue_score_key: scores,
            },
        )
        # Score the dataset as a whole based on this issue type
        self.summary = self.make_summary(score=scores.mean())
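Note that self.issue_score_key is inherited from the base class rather than something you define: it names the per-example score column and, as the report output later in this guide shows, it resolves to "basic_score" here, i.e. it is derived from issue_name. A quick way to confirm this in your installed cleanlab version:

print(Basic.issue_score_key)  # expected output: "basic_score"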
Intermediate Issue Check#
To create an intermediate issue manager:

1. Perform the same steps as in the basic issue check section.

2. Populate the info attribute with a dictionary of information about the identified issues. This information can be included in a report generated by Datalab if you add any of its keys to the verbosity_levels class-attribute.

3. Optionally, add a description of the type of issue this issue manager handles to the description class-attribute.
class Intermediate(IssueManager):
    issue_name = "intermediate"
    # Add a dictionary of information to include in the report
    verbosity_levels = {
        0: [],
        1: ["std"],
        2: ["raw_scores"],
    }
    # Add a description of the issue
    description = "Intermediate issues are a bit more involved than basic issues."

    def find_issues(self, *, intermediate_arg: int, **kwargs) -> None:
        N = len(self.datalab.data)
        raw_scores = np.random.rand(N)
        std = raw_scores.std()
        threshold = min(0, raw_scores.mean() - std)
        sin_filter = np.sin(intermediate_arg * np.arange(N) / N)
        kernel = sin_filter ** 2
        scores = kernel * raw_scores
        self.issues = pd.DataFrame(
            {
                f"is_{self.issue_name}_issue": scores < threshold,
                self.issue_score_key: scores,
            },
        )
        self.summary = self.make_summary(score=scores.mean())
        # Useful information that will be available in the Datalab instance
        self.info = {
            "std": std,
            "raw_scores": raw_scores,
            "kernel": kernel,
        }
Advanced Issue Check#
Different types of issues can be detected in a dataset. A local issue affects individual data points and can be tracked via the Datalab.issues dataframe (to see which data points exhibit this type of issue). A global issue, by contrast, affects the overall dataset but is not easily attributable to individual data points (it is hard to say that one data point exhibits the issue while another does not). Even for global issues, we recommend trying to assign a per-data-point score (and boolean) if possible; see the Non-IID IssueManager as an example of this. Note that a global issue must have num_issues greater than 0 in its issue_summary, otherwise it won't show up in Datalab.report() by default. A rough sketch of such a check follows.
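The class below is only an illustrative sketch, not part of cleanlab: the name Advanced, the stand-in statistic, and the 0.4/0.5 thresholds are all invented for this example. It relies only on the same IssueManager attributes used above (issues, issue_score_key, summary, info).

class Advanced(IssueManager):
    issue_name = "advanced"
    description = "A dataset-level (global) check that still assigns per-example scores."

    def find_issues(self, **kwargs) -> None:
        N = len(self.datalab.data)
        # Stand-in for a real dataset-level statistic (e.g. a distribution-shift test).
        per_example_stat = np.random.rand(N)
        global_stat = per_example_stat.mean()

        # Even though the issue is global, assign each example a score in [0, 1]
        # so results can still be inspected via Datalab.issues.
        scores = 1.0 - np.abs(per_example_stat - global_stat)

        # Only flag examples if the dataset-level check fails; otherwise num_issues
        # stays 0 and this issue is omitted from Datalab.report() by default.
        dataset_level_failure = global_stat < 0.4  # invented threshold
        is_issue = (scores < 0.5) if dataset_level_failure else np.zeros(N, dtype=bool)

        self.issues = pd.DataFrame(
            {
                f"is_{self.issue_name}_issue": is_issue,
                self.issue_score_key: scores,
            },
        )
        self.summary = self.make_summary(score=scores.mean())
        self.info = {"global_stat": global_stat}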
Use with Datalab#
We can create a Datalab instance and run issue checks with the custom issue managers we created like so:
from cleanlab.datalab.internal.issue_manager_factory import register
from cleanlab import Datalab

# Register the custom issue managers
for issue_manager in [Basic, Intermediate]:
    register(issue_manager)

# Instantiate a Datalab instance
datalab = Datalab(data, label_name="label")

# Run the issue checks
issue_types = {"basic": {}, "intermediate": {"intermediate_arg": 2}}
datalab.find_issues(issue_types=issue_types)

# Print the report
datalab.report(verbosity=0)
The report will look something like this:
Here is a summary of the different kinds of issues found in the data:
  issue_type     score  num_issues
       basic  0.477762           2
intermediate  0.286455           0
(Note: A lower score indicates a more severe issue across all examples in the dataset.)
------------------------------------------- basic issues -------------------------------------------
Number of examples with this issue: 2
Overall dataset quality in terms of this issue: 0.4778
Examples representing most severe instances of this issue:
    is_basic_issue  basic_score
13            True     0.003042
8             True     0.058117
11           False     0.121908
15           False     0.169312
17           False     0.229044
--------------------------------------- intermediate issues ----------------------------------------
About this issue:
Intermediate issues are a bit more involved than basic issues.
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 0.2865
Examples representing most severe instances of this issue:
    is_intermediate_issue  intermediate_score    kernel
0                   False            0.000000  0.000000
1                   False            0.007059  0.009967
3                   False            0.010995  0.087332
2                   False            0.016296  0.039470
11                  False            0.019459  0.794251
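Beyond the report, the results can also be inspected programmatically. Here is a brief sketch, assuming the Datalab.get_issues and Datalab.get_info accessors available in recent cleanlab versions (note that higher report verbosity levels would additionally include the info entries registered in verbosity_levels above):

# Per-example results for one issue type (boolean flag + score columns)
basic_issues = datalab.get_issues("basic")
print(basic_issues.head())

# Information stored in `self.info` by the Intermediate issue manager
intermediate_info = datalab.get_info("intermediate")
print(intermediate_info["std"])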